Find non-ASCII characters in varchar columns using SQL Server

Ger*_*iss 52 t-sql sql-server sql-server-2005 non-ascii-characters

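How can I use SQL Server to return rows that contain non-ASCII characters?
It would be great if you could show how to do it for a single column.

I am currently doing something like this, but it does not work: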

select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^!-~ ]%'

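For extra credit, it would be outstanding if the solution could span all varchar columns in a table! In that case it would be nice to return three columns:

  • The identity field for the record. (This allows the whole record to be reviewed with another query.)
  • The column name
  • The text containing the invalid character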
 Id | FieldName | InvalidText       |
----+-----------+-------------------+
 25 | LastName  | Solís             |
 56 | FirstName | François          |
100 | Address1  | 123 Ümlaut street |

An invalid character is any character outside the range SPACE (32 decimal) through ~ (126 decimal).

Ger*_*iss 72

Here is a solution for the single-column search using PATINDEX.
It also displays the start position, the invalid character, and its ASCII code.

select line,
  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) as [Position],
  substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1) as [InvalidCharacter],
  ascii(substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1)) as [ASCIICode]
from  staging.APARMRE1
where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) >0

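  • Gerhard is supplying a regular expression to the PATINDEX function. The regular expression is [^ !-~]. I'm not sure why he included the exclamation mark, since it comes right after the space character numerically. The point is that the expression finds characters that are not in the range space to tilde (32-126). (6 upvotes)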
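
A small sketch of why the binary collation matters, using made-up sample rows (the table variable @Sample and its values are not from the thread). It assumes the database default collation uses code page 1252, so that CHAR(233) is é. Under a binary collation the range !-~ is evaluated by code point, whereas a dictionary-order collation can sort accented letters inside that range, which would explain why the plain LIKE in the question misses rows.

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @Sample table (Id int, Line varchar(50));
INSERT @Sample VALUES (1, 'plain ascii');
INSERT @Sample VALUES (2, 'caf' + CHAR(233));        -- 233 = e-acute, assuming code page 1252
INSERT @Sample VALUES (3, 'tab' + CHAR(9) + 'bed');  -- 9 = TAB, a control character

-- Same pattern as the answer above; the COLLATE clause forces code-point comparison.
SELECT Id,
       Line,
       PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) AS [Position]
FROM   @Sample
WHERE  PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) > 0;
```

Under that assumption, rows 2 and 3 are the ones that should come back, since row 1 contains only characters between space and tilde.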

KM.*_*KM. 22

Try something like this:

DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20))
INSERT @YourTable VALUES (1, 'ok','ok','ok')
INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok')
INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok')
INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD')
INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD')
INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182))

--if you have a Numbers table use that, otherwise make one using a CTE
;WITH AllNumbers AS
(   SELECT 1 AS Number
    UNION ALL
    SELECT Number+1
        FROM AllNumbers
        WHERE Number<1000
)
SELECT 
    pk, 'Col1' BadValueColumn, CONVERT(varchar(20),col1) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
    WHERE ASCII(SUBSTRING(y.col1, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col1, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col2' BadValueColumn, CONVERT(varchar(20),col2) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
    WHERE ASCII(SUBSTRING(y.col2, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col2, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col3' BadValueColumn, CONVERT(varchar(20),col3) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
    WHERE ASCII(SUBSTRING(y.col3, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col3, n.Number, 1))>127
order by 1
OPTION (MAXRECURSION 1000)

OUTPUT:

pk          BadValueColumn BadValue
----------- -------------- --------------------
2           Col1           BA¶D
3           Col2           ¶BAD
4           Col3           B¶AD
5           Col1           ¶BAD
5           Col3           ¶BAD
6           Col1           BAD¶
6           Col2           B¶AD
6           Col3           BAD¶¶¶

(8 row(s) affected)

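  • The CTE needs "OPTION (MAXRECURSION 1000)" because it builds the row set from 1 to 1000 recursively. The default limit is 100 (I think), and any recursive CTE that nests deeper than the default needs this option set. If you have a Numbers table http://stackoverflow.com/q/1393951/65223, you need neither the CTE nor the "OPTION (MAXRECURSION 1000)" line. (2 upvotes)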
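
As a sketch of the alternative mentioned here, the number list can also be produced without recursion, which avoids the MAXRECURSION hint. This version assumes SQL Server 2005 or later (for ROW_NUMBER and sys.all_objects) and that the cross join of sys.all_objects with itself yields at least 1000 rows, which is normally the case. It is one common approach, not the Numbers table from the linked question.

```sql
-- Non-recursive number list: no OPTION (MAXRECURSION ...) required.
;WITH AllNumbers AS
(   SELECT TOP (1000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Number
    FROM   sys.all_objects a
           CROSS JOIN sys.all_objects b   -- only used as a cheap row source
)
-- Quick sanity check of the generated range.
SELECT MIN(Number) AS FirstNumber,
       MAX(Number) AS LastNumber,
       COUNT(*)    AS HowMany
FROM   AllNumbers;
```

The AllNumbers CTE here could replace the recursive one in the answer above, with the final OPTION line dropped.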

小智 15

I ran this piece of code with success:

declare @UnicodeData table (
     data nvarchar(500)
)
insert into 
    @UnicodeData
values 
    (N'Horse?')
    ,(N'Dog')
    ,(N'Cat')

select
    data
from
    @UnicodeData 
where
    data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))

This works well for known columns.
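
A minimal sketch of what the comparison relies on, with made-up rows (the @Demo table and the choice of NCHAR(9731) are illustrative, not from the answer). Casting nvarchar to varchar replaces characters that the target code page cannot represent, typically with '?', so the cast value no longer equals the original. This assumes a database default collation with a single-byte code page such as 1252.

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @Demo table (data nvarchar(50));
INSERT @Demo VALUES (N'Horse' + NCHAR(9731));  -- U+2603, not representable in code page 1252
INSERT @Demo VALUES (N'Dog');

-- AfterCast shows what the varchar conversion did to each flagged value.
SELECT data,
       CAST(data AS varchar(50)) AS AfterCast
FROM   @Demo
WHERE  data COLLATE Latin1_General_BIN <> CAST(data AS varchar(50));
```

Note that this test flags characters the varchar code page cannot hold, which is not quite the same as "non-ASCII": under code page 1252, a value such as François survives the cast unchanged and would not be reported.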

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

declare 
    @sql    varchar(max)    = ''
    ,@table sysname         = 'mytable' -- enter your table here

;with ColumnData as (
    select
        RowId               = row_number() over (order by c.COLUMN_NAME)
        ,c.COLUMN_NAME
        ,ColumnName         = '[' + c.COLUMN_NAME + ']'
        ,TableName          = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']' 
    from
        INFORMATION_SCHEMA.COLUMNS c
    where
        c.DATA_TYPE         = 'nvarchar'
        and c.TABLE_NAME    = @table
)
select
    @sql = @sql + 'select FieldName = ''' + c.ColumnName + ''',         InvalidCharacter = [' + c.COLUMN_NAME + ']  from ' + c.TableName + ' where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '  +  case when c.RowId <> (select max(RowId) from ColumnData) then  ' union all ' else '' end + char(13)
from
    ColumnData c

-- check
-- print @sql
exec (@sql)

I'm not a fan of dynamic SQL, but it does have its uses for exploratory queries like this.
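
The same idea can be sketched for varchar columns by generating one PATINDEX query per column, which produces the Id / FieldName / InvalidText shape asked for in the question. Everything named here is a placeholder: @schema, @table and @idColumn must be set to real names, and the single-column identity key is an assumption. Like the script above, it relies on variable concatenation across rows, which is widely used but not formally guaranteed.

```sql
DECLARE @schema   sysname;
DECLARE @table    sysname;
DECLARE @idColumn sysname;
DECLARE @sql      nvarchar(max);

SET @schema   = N'dbo';      -- hypothetical schema name
SET @table    = N'mytable';  -- hypothetical table name
SET @idColumn = N'Id';       -- hypothetical identity column
SET @sql      = N'';

-- Build one SELECT per varchar column and glue them together with UNION ALL.
SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
    + N'SELECT Id = ' + QUOTENAME(@idColumn)
    + N', FieldName = ' + QUOTENAME(c.COLUMN_NAME, '''')
    + N', InvalidText = CAST(' + QUOTENAME(c.COLUMN_NAME) + N' AS varchar(max))'
    + N' FROM ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
    + N' WHERE PATINDEX(''%[^ !-~]%'' COLLATE Latin1_General_BIN, '
    + QUOTENAME(c.COLUMN_NAME) + N') > 0'
FROM  INFORMATION_SCHEMA.COLUMNS c
WHERE c.DATA_TYPE    = 'varchar'
  AND c.TABLE_SCHEMA = @schema
  AND c.TABLE_NAME   = @table;

PRINT @sql;                  -- inspect the generated statement first
IF @sql <> N''
    EXEC sp_executesql @sql;
```

If the varchar columns use different collations, the UNION ALL may need an explicit COLLATE on InvalidText to avoid a collation conflict.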


And*_*mar 13

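This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127, and then searches for rows containing a character that is not in that list: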

declare @str varchar(128)
declare @i int
set @str = ''
set @i = 32
while @i <= 127
    begin
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

select  col1
from    YourTable
where   col1 like '%[^' + @str + ']%' escape '|'

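  • This works with one small change: varchar(128) needs to be larger, because two characters are stored for each code point (the escape character plus the character itself). I made it varchar(200). It did take a while to run on my database. I'm also surprised that a range can't be used to simplify this, i.e. like '%[^| -|~]%' escape '|'. I tried to get a range to work, but it did not return the correct information. (2 upvotes)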

And*_*ill 6

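I ran the various solutions on some real-world data: 12M rows of varchar with length ~30, around 9k dodgy rows, and no full-text indexing in play. The patIndex solution is the fastest, and it also selects the most rows.

(I pre-ran the km. solution to bring the cache into a known state, ran the three processes, then ran km. once more; the last two km. runs gave times within 2 seconds of each other.)

Gerhard Weiss's patindex solution: run time 0:38, returns 9144 rows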

select dodgyColumn from myTable fcc
WHERE  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,dodgyColumn ) >0

MT.'s substring-numbers solution: run time 1:16, returned 8996 rows

select dodgyColumn from myTable fcc
INNER JOIN dbo.Numbers32k dn ON dn.number<(len(fcc.dodgyColumn ))
WHERE ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))<32 
    OR ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))>127

Deon Robertson's udf solution: run time 3:47, returns 7316 rows

select dodgyColumn 
from myTable 
where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn , 1) = 1