Hon*_* Ao 4 c# sql-server unicode encoding collation
我遇到了一个情况下SQL服务器可以存储"索菲亚"和"索菲亚"是两个不同的字符串,但在TSQL比较时,他们是不管逐份使用,即使二进制分页相同:
CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'?????')
SELECT * FROM #r WHERE NAME = N'?????'
sofia
?????
(2 row(s) affected)
IF '?????' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
IF '?????' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
I tried to find out the encode of "?????"
http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp
It said:
// If all else fails, the encoding is probably (though certainly not
// definitely) the user's local codepage! One might present to the user a
// list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
// A full list can be found using Encoding.GetEncodings();
I iterate through all the encoding returned from Encoding.GetEncodings(), none of them match
Looking into the binary I found an interesting fact: “?????” itself is encoded with UTF16, but it can be generated from "SOFIA" UTF16 by filling “1” instead of “0” in the extra byte besides ASCII code (Ex for ‘S’: 83 255 vs 83 0) It is shown as lower case. In C#,
“?????”
[0] 83 byte
[1] 255 byte
[2] 79 byte
[3] 255 byte
[4] 70 byte
[5] 255 byte
[6] 73 byte
[7] 255 byte
[8] 65 byte
[9] 255 byte
"SOFIA"
[0] 83 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 73 byte
[7] 0 byte
[8] 65 byte
[9] 0 byte
"sofia"
[0] 115 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 105 byte
[7] 0 byte
[8] 97 byte
[9] 0 byte
One can create two different directorie/files with name as C:\?????\, C:\sofia\ or ?????.txt, sofia.txt.
Why does the SQL engine think they are the same while storing them with the original streams?
In order to get just the exact I want I had to convert to binary first:
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'?????')
?????
(1 row(s) affected)
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
Run Code Online (Sandbox Code Playgroud)
但这有很多副作用,比如文化和案例.如何教 TSQL引擎在不花费太多成本的情况下知道它们是不同的?
是否有这种字符串编码的官方名称?
这里有两个问题.
第一:有整理问题.排序定义字符的排序和相等性.正如@Kazetsukai所建议的那样,这里提供的特定校对属性是宽度敏感度.但是,您不能简单地添加_WS到任何排序规则名称并假设它将是有效的排序规则.事实上,SQL_Latin1_General_CP1_CI_AS_WS这不是一个有效的整理.
您可以通过一组有限的排序规则获得SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';.该查询的结果表明您可能想要的排序规则Latin1_General_CI_AS_WS.结束的任何排序都_BIN2可以工作(尝试不使用结束的排序,_BIN因为已经弃用的排序,就像开始的排序一样SQL_).
但是,出于某种原因,即使使用那些似乎也不起作用:
IF '?????' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF '?????' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
Run Code Online (Sandbox Code Playgroud)
两者的结果是"值是相同的".这带来了:
第二:使用NVARCHAR1个数据时,必须使用大写字母为字符串文字加前缀N,否则它首先将字符隐式转换为各自的VARCHAR2个字符(?如果Unicode代码点和存在的字符之间没有定义映射,则转换为字符在字段或操作的排序规则指定的代码页中.
IF N'?????' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF N'?????' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
Run Code Online (Sandbox Code Playgroud)
前缀这些文字值N允许预期的行为和两个查询的结果现在是"值是不同的".
1的XML和N-prefixed类型存储数据为UTF-16小字节序.默认处理只是UCS-2/Base多语言平面(BMP)字符.但是,如果使用结束的排序规则_SC,则可以使用补充字符正确处理完整的UTF-16.
2的CHAR,VARCHAR和TEXT(但不使用这最后一个,因为它是不建议使用)类型是8位ASCII与代码页扩展.