Why does T-SQL treat "ｓｏｆｉａ" as the same as "sofia"? What kind of string encoding is that?

Hon*_* Ao 4 c# sql-server unicode encoding collation

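I ran into a case where SQL Server can store "ｓｏｆｉａ" and "sofia" as two different strings, but when they are compared in T-SQL they are treated as the same no matter which collation is used, even a binary collation: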

CREATE TABLE #r (NAME NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #r VALUES (N'sofia')
INSERT INTO #r VALUES (N'ｓｏｆｉａ')

SELECT * FROM #r WHERE NAME = N'ｓｏｆｉａ'

sofia
ｓｏｆｉａ

(2 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

I tried to find out the encoding of "ｓｏｆｉａ":

http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp

It said:

            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user's local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();

I iterated through all the encodings returned by Encoding.GetEncodings(); none of them matched.

Looking at the binary, I found an interesting fact: "ｓｏｆｉａ" is itself encoded as UTF-16, but its bytes can be generated from the UTF-16 bytes of "SOFIA" by setting the second byte of each character to 255 (0xFF) instead of 0 (for example, for 'S': 83 255 vs. 83 0). Even so, it is displayed as lower case. In C#:

"ｓｏｆｉａ"

                             [0]         83          byte                                    
                             [1]         255        byte
                             [2]         79          byte
                             [3]         255        byte
                             [4]         70          byte
                             [5]         255        byte
                             [6]         73          byte
                             [7]         255        byte
                             [8]         65          byte
                             [9]         255        byte

"SOFIA"

                             [0]         83          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         73          byte
                             [7]         0        byte
                             [8]         65          byte
                             [9]         0        byte

"sofia"

                             [0]         115          byte                                    
                             [1]         0        byte
                             [2]         111          byte
                             [3]         0        byte
                             [4]         102          byte
                             [5]         0        byte
                             [6]         105          byte
                             [7]         0        byte
                             [8]         97          byte
                             [9]         0        byte

One can create two different directories or files with these names: C:\ｓｏｆｉａ\ and C:\sofia\, or ｓｏｆｉａ.txt and sofia.txt.

Why does the SQL engine think they are the same while storing them with the original streams?

To get exactly the row I want, I had to convert to binary first:

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')

ｓｏｆｉａ

(1 row(s) affected)

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

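But this has many side effects, for example with culture and case. How can the T-SQL engine tell that they are different without too much cost?

Is there an official name for this kind of string encoding?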

Sol*_*zky 6

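There are two issues here.

First: there is the matter of collation. A collation defines how characters sort and when they compare as equal. As @Kazetsukai suggested, the particular collation property at play here is width sensitivity. However, you cannot simply append _WS to any collation name and assume the result will be a valid collation. In fact, SQL_Latin1_General_CP1_CI_AS_WS is not a valid collation.

You can get a limited set of collations via SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';. The results of that query suggest that the collation you probably want is Latin1_General_CI_AS_WS. Any collation ending in _BIN2 will also work (try not to use the ones ending in _BIN, as those are deprecated, just like the ones starting with SQL_).

However, for some reason, even using those does not seem to work: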
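As a quick check before relying on a collation name, the candidates mentioned above can be looked up in fn_helpcollations(). This is only a sketch: the candidate list and the aliases (c, h, candidate, status) are illustrative, and the VALUES table constructor assumes SQL Server 2008 or later.

```sql
-- A name with no matching row in fn_helpcollations() is not a usable collation.
SELECT c.candidate,
       CASE WHEN h.[name] IS NULL THEN 'not a valid collation'
            ELSE 'valid'
       END AS status
FROM (VALUES (N'SQL_Latin1_General_CP1_CI_AS_WS'),
             (N'Latin1_General_CI_AS_WS'),
             (N'Latin1_General_BIN2')) AS c(candidate)
LEFT JOIN sys.fn_helpcollations() AS h
       ON h.[name] = c.candidate;
```

The first name should come back as not valid, which is why _WS cannot simply be appended to the SQL_ collation from the question.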

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

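The result of both is "Values are the same". Which brings us to:

Second: when working with NVARCHAR [1] data, you must prefix string literals with an upper-case N. Otherwise the characters are first implicitly converted to their VARCHAR [2] equivalents (or to ? if no mapping is defined between the Unicode code point and a character that exists in the code page specified by the collation of the field or operation).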
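One way to see that conversion is to look at the bytes of the same literal with and without the prefix. This is a sketch, assuming a database whose default collation uses code page 1252 (as the SQL_Latin1_General_CP1 collations do); the column aliases are made up for illustration.

```sql
-- Without N the literal is VARCHAR and is converted to the database's code page
-- before anything else happens; with N it stays UTF-16 little-endian.
SELECT CONVERT(VARBINARY(20), 'ｓｏｆｉａ')  AS without_n_prefix,
       CONVERT(VARBINARY(20), N'ｓｏｆｉａ') AS with_n_prefix;  -- 0x53FF4FFF46FF49FF41FF
```

With the prefix, the bytes are the same 83 255, 79 255, ... sequence shown in the question. Without it, on code page 1252 the fullwidth letters should come back as the plain single-byte letters (0x736F666961) through best-fit mapping, which would explain why even a binary collation reported the two unprefixed literals as equal.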

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

Prefixing these literal values with N gives the expected behavior, and the result of both queries is now "Values are different".
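The same idea can be applied to the table lookup from the question instead of converting to VARBINARY. The following is a minimal sketch: #names is a hypothetical stand-in for the question's #r table, with an extra upper-case row added to show the difference between the two collations.

```sql
-- #names is a hypothetical stand-in for the question's #r table.
CREATE TABLE #names (NAME NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS);
INSERT INTO #names VALUES (N'sofia');
INSERT INTO #names VALUES (N'SOFIA');
INSERT INTO #names VALUES (N'ｓｏｆｉａ');

-- Width-sensitive but still case-insensitive: the fullwidth row should match,
-- while the ordinary-width 'sofia' and 'SOFIA' rows should not.
SELECT NAME FROM #names
WHERE NAME = N'ｓｏｆｉａ' COLLATE Latin1_General_CI_AS_WS;

-- Binary: sensitive to everything, so only the exact 'sofia' row should match.
SELECT NAME FROM #names
WHERE NAME = N'sofia' COLLATE Latin1_General_BIN2;

DROP TABLE #names;
```

Latin1_General_CI_AS_WS keeps the case-insensitive behavior while separating the widths, which avoids the case side effect of a binary comparison. Overriding the collation inside the predicate may keep an index on the column from being used for a seek, so if this comparison is routine, declaring the column itself with the width-sensitive collation is likely the cheaper choice.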


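[1] XML and the N-prefixed types store data as UTF-16 Little Endian. By default, only UCS-2 / Basic Multilingual Plane (BMP) characters are handled. However, if you use a collation ending in _SC, the full UTF-16 range, including supplementary characters, is handled correctly.

[2] The CHAR, VARCHAR, and TEXT types (but do not use that last one, as it is deprecated) are 8-bit ASCII extended with code pages.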
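The effect of an _SC collation can be sketched with a single supplementary character. This assumes SQL Server 2012 or later, where the _SC collations are available. The variable name and the choice of U+1F600 are illustrative only.

```sql
-- U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair
-- (code units 0xD83D and 0xDE00, written here in decimal).
DECLARE @s NVARCHAR(10) = NCHAR(55357) + NCHAR(56832);

SELECT LEN(@s COLLATE Latin1_General_100_CI_AS)    AS len_without_sc,  -- counts code units: 2
       LEN(@s COLLATE Latin1_General_100_CI_AS_SC) AS len_with_sc,     -- counts characters: 1
       DATALENGTH(@s)                              AS byte_count;      -- 4 bytes either way
```

The stored bytes are identical in both cases. Only the way the built-in functions interpret the pair changes.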