How do I perform Unicode-aware character comparison?

Adr*_*tti 2 .net c# unicode

My application has an international audience: people from many countries will use it and they will enter text in their own languages (text that I then have to process).

For example, if I have to list the differences between two strings using character comparison, is this simple C# code enough, or am I missing something?

var differences = new List<Tuple<int, char, char>>();
for (int i = 0; i < myString1.Length; ++i)
{
    if (myString1[i] != myString2[i])
        differences.Add(new Tuple<int, char, char>(i, myString1[i], myString2[i]));
}

Is the given code valid to perform this task across different languages (my users are not limited to the US character set)?

Adr*_*tti 28

Encoding

Unicode defines a list of characters (letters, numbers, analphabetic symbols, control codes and so on) but their representation (in bytes) is defined by an encoding. The most common Unicode encodings nowadays are UTF-8, UTF-16 and UTF-32. UTF-16 is the one usually associated with Unicode because it is the encoding chosen for Unicode support in Windows, Java, the .NET environment, and the C and C++ languages (on Windows). Note that it is not the only one; during your life you will certainly also meet UTF-8 text (especially from the web and on Linux file systems) and UTF-32 (outside the Windows world). Two very introductory must-read articles: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and UTF-8 Everywhere (a manifesto). IMO especially the second link (whatever your opinion about UTF-8 vs UTF-16) is quite enlightening.

Let me quote Wikipedia:

Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135).

To see the problem just start with some simple math: Unicode defines roughly 110,000 code points (note that not all of them are graphemes). The "Unicode character type" in C, C++, C#, VB.NET, Java and many other languages of the Windows environment (with the notable exception of VBScript on old ASP classic pages) is UTF-16 encoded and is therefore two bytes (the type name here is intuitive but completely misleading, because it is a code unit, not a character or a code point).
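The same distinction is easy to demonstrate in Java (whose char is likewise a UTF-16 code unit); a minimal sketch, with class and variable names of my own choosing:

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so in UTF-16 it needs a surrogate pair
        String clef = "\uD834\uDD1E";

        // length() counts UTF-16 code units...
        System.out.println(clef.length());                         // 2
        // ...while codePointCount() counts Unicode code points
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        // Each char alone is just a surrogate, not a character
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```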

Please check this distinction, because it is fundamental: a code unit is logically different from a character, and even if sometimes they coincide, they are not the same thing. How does this affect your programming life? Imagine you have this C# code and your specifications (written by someone who thinks in terms of true characters) say "the password must be at least 4 characters long":

bool IsValidPassword(string text) {
    return text.Length >= 4;
}

That code is ugly, wrong and broken. The Length property returns the number of code units in the text string variable, and by now you know they are different things. Your code will validate n̄ō as a valid password (even though it is made of two characters and four code points, which here, as almost always, coincide with the code units). Now try to imagine this applied across all the layers of your application: a UTF-8 encoded database field naively validated with the previous code (where the input is UTF-16); the errors will add up and your Polish friend Świętosław Koźmicki won't be happy about it. Now imagine you have to validate your users' first names with the same technique and your users are Chinese (don't worry: if you don't care about them, they won't be your users for long). Another example: this simple C# algorithm to count the distinct characters in a string will fail for the same reason:
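The broken check and a code-point-aware fix can be sketched in Java as follows (method names are mine; as the rest of this answer shows, counting grapheme clusters would be even more correct):

```java
public class PasswordLength {
    // Broken: length() counts UTF-16 code units, like Length in the C# snippet
    static boolean isValidPasswordBroken(String text) {
        return text.length() >= 4;
    }

    // Better: count code points instead of code units
    static boolean isValidPassword(String text) {
        return text.codePointCount(0, text.length()) >= 4;
    }

    public static void main(String[] args) {
        // Two astral characters: 2 code points but 4 code units
        String twoChars = "\uD840\uDC11\uD840\uDC11";
        System.out.println(isValidPasswordBroken(twoChars)); // true (wrong!)
        System.out.println(isValidPassword(twoChars));       // false
    }
}
```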

myString.Distinct().Count()

If the user enters the Han character 𠀑 then your code will erroneously return... 2, because its UTF-16 representation is 0xD840 0xDC11 (BTW each of them, alone, is not a valid Unicode character, because they are the high and the low surrogate, respectively). The reasons are explained in more detail in this post, which also offers a working solution, so here I just repeat the essential code:

StringInfo.GetTextElementEnumerator(text)
    .AsEnumerable<string>()
    .Distinct()
    .Count();

This is roughly equivalent to codePointCount() in Java for counting the code points in a string. We need AsEnumerable<T>() because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable; a trivial implementation is described in the post about splitting a string into chunks of the same length.
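For comparison, the same idea can be sketched in Java with BreakIterator, which iterates user-perceived characters (grapheme clusters) much like StringInfo.GetTextElementEnumerator() does in .NET (the helper name is mine):

```java
import java.text.BreakIterator;
import java.util.HashSet;
import java.util.Set;

public class DistinctGraphemes {
    // Counts distinct user-perceived characters (grapheme clusters)
    static int distinctGraphemes(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        Set<String> seen = new HashSet<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            seen.add(text.substring(start, end));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // The surrogate pair stays together: one distinct grapheme, not two
        System.out.println(distinctGraphemes("\uD840\uDC11\uD840\uDC11")); // 1
        // Base letter + combining macron count as one grapheme each
        System.out.println(distinctGraphemes("n\u0304o\u0304"));           // 2
    }
}
```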

Is this stuff only related to string length? Of course not: if you handle keyboard input char by char, you may need to fix your code. See, for example, this question about Korean characters handled in a KeyUp event.

Unrelated, but IMO helpful for understanding: this C code (taken from this post) works with char (ASCII/ANSI or UTF-8) but it will fail if directly converted to use wchar_t:

wchar_t* pValue = wcsrchr(wcschr(pExpression, L'|'), L':') + 1;

Note that modern C++ has a new, larger set of classes to deal with encodings and clearer type aliases: char16_t and char32_t (added in C++11) and char8_t (added in C++20), for UTF-16, UTF-32 and UTF-8 encoded characters respectively. Note that you also get std::u8string, std::u16string and std::u32string. Note that even though length() (and its size() alias) will still return the number of code units, you can easily perform encoding conversions with the codecvt() template facilities, and by using these types you will, IMO, make your code clearer and more explicit (it won't be surprising that size() of a u16string returns the number of char16_t elements). For more details about counting characters in C++, check this post. In C it is pretty easy to work with char and the UTF-8 encoding: this post is IMO a must-read.

Cultural Differences

Not all languages are similar; they do not even share some basic concepts. For example, our current idea of a grapheme may be quite far from our idea of a character. Let me explain with an example: in Korean Hangul, letters are combined into a single syllable (both letters and syllables are characters, merely represented differently when alone and when combined into words with other letters). The word 국 (guk) is one syllable composed of three letters: ㄱ, ㅜ and ㄱ (the first and last letters are the same, but they are pronounced differently at the beginning and at the end of a word, which is why they are transliterated g and k).

Syllables let us introduce another concept: precomposed and decomposed sequences. The Korean syllable 한 can be represented as a single character (U+D55C) or as the decomposed sequence of the letters ᄒ, ᅡ and ᆫ. If, for example, you are reading a text file you may encounter both (and users may enter both sequences in your input boxes), but they must compare equal. Note that if you type those letters in sequence they will always be displayed as a single syllable (copy and paste the individual characters - without spaces - and try), but the final form (precomposed or decomposed) depends on your IME.
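Both forms can be made comparable through Unicode normalization; a sketch in Java with java.text.Normalizer (the analogous .NET API is String.Normalize()):

```java
import java.text.Normalizer;

public class HangulNormalization {
    public static void main(String[] args) {
        String precomposed = "\uD55C";             // the syllable 한 as one code point
        String decomposed  = "\u1112\u1161\u11AB"; // the letters ᄒ + ᅡ + ᆫ

        // Naive comparison fails even though users see the same syllable
        System.out.println(precomposed.equals(decomposed)); // false

        // Normalizing both to NFC (composed form) makes them equal
        String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```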

In Czech, "ch" is a digraph and it is treated as a single letter. It has its own collation rules (it sorts between H and I): with Czech sorting, fyzika comes before chemie! If you count characters and tell your users that the word chechtal is made of 8 characters, they will think your software is bugged and your support for their language is limited to a bunch of translated resources. Let's add exceptions: in puchoblík (and a few other words) C and H are not a digraph; they are separate letters. Note that there are other cases too, such as "dž" in Slovak and elsewhere, which counts as a single character even though it uses two/three UTF-16 code points! The same happens in many other languages (for example ll in Catalan). Real languages have more exceptions and special cases than PHP!
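Locale-aware collation handles cases like the Czech ch; a sketch with java.text.Collator, assuming the runtime's collation data includes the Czech tailoring (modern JDKs ship it):

```java
import java.text.Collator;
import java.util.Locale;

public class CzechCollation {
    public static void main(String[] args) {
        Collator czech = Collator.getInstance(new Locale("cs"));
        Collator english = Collator.getInstance(Locale.ENGLISH);

        // Czech: "ch" is a single letter sorting after "h", so f < ch
        System.out.println(czech.compare("fyzika", "chemie") < 0);   // true
        // English: plain c < f, so the order is reversed
        System.out.println(english.compare("chemie", "fyzika") < 0); // true
    }
}
```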

Note that appearance alone is not enough for equivalence; for example A (U+0041 LATIN CAPITAL LETTER A) is not equivalent to А (U+0410 CYRILLIC CAPITAL LETTER A). Conversely, the characters ٢ (U+0662 ARABIC-INDIC DIGIT TWO) and ۲ (U+06F2 EXTENDED ARABIC-INDIC DIGIT TWO) are visually and conceptually equivalent, but they are different Unicode code points (see also the next paragraph about numbers and synonyms).
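In code (Java here; the code points are the same in .NET, since both use UTF-16 strings), note that not even normalization unifies cross-script look-alikes:

```java
import java.text.Normalizer;

public class Confusables {
    public static void main(String[] args) {
        String latinA    = "\u0041"; // A LATIN CAPITAL LETTER A
        String cyrillicA = "\u0410"; // А CYRILLIC CAPITAL LETTER A

        // Visually identical, but different code points...
        System.out.println(latinA.equals(cyrillicA)); // false
        // ...and normalization does NOT make them equal
        System.out.println(Normalizer.normalize(cyrillicA, Normalizer.Form.NFKC)
                .equals(latinA)); // false
    }
}
```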

Punctuation marks such as ? and ! are sometimes used as letters, for example in the earliest written form of the Haida language. In some languages (such as the earliest written forms of Native American languages), numbers and other symbols were borrowed from the Latin alphabet and used as letters (keep in mind that if you have to handle these languages, you cannot simply separate letters from symbols by their Unicode category); one example is the ǃKung Khoisan language of Africa. In Catalan, when ll is not a digraph, a diacritic (or a middot, U+00B7...) is used to separate the characters, as in cel·les (in this case the character count is 6 and the code units/code points are 7, whereas the hypothetical word celles would be 5 characters).

The same word may be written in more than one form. This may be something you have to care about if, for example, you provide full-text search. The Chinese word 家 (house) can be transliterated as jiā in pinyin, and in Japanese the same word may be written with the same kanji 家, or as いえ in hiragana (and in other ways too), or transliterated in romaji as ie. Is this limited to words? No, it also applies to characters; for numbers it is pretty common: 2 (Arabic numeral in the Roman alphabet), ٢ (in Arabic and Persian) and 二 (in Chinese and Japanese) are exactly the same cardinal number. Let's add some complexity: in Chinese it is also very common to write the same number as 貳 (simplified: 贰). I won't even mention prefixes (micro, nano, kilo and so on). See this post for a real-world example of this issue. And it is not limited to far-eastern languages: the apostrophe ' (U+0027 APOSTROPHE), or better ' (U+2019 RIGHT SINGLE QUOTATION MARK), is often used in Czech and Slovak instead of its superimposed counterpart ʼ (U+02BC MODIFIER LETTER APOSTROPHE): ď and d' are then equivalent (similar to what I said about the middot in Catalan).
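For the digit example, the Unicode character database at least records the shared numeric value; a Java sketch using Character.getNumericValue() (note this helps for decimal digits, not for Han numerals like 二):

```java
public class DigitEquivalence {
    public static void main(String[] args) {
        char ascii = '2';            // DIGIT TWO
        char arabicIndic = '\u0662'; // ٢ ARABIC-INDIC DIGIT TWO
        char extended = '\u06F2';    // ۲ EXTENDED ARABIC-INDIC DIGIT TWO

        // Three different code points...
        System.out.println((int) ascii);       // 50
        System.out.println((int) arabicIndic); // 1634
        // ...but all carry the same numeric value
        System.out.println(Character.getNumericValue(ascii));       // 2
        System.out.println(Character.getNumericValue(arabicIndic)); // 2
        System.out.println(Character.getNumericValue(extended));    // 2
    }
}
```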

Maybe you should properly handle lower-case "ss" in German so that it compares equal to ß (and problems will arise for case-insensitive comparison). A similar issue exists in Turkish if you have to provide non-exact string matching for i and its forms (see the section about Case).

If you work with professional text you may also meet ligatures; even in English: for example, æsthetics is 9 code points but 10 characters! The same applies, for example, to the ethel character œ (U+0153 LATIN SMALL LIGATURE OE, absolutely necessary if you work with French text); hors d'oeuvre is equivalent to hors d'œuvre (and ethel to œthel). Both are (together with the German ß) lexical ligatures, but you may also meet typographical ligatures (such as U+FB00 LATIN SMALL LIGATURE FF), and those have their own part of the Unicode character set (the presentation forms). Nowadays diacritics are much more common even in English (see tchrist's post about people freed from the tyranny of the typewriter; please read Bringhurst's citation carefully). Do you think you (and your users) won't ever type façade, naïve and prêt-à-porter, or the "classy" noöne or coöperation?

Here I don't even mention word counting, because it opens up even more problems: in Korean each word is composed of syllables, but in, for example, Chinese and Japanese, characters are counted as words (unless you want to implement word counting using a dictionary). Now take this Chinese sentence: 是一个示例文本, roughly equivalent to the Japanese sentence これは、サンプルのテキストです. How do you count them? Moreover, if they are transliterated to Shì yīgè shìlì wénběn and Kore wa, sanpuru no tekisutodesu, should they then be matched in a text search?

Speaking of Japanese: full-width Latin characters are different from half-width characters, and if your input is Japanese romaji text you have to handle this, otherwise your users will be astonished when Ｔ won't compare equal to T (in this case, what should be mere glyph variants became separate code points).
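Full-width/half-width variants are compatibility equivalents, so NFKC normalization folds them together; a Java sketch:

```java
import java.text.Normalizer;

public class WidthFolding {
    public static void main(String[] args) {
        String fullWidthT = "\uFF34"; // Ｔ FULLWIDTH LATIN CAPITAL LETTER T

        System.out.println(fullWidthT.equals("T")); // false
        // NFKC applies compatibility decomposition, folding width variants
        System.out.println(Normalizer.normalize(fullWidthT, Normalizer.Form.NFKC)
                .equals("T")); // true
    }
}
```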

OK, is this enough to highlight the problem surface?

Duplicated Characters

Unicode (primarily for ASCII compatibility and other historical reasons) has duplicated characters; before you do a comparison you have to perform normalization, otherwise à (a single code point) won't be equal to à (a plus U+0300 COMBINING GRAVE ACCENT). Is this an uncommon corner case? Not really; also take a look at this real-world example from Jon Skeet. Also (see the section Cultural Differences), precomposed and decomposed sequences introduce duplicates.
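The fix is the same normalization step shown earlier for Hangul; a Java sketch (in .NET you would call String.Normalize()):

```java
import java.text.Normalizer;

public class AccentNormalization {
    public static void main(String[] args) {
        String precomposed = "\u00E0";  // à as a single code point
        String decomposed  = "a\u0300"; // a + COMBINING GRAVE ACCENT

        System.out.println(precomposed.equals(decomposed)); // false
        // After NFC normalization the strings compare equal
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed)); // true
    }
}
```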

Note that diacritics are not the only source of confusion. When the user is typing on their keyboard they will probably enter ' (U+0027 APOSTROPHE), but it is supposed to also match ' (U+2019 RIGHT SINGLE QUOTATION MARK), normally used in typography (the same is true for many, many Unicode symbols that are almost equivalent from the user's point of view but distinct in typography; imagine writing a text search inside digital books).

In short, two strings must be considered equal (this is a very important concept!) if they are canonically equivalent, and they are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed of different Unicode code points.

Case

If you have to perform case-insensitive comparison, you will have even more problems. I assume you are not doing hobbyist case-insensitive comparison using toupper() or an equivalent unless, one example for all, you want to explain to your users why 'i'.ToUpper() != 'I' for the Turkish language (I is not the upper case of i, which is İ; BTW the lower-case letter for I is ı).

Another problem is the eszett ß in German (a ligature of long s + short s, used - in ancient times - also in English, elevated to the dignity of a character). It has an upper-case version ẞ, but (at this moment) the .NET Framework wrongly returns "ẞ" != "ß".ToUpper() (yet its use is mandatory in some scenarios; see also this post). Unfortunately, ss does not always become ẞ (upper case), ss is not always equal to ß (lower case), and sz is sometimes also ẞ in upper case. Confusing, right?
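Both pitfalls can be observed in Java too, where case mappings are locale-sensitive (the details of .NET's ToUpper() differ, but the lesson is the same):

```java
import java.util.Locale;

public class CasePitfalls {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");

        // Turkish: the upper case of i is İ (U+0130), not I
        System.out.println("i".toUpperCase(turkish)); // İ
        // ...and the lower case of I is ı (U+0131), not i
        System.out.println("I".toLowerCase(turkish)); // ı

        // German: ß upper-cases to "SS" (one char in, two chars out!)
        System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // SS
    }
}
```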

Even More

Globalization is not only about text: what about dates and calendars, number formatting and parsing, colors and layout? A book wouldn't be enough to describe all the things you should care about, but what I want to highlight here is that a few localized strings won't make your application ready for an international market.

Even about text alone, more questions arise: how does this apply to regexes? How should spaces be handled? Is an em space equal to an en space? In a professional application, how should "U.S.A." compare with "USA" (in a free-text search)? On the same line of thinking: how do you manage diacritics in comparison?
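One tool for deciding how much difference matters in a comparison is collation strength; a Java sketch with java.text.Collator, where PRIMARY strength ignores case and accent differences:

```java
import java.text.Collator;
import java.util.Locale;

public class LooseComparison {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);

        collator.setStrength(Collator.PRIMARY); // ignore accents and case
        System.out.println(collator.compare("caf\u00E9", "CAFE")); // 0: equal

        collator.setStrength(Collator.TERTIARY); // default: all differences matter
        System.out.println(collator.compare("caf\u00E9", "CAFE") == 0); // false
    }
}
```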

How do you handle text storage? Forget about safely detecting the encoding: to open a file you need to know its encoding (unless, of course, you plan to do as HTML parsers do with <meta charset="UTF-8">, or XML/XHTML with encoding="UTF-8" in <?xml ?>).
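To see why you must know the encoding up front, decode the same bytes with two different charsets; a Java sketch:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMatters {
    public static void main(String[] args) {
        // "à" (U+00E0) encoded as UTF-8
        byte[] bytes = {(byte) 0xC3, (byte) 0xA0};

        // Decoded with the right charset we get the original text back
        System.out.println(new String(bytes, StandardCharsets.UTF_8));      // à
        // Decoded as ISO-8859-1 the same bytes become two other characters
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // Ã + NBSP
    }
}
```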

Historical "Introduction"

What we see as text on our monitors is just a chunk of bytes in computer memory. By convention each value (or group of values, just as an int32_t represents a number) represents a character. How that character is then drawn on screen is delegated to something else (to simplify a little bit, think of a font).

If we arbitrarily decide that each character is represented with one byte, then we have 256 symbols available (just as when we use int8_t, System.SByte or java.lang.Byte for a number, we have a numeric range of 256 values). What we need now is to decide, for each value, which character it represents; an example of this is ASCII (limited to 7 bits, 128 values), with custom extensions to use the upper 128 values as well.

That's done: habemus a character encoding for 256 symbols (including letters, numbers, analphabetic characters and control codes). Yes, each ASCII extension is proprietary, but things are clear and easy to manage. Text processing is so common that we just need to add a proper data type to our favorite languages (char in C - note that formally it is not an alias for unsigned char or signed char but a distinct type; char in Pascal; CHARACTER in FORTRAN; and so on) and a few library functions to manage it.

Unfortunately it's not so easy. ASCII is limited to a very basic character set and it includes only the Latin characters used in the USA (which is why its preferred name should be US-ASCII). It is so limited that even English words with diacritical marks aren't supported (whether this drove the change in the modern language, or vice versa, is another story). You will see it also has other problems (for example its wrong sorting order, with the problems of ordinal versus alphabetic comparison).

How to deal with that? Introduce a new concept: code pages. Keep a fixed set of basic characters (ASCII) and add another 128 characters specific to each language. The value 0x81 will represent the Cyrillic character Б (in DOS code page 866) and the Greek character Ά (in DOS code page 869).

Now serious problems arise: 1) You cannot mix different alphabets in the same text file. 2) To properly understand a text you also have to know which code page it is expressed in. Where is that stored? There is no standard method for it, so you will have to handle this by asking the user or with a reasonable guess (?!). Even nowadays the ZIP file "format" is limited to ASCII for file names (you may use UTF-8 - see later - but it's not standard, because there is no standard ZIP format); this post shows a working Java solution. 3) Even code pages are not standard: each environment has different sets (even DOS code pages and Windows code pages are different) and names vary too. 4) 255 characters are still far too few for, for example, the Chinese or Japanese languages, so more complicated encodings were introduced (Shift JIS, for example).

The situation was terrible at that time (~1985) and a standard was absolutely needed. ISO/IEC 8859 arrived and at least solved point 3 of the previous problem list. Points 1, 2 and 4 were still unsolved and a solution was needed (especially if your target is not just raw text but also special typographic characters). This standard (after many revisions) is still with us today (and it somewhat coincides with the Windows-1252 code page), but you will probably never use it unless you are working with some legacy system.

The standard which emerged to save us from this chaos is known worldwide: Unicode. From Wikipedia:

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. [...] the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts and multiple symbol sets.

Languages, libraries and operating systems have been updated to support Unicode. Now we have all the characters we need, a shared, well-known code for each one, and the past is just a nightmare. Replace char with wchar_t (and accept living with wcout, wstring and friends), or just use System.Char or java.lang.Character, and live happily. Right?

NO. It's never that easy. Unicode's mission is about "...encoding, representation and handling of text..."; it does not translate and adapt different cultures into an abstract code (and it is impossible to do so unless you kill the beauty of the variety of all our languages). Moreover, the encoding itself introduces some (not so obvious?!) things we have to care about.

  • To the @DOWNVOTERS: you should know that votes on the answer and on the question are unrelated (even if both were written by the same author). Feel free to disagree with the question (and leave your opinion on [this meta post](http://meta.stackoverflow.com/q/278206)), but if you also downvote the answer, please leave a short comment explaining your reasons. Thank you very much; it will improve my knowledge and help future readers understand this topic better. (13 upvotes)
  • @Deduplicator yes, this post is pretty much limited to the Windows world (and the cross-platform .NET and Java environments). Even without C and C++ cross-platform compatibility it is complex enough. On Windows it is always UTF-16 unless you're on Windows NT 4 (it has been since Win2K, so that's a very safe assumption). You're right about code point vs code unit; I fixed it where I saw it, where appropriate, tnx! (2 upvotes)

Archived:

Views: 3983 times

Last active: 9 years, 9 months ago