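Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence. The example of compatibility equivalence given in Unicode Technical Annex #15 is SUPERSCRIPT ONE (U+00B9) and DIGIT ONE (U+0031). It does not discuss visually indistinguishable characters.
I am curious whether visually indistinguishable characters are compatibility equivalent under the standard.
Thanks.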
tch*_*ist 21
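ᴇᴅɪᴛ: Added the exact answer to the original question at the bottom. It is really cool.
The answer to the question about ʀᴏᴍᴀɴ ɴᴜᴍᴇʀᴀʟ ᴏɴᴇ and ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ is yes. Here is a quick way to check: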
$ perl -Mcharnames=:full -MUnicode::Normalize -le 'print
NFKD "\N{ROMAN NUMERAL ONE}" eq NFKD "\N{LATIN CAPITAL LETTER I}"'
1
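For anyone without Perl at hand, the same check can be sketched in Python with the standard-library `unicodedata` module. This is a translation of the one-liner above rather than part of the original answer, and the helper name `nfkd_equal` is made up for illustration.

```python
import unicodedata


def nfkd_equal(a: str, b: str) -> bool:
    """Return True if a and b have the same NFKD normalization."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# ROMAN NUMERAL ONE (U+2160) has a compatibility decomposition to U+0049.
print(nfkd_equal("\N{ROMAN NUMERAL ONE}", "\N{LATIN CAPITAL LETTER I}"))  # True
```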
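However, the answer to the question of whether visually indistinguishable characters are compatibility equivalent is definitely not!
For example, ᴄʜᴇʀᴏᴋᴇᴇ ʟᴇᴛᴛᴇʀ ɢᴏ (Ꭺ) looks like ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (A), but it is certainly not an NFKD equivalent. Similarly, ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀʟᴘʜᴀ (Α) and ᴄʏʀɪʟʟɪᴄ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (А) are not NFKD equivalents. There are actually a great many of these (well, more than I can count :). For example, the only code points whose NFKD equals that of ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ are: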
U+00041 A GC=Lu SC=Latin LATIN CAPITAL LETTER A
U+01D2C ᴬ GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
U+024B6 Ⓐ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER A
U+0FF21 Ａ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER A
U+1D400 𝐀 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL A
U+1D434 𝐴 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL A
U+1D468 𝑨 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL A
U+1D49C 𝒜 GC=Lu SC=Common MATHEMATICAL SCRIPT CAPITAL A
U+1D4D0 𝓐 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL A
U+1D504 𝔄 GC=Lu SC=Common MATHEMATICAL FRAKTUR CAPITAL A
U+1D538 𝔸 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL A
U+1D56C 𝕬 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL A
U+1D5A0 𝖠 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL A
U+1D5D4 𝗔 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL A
U+1D608 𝘈 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL A
U+1D63C 𝘼 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A
U+1D670 𝙰 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL A
U+1F130 🄰 GC=So SC=Common SQUARED LATIN CAPITAL LETTER A
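A listing like the one above can be approximated by brute force: normalize every code point and keep the ones whose NFKD form matches the target's. The sketch below does that in Python. It is not the tool that produced the listing, the function name `nfkd_equivalents` is hypothetical, and the exact result depends on the Unicode version bundled with your Python build.

```python
import sys
import unicodedata


def nfkd_equivalents(target: str) -> list[str]:
    """Every single code point whose NFKD form equals that of target."""
    want = unicodedata.normalize("NFKD", target)
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.normalize("NFKD", chr(cp)) == want
    ]


for ch in nfkd_equivalents("A"):
    # The second argument to name() is a fallback for unnamed code points.
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):05X} {ch} GC={unicodedata.category(ch)} {name}")
```

The scan covers a little over a million code points, so it takes a moment to run. The script property (SC=) is left out because `unicodedata` does not expose it.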
Likewise, here are the code points that are NFKD equivalent to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ:
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+01D35 ᴵ GC=Lm SC=Latin MODIFIER LETTER CAPITAL I
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+024BE Ⓘ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER I
U+0FF29 Ｉ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+1F138 🄸 GC=So SC=Common SQUARED LATIN CAPITAL LETTER I
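Notice that there is no ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪᴏᴛᴀ in that list, to take just one example.
Not only can you not use NFKD to find true look-alikes, there are also things that are NFKD equivalents that do not look alike at all. So in the general case you cannot do this. It is not a problem you can even begin to address without looking at actual fonts.
I believe ICU has an extended, non-standard property along the lines of \p{X-Confusable=A}. I downloaded their data files for this, but have not played with it yet.
As it turns out, UTS#39, Unicode Security Mechanisms, is exactly what you are looking for. If you grab its raw plaintext data files, you will be able to determine which code points are potentially confusable with one another.
For example, earlier in this message I enumerated the code points that are NFKD equivalent to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ, and pointed out that many potential confusables were missing from that set. That is because the NFKD mapping is not designed to detect confusables. The UTS#39 data files, however, are designed for exactly that purpose.
To redo my ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ enumeration, updated to cover every code point that UTS#39 considers confusable with it, we get the following, formatted with unichars and sorted according to the Unicode Collation Algorithm with ucsort: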
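Both halves of that claim are easy to spot-check. The following Python sketch compares a few look-alike pairs that are not NFKD equivalent with a few NFKD-equivalent pairs that look quite different. The particular pairs are picked from the characters mentioned in this answer, and the helper name `nfkd` is just a local shorthand.

```python
import unicodedata


def nfkd(s: str) -> str:
    """Shorthand for NFKD normalization."""
    return unicodedata.normalize("NFKD", s)


# Look alike in many fonts, yet are NOT NFKD equivalent.
look_alikes = [
    ("\N{LATIN CAPITAL LETTER A}", "\N{CHEROKEE LETTER GO}"),
    ("\N{LATIN CAPITAL LETTER A}", "\N{GREEK CAPITAL LETTER ALPHA}"),
    ("\N{LATIN CAPITAL LETTER A}", "\N{CYRILLIC CAPITAL LETTER A}"),
]

# ARE NFKD equivalent, yet do not look much alike.
unlike_equivalents = [
    ("\N{LATIN CAPITAL LETTER I}", "\N{CIRCLED LATIN CAPITAL LETTER I}"),
    ("\N{LATIN CAPITAL LETTER I}", "\N{BLACK-LETTER CAPITAL I}"),
]

for a, b in look_alikes + unlike_equivalents:
    same = nfkd(a) == nfkd(b)
    print(f"{unicodedata.name(a)} / {unicodedata.name(b)}: NFKD-equal={same}")
```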
U+0007C | GC=Sm SC=Common VERTICAL LINE
U+02223 ∣ GC=Sm SC=Common DIVIDES
U+0FFE8 ￨ GC=So SC=Common HALFWIDTH FORMS LIGHT VERTICAL
U+00031 1 GC=Nd SC=Common DIGIT ONE
U+1D7CF 𝟏 GC=Nd SC=Common MATHEMATICAL BOLD DIGIT ONE
U+1D7D9 𝟙 GC=Nd SC=Common MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
U+1D7E3 𝟣 GC=Nd SC=Common MATHEMATICAL SANS-SERIF DIGIT ONE
U+1D7ED 𝟭 GC=Nd SC=Common MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
U+1D7F7 𝟷 GC=Nd SC=Common MATHEMATICAL MONOSPACE DIGIT ONE
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+0FF29 Ｉ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+00196 Ɩ GC=Lu SC=Latin LATIN CAPITAL LETTER IOTA
U+0006C l GC=Ll SC=Latin LATIN SMALL LETTER L
U+0FF4C ｌ GC=Ll SC=Latin FULLWIDTH LATIN SMALL LETTER L
U+0217C ⅼ GC=Nl SC=Latin SMALL ROMAN NUMERAL FIFTY
U+02113 ℓ GC=Ll SC=Common SCRIPT SMALL L
U+1D425 𝐥 GC=Ll SC=Common MATHEMATICAL BOLD SMALL L
U+1D459 𝑙 GC=Ll SC=Common MATHEMATICAL ITALIC SMALL L
U+1D48D 𝒍 GC=Ll SC=Common MATHEMATICAL BOLD ITALIC SMALL L
U+1D4C1 𝓁 GC=Ll SC=Common MATHEMATICAL SCRIPT SMALL L
U+1D4F5 𝓵 GC=Ll SC=Common MATHEMATICAL BOLD SCRIPT SMALL L
U+1D529 𝔩 GC=Ll SC=Common MATHEMATICAL FRAKTUR SMALL L
U+1D55D 𝕝 GC=Ll SC=Common MATHEMATICAL DOUBLE-STRUCK SMALL L
U+1D591 𝖑 GC=Ll SC=Common MATHEMATICAL BOLD FRAKTUR SMALL L
U+1D5C5 𝗅 GC=Ll SC=Common MATHEMATICAL SANS-SERIF SMALL L
U+1D5F9 𝗹 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD SMALL L
U+1D62D 𝘭 GC=Ll SC=Common MATHEMATICAL SANS-SERIF ITALIC SMALL L
U+1D661 𝙡 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL L
U+1D695 𝚕 GC=Ll SC=Common MATHEMATICAL MONOSPACE SMALL L
U+001C0 ǀ GC=Lo SC=Latin LATIN LETTER DENTAL CLICK
U+00399 Ι GC=Lu SC=Greek GREEK CAPITAL LETTER IOTA
U+1D6B0 𝚰 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL IOTA
U+1D6EA 𝛪 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL IOTA
U+1D724 𝜤 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL IOTA
U+1D75E 𝝞 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL IOTA
U+1D798 𝞘 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL IOTA
U+02C92 Ⲓ GC=Lu SC=Coptic COPTIC CAPITAL LETTER IAUDA
U+00406 І GC=Lu SC=Cyrillic CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
U+004C0 Ӏ GC=Lu SC=Cyrillic CYRILLIC LETTER PALOCHKA
U+005D5 ו GC=Lo SC=Hebrew HEBREW LETTER VAV
U+005DF ן GC=Lo SC=Hebrew HEBREW LETTER FINAL NUN
U+007CA ߊ GC=Lo SC=Nko NKO LETTER A
U+02D4F ⵏ GC=Lo SC=Tifinagh TIFINAGH LETTER YAN
U+0A4F2 ꓲ GC=Lo SC=Lisu LISU LETTER I
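If you want to build sets like this yourself, here is a minimal Python sketch of reading a mapping file laid out like UTS#39's confusables.txt (source code points, then target code points, separated by semicolons, with `#` starting a comment). The `SAMPLE` text is a hand-written stand-in in that layout, not the real data file, and the function names `parse_confusables` and `confusable_sets` are hypothetical.

```python
from collections import defaultdict

# Hand-written sample in the confusables.txt layout; NOT the real data file.
SAMPLE = """\
# source ; target ; type # comment
0031 ; 006C ; MA # DIGIT ONE -> LATIN SMALL LETTER L
0049 ; 006C ; MA # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
0399 ; 006C ; MA # GREEK CAPITAL LETTER IOTA -> LATIN SMALL LETTER L
2160 ; 006C ; MA # ROMAN NUMERAL ONE -> LATIN SMALL LETTER L
"""


def hex_to_str(field: str) -> str:
    """Turn space-separated hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in field.split())


def parse_confusables(lines) -> dict[str, str]:
    """Map each source string to its target (prototype) string."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) >= 2:
            mapping[hex_to_str(fields[0])] = hex_to_str(fields[1])
    return mapping


def confusable_sets(mapping: dict[str, str]) -> dict[str, set[str]]:
    """Group sources by shared prototype; the prototype joins its own set."""
    groups = defaultdict(set)
    for source, target in mapping.items():
        groups[target].update({source, target})
    return groups


mapping = parse_confusables(SAMPLE.splitlines())
print(sorted(confusable_sets(mapping)["l"]))
```

When reading the real file from disk, opening it with `encoding="utf-8-sig"` avoids tripping over a leading byte order mark.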
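Pretty as that is, it gets even better. The data files cover not only single-code-point confusables but also confusables that take multiple code points in some cases. For example, here is one such set, this time in the file's native format: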
# C̦ С̡ Ç Ҫ
(C̦) 0043 0326 LATIN CAPITAL LETTER C, COMBINING COMMA BELOW
← (С̡) 0421 0321 CYRILLIC CAPITAL LETTER ES, COMBINING PALATALIZED HOOK BELOW
← (Ç) 00C7 LATIN CAPITAL LETTER C WITH CEDILLA # →Ҫ→→С̡→
← (Ҫ) 04AA CYRILLIC CAPITAL LETTER ES WITH DESCENDER # →С̡→
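Multi-code-point cases like this are why UTS#39 compares strings by a "skeleton" instead of character by character: decompose to NFD, replace each character with its prototype, then apply NFD again. Here is a simplified Python sketch of that idea. The entries in `SAMPLE_MAP` are assumptions chosen to reproduce the set shown above, not values read from the real data file, and the sketch skips refinements such as the handling of ignorable code points.

```python
import unicodedata

# ASSUMED prototype mappings for illustration; the real ones come from UTS#39 data.
SAMPLE_MAP = {
    "\u0421": "C",        # CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C
    "\u0321": "\u0326",   # COMBINING PALATALIZED HOOK BELOW -> COMBINING COMMA BELOW
    "\u0327": "\u0326",   # COMBINING CEDILLA -> COMBINING COMMA BELOW
    "\u04AA": "C\u0326",  # CYRILLIC CAPITAL LETTER ES WITH DESCENDER -> C + comma below
}


def skeleton(text: str, mapping: dict[str, str]) -> str:
    """Simplified skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a: str, b: str, mapping: dict[str, str]) -> bool:
    """Two strings count as confusable when their skeletons are identical."""
    return skeleton(a, mapping) == skeleton(b, mapping)


members = ["C\u0326", "\u0421\u0321", "\u00C7", "\u04AA"]
for m in members:
    print(" ".join(f"{ord(c):04X}" for c in m), confusable(m, members[0], SAMPLE_MAP))
```

Note that the precomposed Ç only joins the set because the first NFD step splits it into C plus a combining cedilla before the mapping is applied.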
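Isn't that swell? The only hitch is that unless you use the ICU classes, you have to roll your own from the UTS#39 data files.
Since I have no other language bindings for it, I have added to my ᴛᴏᴅᴏ list creating Perl bindings that mimic the ICU-style \p{X-Confusable=I} notation in the regex engine.
Note that you may also want to consider UTS#36 along with UTS#39, both of which the ICU SpoofChecker class handles for you. It is designed specifically for URI-type things (read: Internet identifiers that use a restricted character set), not for any old arbitrary text.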