Normalizing (composing and decomposing) UTF-8 strings in Swift

Jac*_*cob 1 string unicode unicode-normalization swift

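Accented characters in a Unicode string can be represented in a "short" (composed) or a "long" (decomposed) form. As a result, in Xcode string a is 8 bytes long in UTF-8 while string b is 10 bytes long, even though the two look identical: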

    import Foundation

    let a: String = "δέκα" // precomposed: 8 bytes in UTF-8
    print(a.data(using: String.Encoding.utf8)!.count)

    let b: String = "δε\u{0301}κα" // decomposed (ε + U+0301): 10 bytes in UTF-8
    print(b.data(using: String.Encoding.utf8)!.count)

I need to "shrink" strings so that they are always in the shorter, composed form. How is this done in Swift?


Footnote: I know the accents can be stripped entirely, as shown below. I don't want that; I only want to compose the characters.

    let usPosixLocale = Locale(identifier: "en_US_POSIX")
    // Strips case and diacritics entirely, which is not what I want here.
    let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)

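I know about the .widthInsensitive option, but the documentation seems to indicate that it only applies to East Asian characters. Specifically, it has no effect on composed or decomposed characters: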

    let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)
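
For comparison, here is a minimal sketch of what .widthInsensitive does change: it maps full-width forms to their ordinary counterparts. The sample string and variable names are made up for illustration.

```swift
import Foundation

// Full-width Latin letters and digits (U+FF21..., U+FF11...), as used in East Asian text.
let wide = "\u{FF21}\u{FF22}\u{FF23}\u{FF11}\u{FF12}\u{FF13}"

// Width folding maps them to plain ASCII; it is unrelated to composing accents.
let narrow = wide.folding(options: [.widthInsensitive], locale: nil)
print(wide, "->", narrow)
```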

Update


Here is a second, longer version of the code that spells out the bytes to make the difference clear.

    import Foundation

    let a: String = String(bytes: [206, 180, 206, 173, 206, 186, 206, 177], encoding: .utf8)!
    print(a, a.data(using: String.Encoding.utf8)!.count)

    let b: String = String(bytes: [206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding: .utf8)!
    print(b, b.data(using: String.Encoding.utf8)!.count)

    let usPosixLocale = Locale(identifier: "en_US_POSIX")
    let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
    print(out.data(using: String.Encoding.utf8)!.count)

Mar*_*n R 5

Use precomposedStringWithCanonicalMapping to normalize:

    import Foundation

    let a = "δέκα"
    print(a, Data(a.utf8).count) // δέκα 8

    let b = "δε\u{0301}κα"
    print(b, Data(b.utf8).count) // δέκα 10 (decomposed, renders the same)

    let bn = b.precomposedStringWithCanonicalMapping
    print(bn, Data(bn.utf8).count) // δέκα 8
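
The opposite direction, from composed to decomposed (Normalization Form D), is available through decomposedStringWithCanonicalMapping. A minimal sketch reusing the same word; the variable names are only illustrative:

```swift
import Foundation

// δέκα with a precomposed έ (U+03AD)
let composed = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}"

// NFD splits έ into ε (U+03B5) followed by the combining acute accent (U+0301).
let decomposed = composed.decomposedStringWithCanonicalMapping

// NFC puts it back together.
let recomposed = decomposed.precomposedStringWithCanonicalMapping

print(composed.utf8.count)    // 8
print(decomposed.utf8.count)  // 10
print(recomposed.utf8.count)  // 8
```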

A "literal" comparison shows that a is identical to bn but differs from b:

    print(b.compare(a, options: .literal) == .orderedSame)  // false
    print(bn.compare(a, options: .literal) == .orderedSame) // true
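
Note that Swift's own == on String compares by canonical equivalence, so it does not reveal this difference; the scalar and UTF-8 views do. A short sketch (the names nfc and nfd are illustrative):

```swift
let nfc = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}"          // precomposed έ
let nfd = "\u{03B4}\u{03B5}\u{0301}\u{03BA}\u{03B1}"  // ε + combining acute

print(nfc == nfd)                                          // true
print(nfc.count, nfd.count)                                // 4 4
print(nfc.unicodeScalars.count, nfd.unicodeScalars.count)  // 4 5
print(nfc.utf8.elementsEqual(nfd.utf8))                    // false
```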

Remarks: precomposedStringWithCanonicalMapping produces "Unicode Normalization Form C". There is also precomposedStringWithCompatibilityMapping, which produces "Unicode Normalization Form KC". See the Unicode standard for the precise differences. Roughly speaking, the latter folds away more distinctions, including ones that are often appropriate to preserve. Examples:

    let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
    print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
    // ﬁ ﬁ fi

    let d = "2\u{2075}"
    print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
    // 2⁵ 2⁵ 25

    let e = "\u{2165}" // ROMAN NUMERAL SIX
    print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
    // Ⅵ Ⅵ VI
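
One way to read these examples: Form C is usually the safer choice when the text itself has to be preserved, while Form KC is better suited to loose matching such as search keys. The sketch below illustrates that split; the helper names storageForm and searchKey are hypothetical and not part of Foundation.

```swift
import Foundation

extension String {
    /// Hypothetical helper: NFC, leaves ligatures, superscripts and similar variants intact.
    var storageForm: String { precomposedStringWithCanonicalMapping }

    /// Hypothetical helper: NFKC, folds compatibility variants for loose matching.
    var searchKey: String { precomposedStringWithCompatibilityMapping }
}

// "ﬁle" written with LATIN SMALL LIGATURE FI
let word = "\u{FB01}le"

print(word.storageForm == "file")  // false
print(word.searchKey == "file")    // true
```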