Why isn't locale.strxfrm("Gè") a prefix of locale.strxfrm("Gène") with the locale "fr_FR.UTF-8"?

Question


Bas*_*uet 11 python string comparison locale utf-8

The code here is written in Python, but the behaviour should be the same when using locale from C/C++.

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
>>> locale.strxfrm("Gène").startswith(locale.strxfrm("Gè"))
False


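I know it isn't meant to be used that way, but I would like to understand what is going on...

Context:
I have an array of strxfrm-transformed strings and a plain input text. I want to know which of the strxfrm-transformed strings started with that text before transformation. Is that feasible at all? How?

Bonus question:

Can we get the list of equivalent letters for each locale? Can we check whether two strings are equivalent?

What I mean is:
With "de_DE.UTF8", can I get something like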

locale.strxfrm("Wissen").startswith(locale.strxfrm("Wiß"))

return True?

Because "ß" and "ss" are equivalent for sorting (unless that is the only difference):

>>> locale.strxfrm("Wiessen") < locale.strxfrm("Wießen") < locale.strxfrm("Wiessen0")
True

The same goes for "œ" and "oe" in French.

EDIT: Regarding the bonus, I have seen "Python locale-aware string comparison", but the answers there rely on third-party libraries, so I came up with a hacked workaround function:

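def isEquivalent(str1, str2):
    # Hack: the strings count as equivalent when
    # str2[:-1] < str1 <= str2 < str1 + "0" in collation order,
    # or when the same holds with the roles of str1 and str2 swapped.
    return ( locale.strxfrm(str2[:-1]) < locale.strxfrm(str1) <= locale.strxfrm(str2) < locale.strxfrm(str1+"0")
    or
    locale.strxfrm(str1[:-1]) < locale.strxfrm(str2) <= locale.strxfrm(str1) < locale.strxfrm(str2+"0") )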


Answer 1

Dim*_*nek 3

A very interesting question! This answer is not canonical; I think glibc-dev would be the best forum for this.

TL;DR

The only requirement on strxfrm is:

sign(strcmp(strxfrm(a), strxfrm(b))) == sign(strcoll(a, b))

This lets strxfrm export the relative ordering of things to another (dumber) system, for example to maintain a secondary index in a database table.
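
As a rough illustration of that contract in Python (a sketch, not part of the original answer): sorting by locale.strxfrm keys with plain comparison should give the same order as sorting with locale.strcoll. The helper name same_order and the word list are made up for this example, and the "C" locale is used only because it is always available.

```python
import functools
import locale

def same_order(words):
    """Return True if strxfrm keys and strcoll agree on the order of `words`."""
    by_key = sorted(words, key=locale.strxfrm)
    by_coll = sorted(words, key=functools.cmp_to_key(locale.strcoll))
    return by_key == by_coll

# The "C" locale always exists; swap in "fr_FR.UTF-8" if it is installed.
locale.setlocale(locale.LC_COLLATE, "C")
words = ["Gene", "G", "Ge", "gene", "Zebra"]
print(same_order(words))
```

Note that this only checks relative order. Nothing in the contract says anything about prefixes of the transformed strings.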

Let's test it

Let's check Python 3 (Python 3.9, OSX, composed normal form):

>>> locale.strxfrm(unicodedata.normalize("NFC", "Gène"))
'Jëqh\x01Jëqh'
>>> locale.strxfrm(unicodedata.normalize("NFC", "Gè"))
'Jë\x01Jë'

If you split the output on the <SOH> byte (\x01), each piece of the shorter key is in fact a valid prefix of the corresponding piece of the longer one.
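
To make that concrete, here is a hedged sketch that splits two keys on the \x01 byte and compares them section by section. It assumes the key layout seen in the session above (sections separated by \x01), which is an observation about this one platform rather than anything strxfrm guarantees; sectionwise_startswith is a hypothetical helper name.

```python
def sectionwise_startswith(key, prefix_key, sep="\x01"):
    """Check whether every section of prefix_key is a prefix of the
    corresponding section of key.

    Assumes both keys use `sep` between sections, as in the output
    observed above; this layout is not guaranteed by strxfrm.
    """
    key_parts = key.split(sep)
    prefix_parts = prefix_key.split(sep)
    if len(key_parts) != len(prefix_parts):
        return False
    return all(k.startswith(p) for k, p in zip(key_parts, prefix_parts))

# Keys copied from the session above (Python 3.9 on OSX, NFC input).
gene_key = "Jëqh\x01Jëqh"
ge_key = "Jë\x01Jë"

print(gene_key.startswith(ge_key))                # False
print(sectionwise_startswith(gene_key, ge_key))   # True
```

On another platform or libc the separator and the number of sections may well differ, so treat this as an experiment rather than a portable prefix test.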

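I don't know the significance of the essentially duplicated output on either side of the separator.

Python 3 with NFD input seems to follow the same semantics, but the output is different; I guess that just underscores the importance of normalising text: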

\n\n

>>> locale.strxfrm(unicodedata.normalize("NFD", "Gène"))
'Jhăqh\x01JhЃqh'
>>> locale.strxfrm(unicodedata.normalize("NFD", "Gè"))
'Jhă\x01JhЃ'
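
The difference comes from the input itself: in NFC, "è" is one code point, while in NFD it is "e" followed by a combining grave accent, so the two forms are different strings before strxfrm ever sees them. A small sketch of normalising both sides to one form before comparing (the choice of NFC and the helper name nfc are assumptions made for this example, not something prescribed above):

```python
import unicodedata

def nfc(text):
    """Normalise text to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)

composed = "G\u00e8ne"      # "è" as a single code point
decomposed = "Ge\u0300ne"   # "e" followed by U+0300 COMBINING GRAVE ACCENT

print(len(composed), len(decomposed))      # 4 5
print(composed == decomposed)              # False
print(nfc(composed) == nfc(decomposed))    # True
```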

Other scripts have funkier output; here is Japanese in a Japanese locale:

>>> locale.strxfrm(unicodedata.normalize("NFC", "村上  春樹"))
'ăă#ăă\x01桔伍#木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上春樹"))
'ăăăă\x01桔伍木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上"))
'ăă\x01桔伍'
>>> 'ăăăă\x01桔伍木欼' > 'ăă#ăă\x01桔伍#木欼' > 'ăă\x01桔伍'
True

Python 2 has a different format, in which the content is also duplicated, but it is not clear how to detect the separator. So let's not use Python 2; it is end-of-life anyway:

>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gène").encode("utf-8"))
'0019003Z001`001W00000019003Z001`001W'
>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gè").encode("utf-8"))
'0019003Z00000019003Z'

JavaScript has the Intl module, which provides collation (sorting) via new Intl.Collator(...).compare(), but as far as I know it does not expose an equivalent of strxfrm. I wonder whether there is some fundamental difficulty with that. I wish such a function were available, for example to build custom IndexedDB indexes, but alas!

Archived:	10 years, 9 months ago
Viewed:	259 times
Last activity:	5 years, 8 months ago