Python 3 and combining diacritical marks

Question


Clu*_*ain 7 unicode diacritics python-3.x

I'm running into a Unicode problem in Python 3, and I can't work out why it happens.


symbol = "ῇ̣"
print(len(symbol))
# prints 2


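This letter comes from a word, ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, which contains combining diacritical marks. I want to run a statistical analysis in Python 3 and store the results in a database. The problem is that I also store the position (index) of each character in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, which throws off the whole index.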


The project requires me to keep the diacritics, so I can't simply ignore them or call .replace("combining diacritical mark", "") on the string.


Since Python 3 uses Unicode as the default for strings, I'm a bit confused by this.


I tried the base(), strip() and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but that didn't help either.


The project requirements are:


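Detect the alphabet a character belongs to (OK)
Store the string position, needed for highlighting in the database (Not OK)
Handle multiple languages/alphabets mixed in one string (OK)
Iterate over CSV input (OK)
Ignore a predefined set of strings (OK)
Ignore sets of strings that match certain criteria (OK)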


Here is a simplified version of the project's code:


# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything, replacing the predefined set of strings by equal-length '-',...)
        ### then I use the ad-module to detect the language by looping over my characters, this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)


If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example for the for loop, my result is:


>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
...     print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ


How can I make Python treat a letter with a combining diacritical mark as one letter, instead of printing the letter and the mark separately?


Answer 1

Gia*_*zzi 4

The length of that string is 2, so the result is correct. It contains two code points:

>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']


So you should not use len to count characters.
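To make concrete what len does and does not measure, here is a small sketch that compares three counts for the same string. The variable names are only illustrative, and the string is written with escapes so that the two code points stay visible.

```python
# U+1FC7 (eta with perispomeni and ypogegrammeni) followed by
# U+0323 (combining dot below), written with escapes for clarity.
symbol = "\u1fc7\u0323"

code_points = len(symbol)                           # what len() counts
utf8_bytes = len(symbol.encode("utf-8"))            # storage size in UTF-8
utf16_units = len(symbol.encode("utf-16-le")) // 2  # UTF-16 code units

print(code_points, utf8_bytes, utf16_units)  # 2 5 2
```

None of these numbers is the "one character" a reader sees, which is why a different counting rule is needed when positions have to match what another application reports.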

You can count only the characters that are not combining, like this:

>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1


From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to Python 3).
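Since the question also needs character positions, the same idea can be extended to group each base character with the combining marks that follow it. This is only a sketch: split_clusters is a hypothetical helper name, not part of any library, and it assumes that "one character" means a base code point plus its trailing combining marks, which may not match how a given database counts.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Returns a list of (start, cluster) pairs, where start is the code point
    index of the base character in the original string.
    """
    clusters = []
    for index, ch in enumerate(text):
        if unicodedata.combining(ch) and clusters:
            # Attach the combining mark to the previous cluster.
            start, cluster = clusters[-1]
            clusters[-1] = (start, cluster + ch)
        else:
            clusters.append((index, ch))
    return clusters


# The word from the question, written with escapes.
word = "\u1f10\u0323\u03bd\u0323\u03c4\u0323\u1fc7\u0323[\u03b1\u1f50\u03c4]\u1fc7"
clusters = split_clusters(word)

print(len(word))      # 14 code points
print(len(clusters))  # 10 visible characters

# position is the index a "one character per letter" count would use;
# start is the index into the Python string.
for position, (start, cluster) in enumerate(clusters):
    print(position, start, cluster)
```

In the question's loop, the alphabet could then be detected on the first code point of each cluster, while position is what gets stored in the database.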

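But this is not an optimal solution either, depending on what you want to count as a character. I think it is enough in your case, but fonts can merge characters into ligatures. In some languages these ligatures are visually new (and very different) characters, unlike ligatures in Western languages.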

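A final remark: I think you should normalize the strings. With the code above it doesn't matter in this case, but in other cases you may get different results, especially if someone uses compatibility characters (e.g. the mu used for units, or Eszett, instead of the true Greek characters).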
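As an illustration of the normalization remark, the sketch below uses unicodedata.normalize from the standard library. visible_length is a hypothetical helper that repeats the counting approach above, and the choice between NFC, NFD and NFKC is an assumption that depends on what the database stores.

```python
import unicodedata


def visible_length(text):
    """Count the code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# The same letter can be stored composed or decomposed.
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# Compatibility characters: MICRO SIGN becomes GREEK SMALL LETTER MU
# under NFKC, but is left alone by NFC.
micro = "\u00b5"
greek_mu = unicodedata.normalize("NFKC", micro)
print(greek_mu == "\u03bc")  # True

# The symbol from the question: the number of code points depends on the
# normalization form, while the visible length does not.
symbol = "\u1fc7\u0323"
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, symbol)
    print(form, len(normalized), visible_length(normalized))
```

Normalizing both the Python side and the stored text to the same form before indexing keeps the positions comparable.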

Archived: 7 years, 3 months ago
Views: 1686
Last activity: 7 years, 3 months ago