Min*_*ark 462 python unicode diacritics python-2.x python-3.x
我在Python中有一个Unicode字符串,我想删除所有的重音符号(变音符号).
我在Web上发现了一种在Java中执行此操作的优雅方法:
我是否需要安装pyICU等库?或者只使用python标准库?那python 3怎么样?
重要说明:我想避免代码使用重音字符到非重音符号的显式映射.
Chr*_*ard 400
Unidecode是对此的正确答案.它将任何unicode字符串音译为ascii文本中最接近的可能表示形式.
例:
accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'
Run Code Online (Sandbox Code Playgroud)
oef*_*efe 263
这个怎么样:
import unicodedata
def strip_accents(s):
return ''.join(c for c in unicodedata.normalize('NFD', s)
if unicodedata.category(c) != 'Mn')
Run Code Online (Sandbox Code Playgroud)
这也适用于希腊字母:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
Run Code Online (Sandbox Code Playgroud)
该字符类别 "锰"表示Nonspacing_Mark,这是类似于MiniQuark的答案unicodedata.combining(我没想到unicodedata.combining的,但它可能是更好的解决方案,因为它更明确).
请记住,这些操作可能会显着改变文本的含义.口音,变音等不是"装饰".
Min*_*ark 137
我刚刚在网上找到了这个答案:
import unicodedata
def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
only_ascii = nfkd_form.encode('ASCII', 'ignore')
return only_ascii
Run Code Online (Sandbox Code Playgroud)
它运行正常(例如法语),但我认为第二步(删除重音符号)可以比删除非ASCII字符更好地处理,因为对于某些语言(例如希腊语)会失败.最好的解决方案可能是明确删除被标记为变音符号的unicode字符.
编辑:这样做的诀窍:
import unicodedata
def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
Run Code Online (Sandbox Code Playgroud)
unicodedata.combining(c)如果角色c可以与前面的角色组合,则返回true ,主要是因为它是变音符号.
编辑2:remove_accents需要一个unicode字符串,而不是一个字节字符串.如果你有一个字节字符串,那么你必须将它解码为一个unicode字符串,如下所示:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
Run Code Online (Sandbox Code Playgroud)
hex*_*Jer 35
实际上我在项目兼容的python 2.6,2.7和3.4上工作,我必须从免费用户条目创建ID.
多亏了你,我创造了这个能创造奇迹的功能.
import re
import unicodedata
def strip_accents(text):
"""
Strip accents from input String.
:param text: The input string.
:type text: String.
:returns: The processed String.
:rtype: String.
"""
try:
text = unicode(text, 'utf-8')
except (TypeError, NameError): # unicode is a default on python 3
pass
text = unicodedata.normalize('NFD', text)
text = text.encode('ascii', 'ignore')
text = text.decode("utf-8")
return str(text)
def text_to_id(text):
"""
Convert input text to id.
:param text: The input string.
:type text: String.
:returns: The processed String.
:rtype: String.
"""
text = strip_accents(text.lower())
text = re.sub('[ ]+', '_', text)
text = re.sub('[^0-9a-zA-Z_-]', '', text)
return text
Run Code Online (Sandbox Code Playgroud)
结果:
text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'
Run Code Online (Sandbox Code Playgroud)
RiG*_*onz 25
在我看来,提出的解决方案不应被接受。最初的问题是要求删除重音符号,因此正确的答案应该只这样做,而不是加上其他未指定的更改。
\n只需观察该代码的结果即可接受答案。我将“M\xc3\xa1laga”更改为“M\xc3\xa1lague\xc3\xb1a”:
\naccented_string = u'M\xc3\xa1lague\xc3\xb1a'\n# accented_string is of type 'unicode'\nimport unidecode\nunaccented_string = unidecode.unidecode(accented_string)\n# unaccented_string contains 'Malaguena'and is of type 'str'\nRun Code Online (Sandbox Code Playgroud)\n还有一个额外的更改 (\xc3\xb1 -> n),OQ 中未请求该更改。
\n一个简单的函数,以较低的形式执行请求的任务:
\ndef f_remove_accents(old):\n """\n Removes common accent characters, lower form.\n Uses: regex.\n """\n new = old.lower()\n new = re.sub(r'[\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5]', 'a', new)\n new = re.sub(r'[\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab]', 'e', new)\n new = re.sub(r'[\xc3\xac\xc3\xad\xc3\xae\xc3\xaf]', 'i', new)\n new = re.sub(r'[\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6]', 'o', new)\n new = re.sub(r'[\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc]', 'u', new)\n return new\nRun Code Online (Sandbox Code Playgroud)\n
len*_*enz 19
这不仅可以处理重音,还可以处理"笔画"(如ø等):
import unicodedata as ud
def rmdiacritics(char):
'''
Return the base character of char, by "removing" any
diacritics like accents or curls and strokes and the like.
'''
desc = ud.name(char)
cutoff = desc.find(' WITH ')
if cutoff != -1:
desc = desc[:cutoff]
try:
char = ud.lookup(desc)
except KeyError:
pass # removing "WITH ..." produced an invalid name
return char
Run Code Online (Sandbox Code Playgroud)
这是我能想到的最优雅的方式(亚历克西斯在本页的评论中已经提到过),尽管我认为它确实不是很优雅.
仍然有一些特殊的字母没有被处理,例如翻转和倒置字母,因为它们的unicode名称不包含'WITH'.这取决于你想要做什么.我有时需要重音剥离来实现字典排序顺序.
ase*_*ram 15
回应@ MiniQuark的回答:
我试图读取一个半法语(包含重音符号)的csv文件以及一些最终会变成整数和浮点数的字符串.作为测试,我创建了一个如下所示的test.txt文件:
蒙特利尔,über,12.89,Mère,Françoise,noël,889
我必须包含行2并3使其工作(我在python票证中找到),以及合并@ Jabba的评论:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
with open('test.txt') as f:
read = csv.reader(f)
for row in read:
for element in row:
print remove_accents(element)
Run Code Online (Sandbox Code Playgroud)
结果:
Montreal
uber
12.89
Mere
Francoise
noel
889
Run Code Online (Sandbox Code Playgroud)
(注意:我使用的是Mac OS X 10.8.4并使用Python 2.7.3)
Pio*_*dal 12
来自Gensim的gensim.utils.deaccent(text)- 人类主题建模:
'?'
另一个解决方案是unidecode.
不与该建议的解决方案unicodedata通常仅在一些字符消除重音符号(例如,它变成''入'l',而不是进入'?').
import unicodedata
from random import choice
import perfplot
import regex
import text_unidecode
def remove_accent_chars_regex(x: str):
return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))
def remove_accent_chars_join(x: str):
# answer by MiniQuark
# https://stackoverflow.com/a/517974/7966259
return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])
perfplot.show(
setup=lambda n: ''.join([choice('Málaga François Phút H?n ??') for i in range(n)]),
kernels=[
remove_accent_chars_regex,
remove_accent_chars_join,
text_unidecode.unidecode,
],
labels=['regex', 'join', 'unidecode'],
n_range=[2 ** k for k in range(22)],
equality_check=None, relative_to=0, xlabel='str len'
)
Run Code Online (Sandbox Code Playgroud)
这是一个简短的函数,它去除变音符号,但保留非拉丁字符。大多数情况(例如"\xc3\xa0"-> "a")由(标准库)处理unicodedata,但有几种情况(例如"\xc3\xa6"-> "ae")依赖于给定的并行字符串。
from unicodedata import combining, normalize\n\nLATIN = "\xc3\xa4 \xc3\xa6 \xc7\xbd \xc4\x91 \xc3\xb0 \xc6\x92 \xc4\xa7 \xc4\xb1 \xc5\x82 \xc3\xb8 \xc7\xbf \xc3\xb6 \xc5\x93 \xc3\x9f \xc5\xa7 \xc3\xbc "\nASCII = "ae ae ae d d f h i l o o oe oe ss t ue"\n\ndef remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):\n return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))\nRun Code Online (Sandbox Code Playgroud)\n注意。默认参数仅outliers计算一次,并不意味着由调用者提供。
作为以更 \xe2\x80\x9cnatural\xe2\x80\x9d 顺序对字符串列表进行排序的键:
\nsorted([\'cote\', \'coteau\', "crottez", \'crott\xc3\xa9\', \'c\xc3\xb4te\', \'c\xc3\xb4t\xc3\xa9\'], key=remove_diacritics)\nRun Code Online (Sandbox Code Playgroud)\n输出:
\n[\'cote\', \'c\xc3\xb4te\', \'c\xc3\xb4t\xc3\xa9\', \'coteau\', \'crott\xc3\xa9\', \'crottez\']\nRun Code Online (Sandbox Code Playgroud)\n如果您的字符串混合了文本和数字,您可能有兴趣使用我在其他地方remove_diacritics()提供的函数进行组合。string_to_pairs()
为了确保该行为满足您的需求,请查看下面的全语法:
\nexamples = [\n ("hello, world", "hello, world"),\n ("42", "42"),\n ("\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c", "\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c"),\n (\n "D\xc3\xa8s No\xc3\xabl, o\xc3\xb9 un z\xc3\xa9phyr ha\xc3\xaf me v\xc3\xaat de gla\xc3\xa7ons w\xc3\xbcrmiens, je d\xc3\xaene d\xe2\x80\x99exquis r\xc3\xb4tis de b\xc5\x93uf au kir, \xc3\xa0 l\xe2\x80\x99a\xc3\xbf d\xe2\x80\x99\xc3\xa2ge m\xc3\xbbr, &c\xc3\xa6tera.",\n "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d\xe2\x80\x99exquis rotis de boeuf au kir, a l\xe2\x80\x99ay d\xe2\x80\x99age mur, &caetera.",\n ),\n (\n "Falsches \xc3\x9cben von Xylophonmusik qu\xc3\xa4lt jeden gr\xc3\xb6\xc3\x9feren Zwerg.",\n "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",\n ),\n (\n "\xd0\x89\xd1\x83\xd0\xb1\xd0\xb0\xd0\xb7\xd0\xbd\xd0\xb8 \xd1\x84\xd0\xb5\xd1\x9a\xd0\xb5\xd1\x80\xd1\x9f\xd0\xb8\xd1\x98\xd0\xb0 \xd1\x87\xd0\xb0\xd1\x92\xd0\xb0\xd0\xb2\xd0\xbe\xd0\xb3 \xd0\xbb\xd0\xb8\xd1\x86\xd0\xb0 \xd1\x85\xd0\xbe\xd1\x9b\xd0\xb5 \xd0\xb4\xd0\xb0 \xd0\xbc\xd0\xb8 \xd0\xbf\xd0\xbe\xd0\xba\xd0\xb0\xd0\xb6\xd0\xb5 \xd1\x88\xd1\x82\xd0\xbe\xd1\x81.",\n "\xd1\x99\xd1\x83\xd0\xb1\xd0\xb0\xd0\xb7\xd0\xbd\xd0\xb8 \xd1\x84\xd0\xb5\xd1\x9a\xd0\xb5\xd1\x80\xd1\x9f\xd0\xb8\xd1\x98\xd0\xb0 \xd1\x87\xd0\xb0\xd1\x92\xd0\xb0\xd0\xb2\xd0\xbe\xd0\xb3 \xd0\xbb\xd0\xb8\xd1\x86\xd0\xb0 \xd1\x85\xd0\xbe\xd1\x9b\xd0\xb5 \xd0\xb4\xd0\xb0 \xd0\xbc\xd0\xb8 \xd0\xbf\xd0\xbe\xd0\xba\xd0\xb0\xd0\xb6\xd0\xb5 \xd1\x88\xd1\x82\xd0\xbe\xd1\x81.",\n ),\n (\n "Ljubazni fenjerd\xc5\xbeija \xc4\x8da\xc4\x91avog lica ho\xc4\x87e da mi poka\xc5\xbee \xc5\xa1tos.",\n "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",\n ),\n (\n "Quizdeltagerne spiste jordb\xc3\xa6r med fl\xc3\xb8de, mens cirkusklovnen Walther spillede p\xc3\xa5 xylofon.",\n "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",\n ),\n (\n "K\xc3\xa6mi n\xc3\xbd \xc3\xb6xi h\xc3\xa9r ykist \xc3\xbej\xc3\xb3fum n\xc3\xba b\xc3\xa6\xc3\xb0i v\xc3\xadl og \xc3\xa1drepa.",\n "kaemi ny oexi her ykist \xc3\xbejofum nu baedi vil og adrepa.",\n ),\n (\n "Gl\xc4\x81\xc5\xbe\xc5\xa1\xc4\xb7\xc5\xab\xc5\x86a r\xc5\xab\xc4\xb7\xc4\xab\xc5\xa1i dz\xc4\x93rum\xc4\x81 \xc4\x8diepj Baha koncertfl\xc4\xab\xc4\xa3e\xc4\xbcu v\xc4\x81kus.",\n "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",\n )\n]\n\nfor (given, expected) in examples:\n assert remove_diacritics(given) == expected\nRun Code Online (Sandbox Code Playgroud)\nLATIN = "\xc3\xa4 \xc3\xa6 \xc7\xbd \xc4\x91 \xc3\xb0 \xc6\x92 \xc4\xa7 \xc4\xb1 \xc5\x82 \xc3\xb8 \xc7\xbf \xc3\xb6 \xc5\x93 \xc3\x9f \xc5\xa7 \xc3\xbc \xc3\x84 \xc3\x86 \xc7\xbc \xc4\x90 \xc3\x90 \xc6\x91 \xc4\xa6 I \xc5\x81 \xc3\x98 \xc7\xbe \xc3\x96 \xc5\x92 \xe1\xba\x9e \xc5\xa6 \xc3\x9c "\nASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"\n\ndef remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):\n return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))\n\nRun Code Online (Sandbox Code Playgroud)\n
这里已经有很多答案了,但是之前没有考虑过这一点:using sklearn
\nfrom sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode\n\naccented_string = u\'M\xc3\xa1lague\xc3\xb1a\xc2\xae\'\n\nprint(strip_accents_unicode(accented_string)) # output: Malaguena\xc2\xae\nprint(strip_accents_ascii(accented_string)) # output: Malaguena\n\nRun Code Online (Sandbox Code Playgroud)\n如果您已经在使用 sklearn 来处理文本,这尤其有用。这些是CountVectorizer之类的类在内部调用的函数,用于规范化字符串:使用时调用strip_accents=\'ascii\'then ,使用时调用 then 。strip_accents_asciistrip_accents=\'unicode\'strip_accents_unicode
更多细节
\n最后,考虑其文档字符串中的这些细节:
\nSignature: strip_accents_ascii(s)\nTransform accentuated unicode symbols into ascii or nothing\n\nWarning: this solution is only suited for languages that have a direct\ntransliteration to ASCII symbols.\nRun Code Online (Sandbox Code Playgroud)\n和
\nSignature: strip_accents_unicode(s)\nTransform accentuated unicode symbols into their simple counterpart\n\nWarning: the python-level loop and join operations make this\nimplementation 20 times slower than the strip_accents_ascii basic\nnormalization.\nRun Code Online (Sandbox Code Playgroud)\n
有些语言将变音符号组合为语言字母和重音变音符号来指定重音。
我认为明确指定要删除的变音符号更安全:
def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
accents = set(map(unicodedata.lookup, accents))
chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
return unicodedata.normalize('NFC', ''.join(chars))
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
232575 次 |
| 最近记录: |