Regular expression to find and replace non-ASCII characters in Python

Question

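I need to replace some non-ASCII characters with "_". For example:

Tannh‰user -> Tannh_user

How would I do this with a regular expression in Python?
Is there a better way to do this without using RE?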

Answer 1

int*_*jay 9

re.sub(r'[^\x00-\x7F]', '_', theString)

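This works if theString is a unicode string, or a byte string in an encoding where ASCII occupies 0 to 0x7F (Latin-1, UTF-8, etc.). Note that on a UTF-8 byte string every byte of a multi-byte character is replaced, so one character can become several underscores.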
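
As a minimal sketch of the one-liner above for Python 3 text strings, the pattern can be compiled once and wrapped in a helper. The names `NON_ASCII` and `replace_non_ascii` are hypothetical and not part of the answer.

```python
import re

# Matches any single character outside the 7-bit ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def replace_non_ascii(text, replacement='_'):
    """Replace each non-ASCII character in a str with `replacement`."""
    return NON_ASCII.sub(replacement, text)


print(replace_non_ascii('Tannh\u2030user'))  # Tannh_user
```

Because the input here is a str rather than encoded bytes, each non-ASCII character maps to exactly one underscore.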

Answer 2

Max*_*cia 7

To answer the question:

'[\u0080-\uFFFF]'

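matches any Unicode character from U+0080 through U+FFFF, that is, everything in the Basic Multilingual Plane outside the first 128 code points. Characters above U+FFFF are not covered by this range.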

re.sub('[\u0080-\uFFFF]+', '_', x)

replaces each run of consecutive non-ASCII characters with a single underscore.
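
The difference between the two patterns, and the upper bound of the range, may be easier to see side by side. This is a sketch assuming Python 3 str input; the sample strings and the widened class `[\u0080-\U0010FFFF]` are illustrative additions, not part of the answer.

```python
import re

s = 'na\u00efve \u65e5\u672c\u8a9e text'  # one accented letter, three CJK characters

# Without '+', every non-ASCII character gets its own underscore.
per_char = re.sub('[\u0080-\uFFFF]', '_', s)
# With '+', a whole run of non-ASCII characters collapses into one underscore.
per_run = re.sub('[\u0080-\uFFFF]+', '_', s)

print(per_char)  # na_ve ___ text
print(per_run)   # na_ve _ text

# Characters above U+FFFF (for example emoji) fall outside the class,
# so they pass through unchanged unless the upper bound is widened.
emoji = 'ok \U0001F600'
untouched = re.sub('[\u0080-\uFFFF]+', '_', emoji)
widened = re.sub('[\u0080-\U0010FFFF]+', '_', emoji)
```

Whether collapsing runs is desirable depends on whether the output needs to keep the same length as the input.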

Answer 3

Mes*_*ssa 5

Updated for Python 3:

>>> 'Tannh‰user'.encode().decode('ascii', 'replace').replace(u'\ufffd', '_')
'Tannh___user'

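First we create a byte string with encode(); by default it uses the UTF-8 codec. If you already have a byte string, you can of course skip this encoding step. Then we convert it back to a "normal" string using the ascii codec with the 'replace' error handler, which turns every undecodable byte into U+FFFD.

This relies on a property of UTF-8: every non-ASCII character is encoded as a sequence of bytes whose values are all greater than or equal to 0x80. That is also why the single character ‰, which takes three bytes in UTF-8, becomes three underscores.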
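
A small sketch of how the number of underscores follows the byte length of each character in the chosen encoding. The helper name `replace_via_ascii_decode` is hypothetical, and the input is assumed to be bytes.

```python
def replace_via_ascii_decode(data, replacement='_'):
    """Decode bytes as ASCII; each byte >= 0x80 becomes U+FFFD, then `replacement`."""
    return data.decode('ascii', 'replace').replace('\ufffd', replacement)


# U+2030 is three bytes in UTF-8, U+00E4 is two bytes in UTF-8
# and a single byte in Latin-1.
three = replace_via_ascii_decode('Tannh\u2030user'.encode('utf-8'))
two = replace_via_ascii_decode('Tannh\u00e4user'.encode('utf-8'))
one = replace_via_ascii_decode('Tannh\u00e4user'.encode('latin-1'))

print(three)  # Tannh___user
print(two)    # Tannh__user
print(one)    # Tannh_user
```

If exactly one underscore per character is required, working on the decoded str (as in the regex or ord() approaches) avoids this dependence on the encoding.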

Original answer, for Python 2:

How about using the built-in str.decode method:

>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'

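(You get a unicode string back, so you can convert it to str if needed.)

You can also convert unicode to str, in which case each non-ASCII character is replaced by a single ASCII character. The problem is that unicode.encode with 'replace' converts non-ASCII characters to '?', so you cannot tell which question marks were already there before; see Ignacio Vazquez-Abrams's solution.

Another approach uses ord() and checks whether each character's value fits in the ASCII range (0-127). It works for unicode strings as well as str in UTF-8, Latin-1 and some other encodings: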
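
The ambiguity described above can be reproduced in Python 3, where str.encode plays the role of Python 2's unicode.encode. This is an illustrative sketch; the sample sentence is made up.

```python
text = 'Who is Tannh\u2030user?'

# The built-in 'replace' handler substitutes '?' for each unencodable character.
encoded = text.encode('ascii', 'replace')
print(encoded)  # b'Who is Tannh?user?'

# Swapping every '?' for '_' afterwards also rewrites the genuine question mark.
naive = encoded.decode('ascii').replace('?', '_')
print(naive)  # Who is Tannh_user_
```

This is why a replacement marker that cannot occur in the input, or a custom error handler, is the safer route.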

>>> s = 'Tannh‰user' # or u'Tannh‰user' in Python 2
>>> 
>>> ''.join(c if ord(c) < 128 else '_' for c in s)
'Tannh_user'
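
In Python 3, iterating over bytes yields integers rather than one-character strings, so the ord() version only applies to str. A hedged sketch covering both cases follows; the function names are hypothetical.

```python
def mask_non_ascii_str(text, replacement='_'):
    """Replace each non-ASCII character of a str with `replacement`."""
    return ''.join(c if ord(c) < 128 else replacement for c in text)


def mask_non_ascii_bytes(data, replacement=b'_'):
    """Replace each byte >= 0x80 with the first byte of `replacement`."""
    fill = replacement[0]  # indexing bytes gives an int in Python 3
    return bytes(b if b < 128 else fill for b in data)


print(mask_non_ascii_str('Tannh\u2030user'))                     # Tannh_user
print(mask_non_ascii_bytes('Tannh\u2030user'.encode('utf-8')))   # b'Tannh___user'
```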

Answer 4

Ign*_*ams 5

Use Python's support for character encodings (the code below is written for Python 2):

# coding: utf8
import codecs

def underscorereplace_errors(exc):
  return (u'_', exc.end)

codecs.register_error('underscorereplace', underscorereplace_errors)

print u'Tannh‰user'.encode('ascii', 'underscorereplace')
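
A possible Python 3 adaptation of the same idea, offered as a sketch rather than the answer's own code. It differs in two assumed details: it re-raises anything that is not an encoding error, and it emits one underscore per unencodable character in the reported span, since a handler may be called once for a run of several characters.

```python
import codecs


def underscorereplace_errors(exc):
    # Only handle encoding failures; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start and exc.end delimit the unencodable span of the input.
    return ('_' * (exc.end - exc.start), exc.end)


codecs.register_error('underscorereplace', underscorereplace_errors)

# encode() returns bytes in Python 3, so decode back to get a str.
result = 'Tannh\u2030user'.encode('ascii', 'underscorereplace').decode('ascii')
print(result)  # Tannh_user
```

Unlike the built-in 'replace' handler, the marker here is chosen by the caller, so existing '?' characters in the input stay distinguishable.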

Archived:	15 years, 10 months ago
Views:	11952
Last activity:	7 years, 7 months ago