Python - Replace a non-ASCII character (») in a string

Question

Hyp*_*ion 14 python regex string encoding decoding

I need to replace the character "»" in a string with a space, but I keep getting an error. This is the code I use:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# other code

soup = BeautifulSoup(data, 'lxml')
mystring = soup.find('a').text.replace(' »','')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 13: ordinal not in range(128)

But if I test it in another script:

# -*- coding: utf-8 -*-
a = "hi »"
b = a.replace('»','')

it works. Why is that?

Answer 1

Moi*_*dri 19

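To replace the contents of a string with the str.replace() method, you need to decode the string first, then replace the text, and finally encode it back to the original encoding: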

>>> a = "hi »"
>>> a.decode('utf-8').replace("»".decode('utf-8'), "").encode('utf-8')
'hi '
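
The snippet above relies on Python 2, where a plain string literal is a byte string with a .decode() method. As a hedged sketch of the same decode, replace, encode sequence in Python 3 (an assumption about the reader's interpreter, not part of the original answer; the variable names are illustrative):

```python
# Python 3 sketch: str is already Unicode text, so decoding is only
# needed when the data starts out as bytes (assumed here for illustration).
raw = "hi »".encode("utf-8")      # bytes, e.g. as read from a file or socket

text = raw.decode("utf-8")        # bytes -> str (Unicode text)
cleaned = text.replace("»", "")   # do the replacement on text, not on bytes
result = cleaned.encode("utf-8")  # str -> bytes, only if bytes are required

print(repr(cleaned))              # 'hi '
```

If the value is already a str, as it would be for text extracted by an HTML parser under Python 3, only the replace() call is needed.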

You can also remove all non-ASCII characters from a string with the following regular expression:

>>> import re
>>> re.sub(r'[^\x00-\x7f]',r'', 'hi »')
'hi '

The regex version is the fastest. Instead of `[^\x00-\x7f]`, I used `[^\x20-\x7E]` to also remove the ASCII control characters, 0 through 31 and 127. (2 upvotes)
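
A minimal sketch of the variant described in this comment, written for Python 3. The names NON_PRINTABLE_ASCII and strip_non_printable are hypothetical and chosen only for illustration:

```python
import re

# Matches every character outside printable ASCII
# (space, 0x20, through tilde, 0x7E). Hypothetical name.
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7E]')

def strip_non_printable(text):
    """Remove non-ASCII characters and ASCII control characters."""
    return NON_PRINTABLE_ASCII.sub('', text)

sample = "hi »\t\x7f!"
print(repr(strip_non_printable(sample)))  # 'hi !'
```

Note that this range also strips tabs and newlines, which may or may not be wanted depending on the input.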

Answer 2

bla*_*ite 8

@Moinuddin Quadri's answer fits your use case better, but in general, a simple way to remove non-ASCII characters from a given string is the following:

# the characters '¡' and '¢' are non-ASCII
string = "hello, my name is ¢arl... ¡Hola!"

all_ascii = ''.join(char for char in string if ord(char) < 128)

The result is:

>>> print(all_ascii)
hello, my name is arl... Hola!

You can also do this:

''.join(filter(lambda c: ord(c) < 128, string))

But this is about 30% slower than the char for char ... approach.
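
The speed difference can be checked locally with timeit. The following is a rough sketch for Python 3; the function names and the iteration count are assumptions made for illustration, and the actual ratio will vary by machine and interpreter version:

```python
import timeit

string = "hello, my name is ¢arl... ¡Hola!"

def with_generator():
    # Generator expression, as in the first snippet of this answer.
    return ''.join(char for char in string if ord(char) < 128)

def with_filter():
    # filter() with a lambda, as in the second snippet.
    return ''.join(filter(lambda c: ord(c) < 128, string))

# Both approaches should produce the same text.
assert with_generator() == with_filter()

# Time each approach over the same number of runs (count is arbitrary).
gen_time = timeit.timeit(with_generator, number=10000)
filter_time = timeit.timeit(with_filter, number=10000)
print("generator: %.4fs, filter: %.4fs" % (gen_time, filter_time))
```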
