red*_*vil 12 python string character-encoding pandas
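I have been trying to work through this problem. I want to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is what my data frame looks like: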
+-----------------------------------------------------------
| DB_user                               source   count     |
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                  A        10        |
| ?D$ZGU ;@D??_???T(?)                  B        3         |
| ?Q`H??M'?Y??KTK$?Ù‹???ЩJL4??*?_??    C        2         |
+-----------------------------------------------------------
I am using this function, which I came across while researching the problem on SO:
def filter_func(string):
    for i in range(0, len(string)):
        if (ord(string[i]) < 32 or ord(string[i]) > 126):
            break
    return ''
And then using the apply function:
df['DB_user'] = df.apply(filter_func,axis=1)
I keep getting this error:
'ord() expected a character, but string of length 66 found', u'occurred at index 2'
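However, I thought that by looping inside filter_func I was already passing a single character to ord. So whenever it hits a non-ASCII character, that character should be replaced by a space.
Could somebody help me out?
Thanks!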
Max*_*axU 22
You can try this:
df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
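The pattern above removes every character outside \x00-\x7F, and the + collapses a whole run of them into one replacement. Since the question asks for spaces instead, here is a minimal sketch using the standard re module. The names NON_ASCII, NON_PRINTABLE_ASCII and blank_non_ascii are hypothetical and not part of the original answer.

```python
import re

# Any character outside 7-bit ASCII. Control characters such as tab are kept.
NON_ASCII = re.compile(r'[^\x00-\x7F]')
# Stricter variant: anything outside printable ASCII (space through tilde).
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7E]')


def blank_non_ascii(text, pattern=NON_ASCII):
    """Replace each matching character with one space, keeping the length."""
    return pattern.sub(' ', text)


sample = '\u00d2|zz\tD\u00e9j\u00e0'   # 'Ò|zz<TAB>Déjà'
print(repr(blank_non_ascii(sample)))                       # the tab survives
print(repr(blank_non_ascii(sample, NON_PRINTABLE_ASCII)))  # the tab is blanked too
```

On a pandas column the same pattern should work through df['DB_user'].str.replace(r'[^\x00-\x7F]', ' ', regex=True), assuming a pandas version that accepts the regex argument.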
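Your code fails because you are not applying it to each character. You are applying it to whole values, and ord raises an error because it accepts only a single character. You would need: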
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
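To see why the original call fails, it can help to check what each flavour of apply actually hands to the function. The following is a small sketch with a made-up two-row frame (the data is illustrative, not the asker's):

```python
import pandas as pd

# Illustrative data only; column names follow the question.
df = pd.DataFrame({
    'DB_user': ['\u00d2|zz', 'plain'],
    'source': ['A', 'B'],
    'count': [10, 3],
})

# DataFrame.apply(axis=1) calls the function once per row and passes a Series.
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Series.apply calls the function once per cell and passes the value itself.
cell_types = df['DB_user'].apply(lambda value: type(value).__name__).tolist()

print(row_types)   # ['Series', 'Series']
print(cell_types)  # ['str', 'str']
```

With axis=1, indexing into the argument yields a whole cell (a multi-character string), which is what ord complains about. Selecting the column first gives the function one string at a time, so it can iterate over the characters.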
You can also simplify the join using a chained comparison:
''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
You can also use string.printable to filter the characters:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
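The fastest is to use translate (this is Python 2 code, where string.maketrans operates on byte strings):
from string import maketrans
# every byte value outside 32..126, joined without a separator
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
# map each of those bytes to a space
trans = maketrans(del_chars, " " * len(del_chars))
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))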
Interestingly, that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
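string.maketrans and list-style range concatenation exist only in Python 2. In Python 3, str.translate looks characters up by code point, so a 256-entry table would miss characters such as Щ. One possible sketch is a dict subclass that fills itself lazily. The names PrintableAsciiOrSpace, TABLE and blank_unprintable are hypothetical, and no timing claim is made for this variant.

```python
class PrintableAsciiOrSpace(dict):
    """Lazily filled table for str.translate.

    Code points 32..126 map to themselves; everything else maps to a space.
    """

    def __missing__(self, codepoint):
        replacement = codepoint if 32 <= codepoint <= 126 else ' '
        self[codepoint] = replacement   # cache so each code point is decided once
        return replacement


TABLE = PrintableAsciiOrSpace()


def blank_unprintable(text):
    """Return text with every character outside printable ASCII replaced by a space."""
    return text.translate(TABLE)


print(repr(blank_unprintable('D\u00e9j\u00e0 vu\t\u0429')))
```

Applied to the column, this would be df['DB_user'].apply(blank_unprintable).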
A common trick is to encode to ASCII with the errors="ignore" flag, then decode the result back from ASCII. Note that this drops non-ASCII characters rather than replacing them with spaces:
df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')
For Python 3.x and above, this is my recommended solution.
Minimal Code Sample
s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s
0 Déjà vu
1 Ò|zz
2 ;test 123
dtype: object
s.str.encode('ascii', 'ignore').str.decode('ascii')
0 Dj vu
1 |zz
2 ;test 123
dtype: object
P.S.: This can also be extended to cases where you need to filter out characters that cannot be represented in some other character encoding (not just ASCII).
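As a rough illustration of that extension, the same encode/decode round trip can take the target encoding as a parameter. The helper name keep_encodable is hypothetical, and latin-1 is just an example choice:

```python
def keep_encodable(text, encoding='ascii'):
    """Drop every character that cannot be represented in the given encoding."""
    return text.encode(encoding, 'ignore').decode(encoding)


sample = 'D\u00e9j\u00e0 \u0429 vu'   # 'Déjà Щ vu'

# ASCII drops the accented letters and the Cyrillic letter.
print(repr(keep_encodable(sample)))
# Latin-1 keeps the accented letters and drops only the Cyrillic letter.
print(repr(keep_encodable(sample, 'latin-1')))
```

On a pandas column the equivalent would be df['DB_user'].str.encode('latin-1', 'ignore').str.decode('latin-1').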