Remove non-ASCII characters and replace them with spaces in a Pandas DataFrame

Question

red*_*vil 12 python string character-encoding pandas

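I have been trying to solve this problem for a while. I am trying to remove the non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is what my data frame looks like: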


+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?Ù‹???Ð©JL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I am using this function, which I came across while researching the problem on SO.

def filter_func(string):
   for i in range(0,len(string)):
      if (ord(string[i])< 32 or ord(string[i])>126):
           break

      return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:


'ord() expected a character, but string of length 66 found', u'occurred at index 2'

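However, I thought that by looping inside filter_func I was handling this, since I pass a single char to 'ord'. So whenever it hits a non-ASCII character, that character should be replaced with a space.

Could somebody help me out?

Thanks!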

Answer 1

Max*_*axU 22

You can try this:

df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)

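Great answer; this also works on an entire DataFrame. (3 upvotes)
This performs a slightly different task from the one shown in the question: it accepts all ASCII characters, whereas the sample code in the question rejects non-printable characters by starting from character 32 rather than 0. Replacing the "\x00" in the pattern with a single space character makes this answer match the accepted answer's behavior. (2 upvotes)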
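
Following up on the comments above, here is a minimal sketch, assuming the goal stated in the question (every character outside the printable ASCII range 32 to 126 becomes a space). It assigns the result back to the column rather than relying on inplace=True, and the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual frame
df = pd.DataFrame({"DB_user": ["Déjà vu", "a\x00b\x7fc", "plain"]})

# [^ -~] matches any character outside the printable ASCII range (32-126).
# Without a trailing "+", each offending character maps to its own space.
df["DB_user"] = df["DB_user"].str.replace(r"[^ -~]", " ", regex=True)

print(df["DB_user"].tolist())
```

Keeping the "+" quantifier instead would collapse each run of offending characters into a single space, which may or may not be what you want.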

Answer 2

Pad*_*ham 6

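Your code fails because you are not applying it to each character; you are applying it to whole strings, and ord raises an error because it expects a single character. You would need: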

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
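
To see where the original error comes from, here is a small sketch on a made-up two-column frame. With axis=1, DataFrame.apply hands the function an entire row (a Series), so indexing it yields whole cell values rather than characters, whereas Series.apply on the column passes one string at a time. The helper name to_printable_ascii is hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"DB_user": ["abÒc", "x\x01y"], "source": ["A", "B"]})

# axis=1: the function receives each row as a Series, not a string
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Series.apply: the function receives each cell value of the column
cell_types = df["DB_user"].apply(lambda value: type(value).__name__).tolist()

def to_printable_ascii(text):
    """Replace every character outside 32-126 with a space."""
    return "".join(ch if 32 <= ord(ch) <= 126 else " " for ch in text)

df["DB_user"] = df["DB_user"].apply(to_printable_ascii)
print(row_types, cell_types, df["DB_user"].tolist())
```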

You can also simplify the join using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])

You can also filter the characters using string.printable:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
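
One caveat worth checking before choosing this variant: string.printable also contains the whitespace control characters (tab, newline and so on), so it is slightly more permissive than the 32 to 126 range check. A quick sketch to verify the difference:

```python
from string import printable

range_chars = {chr(i) for i in range(32, 127)}   # what the 32-126 check keeps
extra = set(printable) - range_chars             # kept only by string.printable

print(sorted(extra))
```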

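The fastest is to use translate:

from string import maketrans  # Python 2 only; Python 3 uses str.maketrans

# Characters to replace: control characters (0-31) and bytes 127-255
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))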

Interestingly, it is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)
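
The maketrans snippet in this answer is Python 2 code. A rough Python 3 equivalent is sketched below, under the assumption that every character outside 32 to 126 (including code points above 255) should become a space. The names NonPrintableToSpace and clean are hypothetical, not part of any library.

```python
class NonPrintableToSpace(dict):
    """Translation table that is filled lazily, one code point at a time."""

    def __missing__(self, codepoint):
        # str.translate looks up each character's code point in the table;
        # code points that are not cached yet end up here.
        result = codepoint if 32 <= codepoint <= 126 else " "
        self[codepoint] = result
        return result

table = NonPrintableToSpace()

def clean(text):
    """Replace every character outside printable ASCII with a space."""
    return text.translate(table)

print(repr(clean("Ù‹Ð© ok\t~")))
```

With pandas, this could then be applied per value, for example with df["DB_user"].apply(clean).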

Answer 3

cs9*_*s95 5

A common trick is to encode to ASCII with the errors="ignore" flag and then decode the result back from ASCII:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

For Python 3.x and above, this is my recommended solution.

Minimal Code Sample

s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s

0      Déjà vu
1         Ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that do not belong to a given character encoding (not just ASCII).
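
As a sketch of that extension, assuming Latin-1 as the target encoding (an arbitrary choice for illustration), the same round trip drops only the characters that Latin-1 cannot represent:

```python
import pandas as pd

s = pd.Series(["Déjà vu", "price: 5€", ";test 123"])

# Characters that Latin-1 cannot encode (here the euro sign) are dropped;
# accented Latin letters survive the round trip.
latin1_only = s.str.encode("latin-1", "ignore").str.decode("latin-1")

print(latin1_only.tolist())
```

Note that this approach removes the offending characters rather than replacing them with spaces.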
