Remove specific symbols (unicode) from a Pandas DataFrame column

Question

Mik*_*Sam 3 python string char dataframe pandas

I have a DataFrame (pandas):

data1 = pandas.DataFrame(['??????, ????', '??? ?????', '????!!'])

As you can see, it contains Unicode characters (Cyrillic):

>>> data1
              0
0  ??????, ????
1     ??? ?????
2        ????!!

I am trying to remove all of certain symbols from a DataFrame column. I tried:

data1.apply(replace ???)
data1[0].replace()

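I even tried something with a lambda, but I don't know how to call replace correctly. So I'd like to specify the symbols to remove, either as a set of characters: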

x in '!@#$%^&*()'

or as a range of code points:

if chr(x) not in range(1040,1072) # chr() of cyrillic

Answer 1

Max*_*axU 6

You can use a Unicode-aware RegEx, (?u):

Source DF:

In [30]: df
Out[30]:
                        col
0              ??????, ????
1                 ??? ?????
2              ???? 23 45!!
3  ????? ????, ?? ????????!

Solution (remove all digits, all trailing whitespace, and all non-word characters except whitespace and question marks):

In [36]: df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['','',''], regex=True)
Out[36]:
                      col
0             ?????? ????
1               ??? ?????
2                    ????
3  ????? ???? ?? ????????

RegEx explanation...
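
Since the Cyrillic text in the listings above did not survive encoding, here is a minimal sketch of what the three patterns do on readable input. The sample strings and the `clean` helper are assumptions for illustration, not part of the original answer, and the patterns are applied one after another with plain `re.sub` rather than through pandas. In Python 3, `str` patterns are Unicode-aware by default, so `\w` matches Cyrillic letters even without `(?u)`.

```python
import re

# Hypothetical sample strings; the original Cyrillic data was lost to encoding.
samples = ["Привет, мир", "Как дела", "Пока 23 45!!"]

# The same three patterns as in the answer, applied in order.
PATTERNS = [
    r"\d+",        # runs of digits
    r"[^\w\s?]+",  # anything except word characters, whitespace, and '?'
    r"\s*$",       # trailing whitespace
]

def clean(text):
    """Apply each pattern in turn, replacing every match with an empty string."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return text

cleaned = [clean(s) for s in samples]
print(cleaned)  # ['Привет мир', 'Как дела', 'Пока']
```

On a DataFrame, the same helper could presumably be applied per column with something like `df['col'].map(clean)`, as an alternative to passing the pattern list to `df.replace(..., regex=True)`.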

Answer 2

cs9*_*s95 5

OK, IIUC, use string.punctuation and perform the replacement with replace:

import string
data1.replace(r'[{}]'.format(string.punctuation), '', regex=True)

             0
0  ?????? ????
1     ??? ????
2         ????

where

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
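
One thing worth keeping in mind is that string.punctuation only covers ASCII punctuation. The sketch below, on a made-up sample string, compares the character class built from it with a Unicode-aware negated class; the sample text and variable names are assumptions for illustration.

```python
import re
import string

# Character class built the same way as in the snippet above (no escaping).
ascii_punct = re.compile("[{}]".format(string.punctuation))

# Hypothetical sample containing both ASCII and non-ASCII punctuation.
text = "«Привет», мир!"

ascii_only = ascii_punct.sub("", text)        # strips ',' and '!' only
unicode_aware = re.sub(r"[^\w\s]", "", text)  # also strips the guillemets

print(ascii_only)     # «Привет» мир
print(unicode_aware)  # Привет мир
```

If the column may contain typographic quotes or similar non-ASCII marks, the negated class is probably the safer choice.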

If you want to exclude a particular character or set of characters, one way is to use set.difference:

c = set(string.punctuation)
p_to_exclude = ['?', ...]

c = c.difference(p_to_exclude)

Now you can use c as before:

import re
data1.replace(r'[{}]'.format(re.escape(''.join(c))), '', regex=True)
             0
0  ?????? ????
1    ??? ?????
2         ????

Another thing to note here is the use of re.escape, since [ and ] are treated as metacharacters and need to be escaped.
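
To make the effect of escaping concrete, here is a small sketch using plain `re` on a made-up string. The names `to_keep`, `to_strip`, `escaped` and `unescaped`, and the sample text, are assumptions for illustration. Sorting the set is not required, but it makes the pattern deterministic, since set iteration order is arbitrary.

```python
import re
import string

# Keep '?' (a hypothetical choice) and strip every other ASCII punctuation mark.
to_keep = {"?"}
to_strip = set(string.punctuation).difference(to_keep)

# re.escape protects characters such as ']', '\', '^' and '-' inside the class.
escaped = re.compile("[{}]".format(re.escape("".join(sorted(to_strip)))))

# For comparison: the full string.punctuation without escaping. Here '\]' is
# read as an escaped ']', so the backslash itself is not part of the class.
unescaped = re.compile("[{}]".format(string.punctuation))

text = r"Как дела? [тест] a\b!"
with_escape = escaped.sub("", text)       # '?' survives, the backslash is removed
without_escape = unescaped.sub("", text)  # '?' is removed, the backslash survives

print(with_escape)     # Как дела? тест ab
print(without_escape)  # Как дела тест a\b
```

With an unsorted, unescaped set the risk is larger than a stray backslash: a `^` landing first would negate the class, and a `]` in the middle would close it early.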
