Replace punctuation in a data frame based on a punctuation list

Question

Ber*_*rdL 5 python large-data dataframe pandas

Using Canopy and Pandas, I have a data frame a, defined as follows:

import pandas as pd

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]

text.txt is a single-column file containing a list of strings made up of text, numbers, and punctuation.

Suppose df looks like:

test

%HGH&12

ABC123!

porkyfries

I would like my result to be:

test

hgh12

ABC123

porkyfries

Effort so far:

from string import punctuation  # import the punctuation list from Python itself
import pandas as pd

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
    df2

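The commands above basically just return the same dataset. Any leads are appreciated.

Edit: The reason I am using Pandas is that the data is large, about 1M rows, and the same code will later be applied to lists of up to 30M rows. In short, I need a very efficient way to clean the data for large datasets.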
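A likely reason the loop in the question appears to do nothing: every pass calls replace on the original column again, so the result of the previous pass is thrown away and only the last punctuation character is ever removed. The sketch below carries the result forward between passes instead. The sample data and the variable name `cleaned` are illustrative assumptions, and `regex=False` is passed so that characters such as `.` or `(` are treated literally.

```python
from string import punctuation

import pandas as pd

# Illustrative sample data matching the example in the question
df = pd.DataFrame({'test': ['%HGH&12', 'ABC123!', 'porkyfries']})

# Start from the original column and keep the result of every pass
cleaned = df['test']
for p in punctuation:
    # regex=False: treat each punctuation character as a literal string
    cleaned = cleaned.str.replace(p, '', regex=False)

df['test'] = cleaned
```

This makes one pass over the column per punctuation character, so for millions of rows a single regex or translation-table pass is probably the better fit.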

Answer 1

小智 6

To remove punctuation from a text column in a data frame:

In:

import re
import string

import pandas as pd

rem = string.punctuation
# Build a regex character class from every ASCII punctuation character
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'
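One caveat with a pattern built by plain formatting: inside the character class, the sequence `\]` is read as an escaped closing bracket, so a literal backslash is not part of the class and is left in the text. A minimal sketch of a variant that escapes the punctuation first with `re.escape` is shown below. The names `naive_pattern`, `escaped_pattern`, and `sample` are illustrative and not part of the original answer.

```python
import re
import string

# Plain formatting: the "\]" inside the class is parsed as an escaped "]",
# so the backslash itself is not one of the matched characters
naive_pattern = "[{}]".format(string.punctuation)

# Escaping first makes every punctuation character a literal class member
escaped_pattern = "[{}]".format(re.escape(string.punctuation))

sample = r"back\slash [brackets] and-dash."
naive_result = re.sub(naive_pattern, "", sample)      # backslash survives
escaped_result = re.sub(escaped_pattern, "", sample)  # backslash removed
```

If the data never contains backslashes, the two patterns behave the same on it.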

In:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick %

In:

# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df

You can substitute any replacement string for the matched characters, e.g. replace(pattern, '$', regex=True).

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick
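For the row counts mentioned in the question, a translation table is a commonly used alternative to a regex for deleting a fixed set of single characters. The sketch below uses `str.maketrans` with `Series.str.translate`. The sample data and the name `table` are assumptions for illustration, and whether this is actually faster on a given dataset should be measured rather than assumed.

```python
import string

import pandas as pd

# Map every ASCII punctuation character to None, i.e. delete it
table = str.maketrans('', '', string.punctuation)

# Illustrative sample data matching the example in the question
df = pd.DataFrame({'test': ['%HGH&12', 'ABC123!', 'porkyfries']})

# Apply the table to every string in the column in one vectorised call
df['test'] = df['test'].str.translate(table)
```

Unlike the regex approach, this needs no escaping, because the characters are never interpreted as a pattern.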


Answer 2

EdC*_*ica 5

It is easier to use replace with the right regular expression:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regular expression with the pattern [^\w\s], which matches any character that is not a word character (letter, digit, or underscore) or whitespace:

In [49]:

df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
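Note that `\w` counts the underscore as a word character, so `[^\w\s]` leaves underscores in place, and on Python 3 strings it also keeps non-ASCII letters. If underscores should go as well, one option is to add them to the pattern explicitly, as in the sketch below. The sample series and variable names are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample data, including an underscore
s = pd.Series(['%hgh&12', 'snake_case!', 'a b,c'])

# Removes everything except word characters and whitespace; "_" survives
no_punct = s.str.replace(r'[^\w\s]', '', regex=True)

# Same pattern with the underscore added as an alternative
no_punct_or_underscore = s.str.replace(r'[^\w\s]|_', '', regex=True)
```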

