use*_*665 4 python unicode pandas
For a single string, the code below removes the non-ASCII characters and the newlines/carriage returns:
t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"
t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))
But when I try to write a function in pandas that applies this to every cell of a column, it either fails with an AttributeError or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame:
def clean_text(row):
    row = row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row
Applied to my dataframe:
df["text"] = df.apply(clean_text, axis=1)
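How can I apply this code to each element of a Series?
The problem seems to be that you are trying to access and alter row['text'] and then return the row itself inside the function passed to apply. When you call apply on a DataFrame without axis=1, the function is applied to each column as a Series, so changing it to this should help: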
import pandas as pd
df = pd.DataFrame([t for _ in range(5)], columns=['text'])
df
text
0 We've??????been invited to attend TEDxTeen, an ind...
1 We've??????been invited to attend TEDxTeen, an ind...
2 We've??????been invited to attend TEDxTeen, an ind...
3 We've??????been invited to attend TEDxTeen, an ind...
4 We've??????been invited to attend TEDxTeen, an ind...
def clean_text(row):
    # return the list of decoded cells in the Series instead
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]
df['text'] = df.apply(clean_text)
df
text
0 We'vebeen invited to attend TEDxTeen, an indep...
1 We'vebeen invited to attend TEDxTeen, an indep...
2 We'vebeen invited to attend TEDxTeen, an indep...
3 We'vebeen invited to attend TEDxTeen, an indep...
4 We'vebeen invited to attend TEDxTeen, an indep...
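The snippets above call .decode on the string itself, which only works on Python 2 byte strings. On Python 3 a str has no decode method, so the same code raises an AttributeError; that may be the attribute error mentioned in the question, although the question does not say which Python version was used. A minimal Python 3 sketch of the same cleaning step, where to_ascii is a hypothetical helper name and sample is made-up input:

```python
def to_ascii(text):
    """Drop every non-ASCII character from a Python 3 str and trim whitespace."""
    # encode(..., 'ignore') silently discards characters with no ASCII form;
    # decode('ascii') turns the resulting bytes back into a str.
    return text.encode('ascii', 'ignore').decode('ascii').strip()


# Made-up sample in the spirit of the string from the question.
sample = "We've\xe5\xcabeen invited to \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions.\r\n"
cleaned = to_ascii(sample)
print(cleaned)
```

A helper like this could then be passed to a single column, for example df['text'].apply(to_ascii).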
Alternatively, you can use a lambda as follows and apply it directly to the text column only:
df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                          encode('ascii', 'ignore').\
                                          strip())
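The same per-column idea can also be written without a lambda, using the string accessor on the column. This is only a sketch for Python 3, with a made-up two-row DataFrame; it assumes the column holds ordinary str values:

```python
import pandas as pd

# Made-up data: one value with a non-ASCII character and a trailing newline,
# one value that is already plain ASCII.
df = pd.DataFrame({"text": ["caf\xe9 au lait\r\n", "  plain ascii  "]})

# Series.str.encode and Series.str.decode mirror the built-in
# str.encode / bytes.decode and work element by element.
df["text"] = (
    df["text"]
    .str.encode("ascii", errors="ignore")  # str -> bytes, dropping non-ASCII
    .str.decode("ascii")                   # bytes -> str
    .str.strip()                           # trim surrounding whitespace
)
print(df["text"].tolist())
```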
I actually can't reproduce your error: the following code runs for me without errors or warnings.
df = pd.DataFrame([t,t,t],columns = ['text'])
df["text"] = df.apply(clean_text, axis=1)
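The question does not show how df was built, so the following is a guess rather than a diagnosis: in older pandas versions, the "copy of a slice" warning (SettingWithCopyWarning) typically shows up when df was itself obtained by filtering or slicing another DataFrame. Taking an explicit copy before assigning is one common way to avoid it. The names raw, keep and subset are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame(
    {"text": ["caf\xe9", "na\xefve", "ok"], "keep": [True, True, False]}
)

# Filtering derives a new object from `raw`; the explicit .copy() makes it
# unambiguous that later assignments should only affect `subset`.
subset = raw[raw["keep"]].copy()

subset["text"] = subset["text"].apply(
    lambda s: s.encode("ascii", "ignore").decode("ascii")
)
print(subset["text"].tolist())
```

Whether the warning appears at all without the copy depends on the pandas version in use.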
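If it helps, I think a more "pandas" way of approaching this kind of problem might be to use a regular expression with one of the Series.str methods, for example: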
df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)
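As a self-contained sketch of the regex approach that also removes the newlines and carriage returns asked about in the question (the two-row DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"text": ["We've\xe5\xcabeen invited\r\n", "plain ascii"]})

# regex=True is passed explicitly: since pandas 2.0, Series.str.replace
# treats the pattern as a literal string unless told otherwise.
df["text"] = (
    df["text"]
    .str.replace(r"[^\x00-\x7F]", "", regex=True)  # drop non-ASCII characters
    .str.replace(r"[\r\n]", "", regex=True)        # drop newlines / carriage returns
)
print(df["text"].tolist())
```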