N08*_*N08 5 python dataframe pandas
I have a DataFrame with one sentence per row. I need to search these sentences for occurrences of certain words. This is how I currently do it:
import pandas as pd
p = pd.DataFrame({"sentence" : ["this is a test", "yet another test", "now two tests", "test a", "no test"]})
test_words = ["yet", "test"]
p["word_test"] = ""
p["word_yet"] = ""
for i in range(len(p)):
    for word in test_words:
        # use a single .loc[row, column] call; chained p.loc[i][col] may assign to a temporary copy
        p.loc[i, "word_" + word] = p.loc[i, "sentence"].find(word)
This works as expected, but can it be optimized? It runs quite slowly on large DataFrames.
IIUC, use a simple dict comprehension and call str.find once per word:
# df is the question's DataFrame (named p there)
u = pd.DataFrame({
    # without f-strings: 'word_{}'.format(w)
    f'word_{w}': df.sentence.str.find(w) for w in test_words}, index=df.index)
u
   word_yet  word_test
0        -1         10
1         0         12
2        -1          8
3        -1          0
4        -1          3
pd.concat([df, u], axis=1)
           sentence  word_yet  word_test
0    this is a test        -1         10
1  yet another test         0         12
2     now two tests        -1          8
3            test a        -1          0
4           no test        -1          3
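For reference, here is a self-contained sketch of the same idea using the question's data and its DataFrame name p. The names positions, result and found are only illustrative and do not appear in the original posts. Since str.find returns -1 when the word is absent, comparing with 0 gives a presence flag.

```python
import pandas as pd

p = pd.DataFrame({"sentence": ["this is a test", "yet another test",
                               "now two tests", "test a", "no test"]})
test_words = ["yet", "test"]

# One vectorized str.find call per word; -1 means the word does not occur.
positions = pd.DataFrame(
    {f"word_{w}": p["sentence"].str.find(w) for w in test_words},
    index=p.index,
)

# Attach the position columns to the original frame.
result = pd.concat([p, positions], axis=1)

# Illustrative follow-up: True where the word occurs anywhere in the sentence.
found = positions >= 0

print(result)
print(found)
```

Note that str.find matches substrings, so "test" is also found inside "tests".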
You can use str.find:
p['word_test'] = p.sentence.str.find('test')
p['word_yet'] = p.sentence.str.find('yet')
           sentence  word_test  word_yet
0    this is a test         10        -1
1  yet another test         12         0
2     now two tests          8        -1
3            test a          0        -1
4           no test          3        -1
Since you mentioned wanting better performance, you can also use np.char.find:
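import numpy as np
# one row per word, one column per sentence; transposed below before concatenating
df = pd.DataFrame(data=[np.char.find(p.sentence.values.astype(str), x) for x in test_words], index=test_words, columns=p.index)
pd.concat([p, df.T], axis=1)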
Out[32]:
           sentence  yet  test
0    this is a test   -1    10
1  yet another test    0    12
2     now two tests   -1     8
3            test a   -1     0
4           no test   -1     3
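To check whether np.char.find actually helps on your data, a rough sketch like the one below compares it with str.find. The helper names with_pandas and with_numpy, the repetition factor of 1000, and number=20 are arbitrary choices for illustration. Timings depend on the data and on the pandas/NumPy versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

sentences = ["this is a test", "yet another test", "now two tests", "test a", "no test"]
# Repeat the sample rows so the timing is not dominated by call overhead.
p = pd.DataFrame({"sentence": sentences * 1000})
test_words = ["yet", "test"]


def with_pandas(frame):
    # Vectorized string method on the Series, one call per word.
    return {w: frame["sentence"].str.find(w).to_numpy() for w in test_words}


def with_numpy(frame):
    # Convert once to a fixed-width unicode array, then search with NumPy.
    values = frame["sentence"].to_numpy().astype(str)
    return {w: np.char.find(values, w) for w in test_words}


pandas_result = with_pandas(p)
numpy_result = with_numpy(p)

# Both approaches should report identical positions.
same = all(np.array_equal(pandas_result[w], numpy_result[w]) for w in test_words)
print("results match:", same)

print("pandas:", timeit.timeit(lambda: with_pandas(p), number=20))
print("numpy: ", timeit.timeit(lambda: with_numpy(p), number=20))
```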