Create a new column by looking for exact words in a string column

Question

Sam*_*Sam 3 python string dataframe python-3.x pandas

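I want to create a new column containing 1 or 0, depending on whether any word in a list appears as an exact (whole-word) match in a DataFrame string column.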

list_provided=["mul","the"]
#how my dataframe looks
id  text
a    simultaneous there the
b    simultaneous there
c    mul why

Expected output

id  text                     found
a    simultaneous there the   1
b    simultaneous there       0
c    mul why                  1

The second row is assigned 0 because neither "mul" nor "the" appears as an exact match in the string column "text".

Code I have tried so far

#For exact match I am using the below code
data["Found"]=np.where(data["text"].str.contains(r'(?:\s|^)penalidades(?:\s|$)'),1,0)

How can I loop over the provided list so that exact matches are found for every word in it?

Edit: If I use str.contains(pattern) as suggested by Georgey, every row of data["Found"] becomes 1

import numpy as np
import pandas as pd

data=pd.DataFrame({"id":("a","b","c","d"), "text":("simultaneous there the","simultaneous there","mul why","mul")})
list_of_word=["mul","the"]
pattern = '|'.join(list_of_word)
data["Found"]=np.where(data["text"].str.contains(pattern),1,0)

Output:
id  text                     found
a    simultaneous there the   1
b    simultaneous there       1
c    mul why                  1
d    mul                      1

The second row of the Found column here should be 0
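A note on why every row matches: str.contains performs a regex search anywhere in the string, so "mul" is found inside "simultaneous" and "the" inside "there". As a minimal sketch (not part of the original post), one option is to wrap the joined alternation in \b word boundaries; re.escape is added here as a precaution in case a word contains regex metacharacters. Note that \b also treats punctuation as a boundary, so "the," would count as a match, which differs from the whitespace-only pattern shown earlier.

```python
import re

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "id": ("a", "b", "c", "d"),
    "text": ("simultaneous there the", "simultaneous there", "mul why", "mul"),
})
list_of_word = ["mul", "the"]

# Escape each word, then wrap the alternation in \b word boundaries so that
# "mul" no longer matches inside "simultaneous" and "the" no longer matches
# inside "there". The group is non-capturing to avoid a pandas warning.
pattern = r"\b(?:" + "|".join(map(re.escape, list_of_word)) + r")\b"

data["Found"] = np.where(data["text"].str.contains(pattern), 1, 0)
found = data["Found"].tolist()
```

With this pattern the second row is the only one that evaluates to 0.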

Answer 1

jpp*_*jpp 5

You can do this with pd.Series.apply and sum over a generator expression:

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'text': ['simultaneous there the', 'simultaneous there', 'mul why']})

test_set = {'mul', 'the'}

df['found'] = df['text'].apply(lambda x: sum(i in test_set for i in x.split()))

#   id                    text  found
# 0  a  simultaneous there the      1
# 1  b      simultaneous there      0
# 2  c                 mul why      1

The above gives a count, so a row containing both words would score 2. If you only need a Boolean, use any:

df['found'] = df['text'].apply(lambda x: any(i in test_set for i in x.split()))

For an integer representation, chain .astype(int).
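Putting the two pieces together, a minimal sketch of the chained version might look like this, reusing the same df and test_set as above:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'text': ['simultaneous there the', 'simultaneous there', 'mul why']})

test_set = {'mul', 'the'}

# any() yields True/False per row; astype(int) converts that to 1/0.
df['found'] = (
    df['text']
    .apply(lambda x: any(i in test_set for i in x.split()))
    .astype(int)
)
```

Because x.split() splits on whitespace only, a token such as "the," with trailing punctuation would not be counted as a match here.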
