How do I extract exact matches against a list from a DataFrame column?

Question

ali*_*naz 1 python regex dataframe pandas

I have a large DataFrame containing text, and I want to find matches in it from a list of words (the list has about 1k words).

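I have managed to get the absence/presence of a listed word for each row of the DataFrame, but it is also important for me to know which word matched. Sometimes a row exactly matches more than one word from the list, and I would like to get all of them.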

I tried the code below, but it gives me partial matches: syllables instead of whole words.

#this is a code to recreate the initial DF

import pandas as pd

df_data= [['orange','0'],
['apple and lemon','1'],
['lemon and orange','1']]

# two columns only: the data rows have two values each
df = pd.DataFrame(df_data, columns=['text', 'match'])

Initial DataFrame:

 text                 match
 orange               0
 apple and lemon      1
 lemon and orange     1

This is the list of words I need to match:

 exactmatch = ['apple', 'lemon']

Expected result:

 text                    match  exact words
 orange                    0         0 
 apple and lemon           1        'apple','lemon'
 lemon and orange          1        'lemon'

This is what I have tried:

# for some rows it gives me words I want, 
#and for some it gives me parts of the word

#regex attempt 1, gives me partial matches (syllables or single letters)

pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch) 
+")", expand=False)

#regex attempt 2 - this gives me an error - unexpected EOL

df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)

#TypeError: ('sequence item 1: expected str instance, float found', 
#'occurred at index 2')

#no regex attempt, does not give me matches if the word is in there

lst = list(df['text'])
match = []
for w in lst:
 if w in exactmatch:
    match.append(w)
    break

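A note on the no-regex attempt: `w in exactmatch` is true only when the entire text of a row equals one of the list entries, and the `break` stops at the first hit, so rows such as 'apple and lemon' never match. A minimal sketch of a token-based variant follows. It assumes the words are separated by whitespace, carry no punctuation, and should be compared case-sensitively; the helper name `matched_words` is made up for illustration.

```python
import pandas as pd

exactmatch = ['apple', 'lemon']
df = pd.DataFrame({'text': ['orange', 'apple and lemon', 'lemon and orange']})

# A set makes each membership test cheap, which matters with ~1k words.
wanted = set(exactmatch)

def matched_words(text):
    # Hypothetical helper: split on whitespace and keep, in order,
    # every token that is exactly one of the wanted words.
    return [token for token in text.split() if token in wanted]

df['exact word'] = df['text'].apply(matched_words)
print(df)
```

Each cell of `exact word` then holds a list (empty when nothing matched), so whole words are compared rather than substrings.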

Answer 1

Rak*_*esh 5

Use str.findall

Example:

import pandas as pd

exactmatch = ['apple', 'lemon']
df_data = [['orange'], ['apple and lemon'], ['lemon and orange']]

df = pd.DataFrame(df_data, columns=['text'])
# findall returns a list of every match per row; join each list into one string
df['exact word'] = df["text"].str.findall(r"|".join(exactmatch)).apply(", ".join)
print(df)

Output:

               text    exact word
0            orange              
1   apple and lemon  apple, lemon
2  lemon and orange         lemon

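One caveat: a plain `"|".join(exactmatch)` pattern also matches inside longer words, so 'pineapple' would yield 'apple', which is the partial-match problem described in the question. A minimal sketch of a stricter variant is below, assuming "exact match" means a whole word delimited by regex word boundaries (`\b`). The extra 'pineapple juice' row is made-up test data, not part of the original DataFrame.

```python
import re
import pandas as pd

exactmatch = ['apple', 'lemon']
df = pd.DataFrame({'text': ['orange',
                            'apple and lemon',
                            'lemon and orange',
                            'pineapple juice']})  # hypothetical extra row

# \b requires a word boundary on both sides of the match;
# re.escape protects any regex metacharacters in the word list.
pattern = r'\b(?:' + '|'.join(map(re.escape, exactmatch)) + r')\b'

found = df['text'].str.findall(pattern)          # list of matched words per row
df['match'] = found.str.len().gt(0).astype(int)  # 1 if at least one word matched
df['exact word'] = found.str.join(', ')          # '' when nothing matched
print(df)
```

Here 'pineapple juice' gets `match` 0 and an empty `exact word`, while the other rows behave as in the output above. If case should be ignored, `str.findall` also accepts `flags=re.IGNORECASE`.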

Archived: 6 years, 10 months ago
Views: 4,580
Last recorded: 4 years, 5 months ago