ali*_*naz 1 python regex dataframe pandas
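I have a large dataframe containing text, and I want to find matches in it from a list of words (about 1k words).
I have managed to get the absence/presence of a list word for each row of the dataframe, but it is also important for me to know which word matched. Sometimes a row exactly matches more than one word from the list, and I would like to get all of them.
I tried the code below, but it gives me partial matches: syllables rather than full words.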
#this is a code to recreate the initial DF
import pandas as pd
df_data= [['orange','0'],
['apple and lemon','1'],
['lemon and orange','1']]
# each row has two values, so only two column names are passed here
df= pd.DataFrame(df_data,columns=['text','match'])
Initial DF:
text match
orange 0
apple and lemon 1
lemon and orange 1
This is the list of words I need to match:
exactmatch = ['apple', 'lemon']
Expected result:
text match exact words
orange 0 0
apple and lemon 1 'apple','lemon'
lemon and orange 1 'lemon'
This is what I have tried:
# for some rows it gives me words I want,
#and for some it gives me parts of the word
#regex attempt 1, gives me partial matches (syllables or single letters)
pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch)
+")", expand=False)
#regex attempt 2 - this gives me an error - unexpected EOL
df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)
#TypeError: ('sequence item 1: expected str instance, float found',
#'occurred at index 2')
#no regex attempt, does not give me matches if the word is in there
lst = list(df['text'])
match = []
for w in lst:
    if w in exactmatch:
        match.append(w)
        break
Use str.findall
Ex:
import pandas as pd
exactmatch = ['apple', 'lemon']
df_data= [['orange'],['apple and lemon',],['lemon and orange'],]
df= pd.DataFrame(df_data,columns=['text'])
df['exact word'] = df["text"].str.findall(r"|".join(exactmatch)).apply(", ".join)
print(df)
Output:
text exact word
0 orange
1 apple and lemon apple, lemon
2 lemon and orange lemon
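Note that a plain alternation such as apple|lemon also matches inside longer words, which is the partial-match problem the question describes. A minimal sketch of one way to restrict findall to whole words, assuming the words in the list are plain strings: wrap the alternation in \b word boundaries and escape each word with re.escape. The fourth row ('pineapple lemonade') and the name pattern are hypothetical, added only to show that embedded substrings are no longer picked up.

```python
import re
import pandas as pd

exactmatch = ['apple', 'lemon']
# The last row is a hypothetical example, not part of the original data
df = pd.DataFrame({'text': ['orange',
                            'apple and lemon',
                            'lemon and orange',
                            'pineapple lemonade']})

# \b requires a word boundary on both sides of the match;
# re.escape protects any regex metacharacters in the word list
pattern = r"\b(?:" + "|".join(map(re.escape, exactmatch)) + r")\b"

# findall returns a list of every match per row; join it into one string
df['exact word'] = df['text'].str.findall(pattern).apply(", ".join)
print(df)
```

With this pattern, rows that only contain a list word as part of a longer word end up with an empty string, just like rows with no match at all.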