How to extract a single word (not part of a larger word) in a pandas DataFrame?

Question

I want to extract words like this:

a dog ==> dog
some dogs ==> dog
dogmatic ==> None

There is a similar question: Extract substring from text in a pandas DataFrame as a new column

But it does not meet my requirements.

From this DataFrame:

import pandas as pd

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                               'C likes cats.', 'D likes cat!',
                               'E is educated',
                               'F is catholic',
                               'G likes cat, he has three of them.',
                               'H likes cat; he has four of them.',
                               'I adore !!cats!!',
                               'x is dogmatic',
                               'x is eating hotdogs.',
                               'x likes dogs, he has three of them.',
                               'x likes dogs; he has four of them.',
                               'x adores **dogs**'
                               ]})

How can I get the correct output?

                            comment      label EXTRACT
0                           A likes cat   cat     cat
1                          B likes Cats   cat     cat
2                         C likes cats.   cat     cat
3                          D likes cat!   cat     cat
4                         E is educated  None     cat
5                         F is catholic  None     cat
6    G likes cat, he has three of them.   cat     cat
7     H likes cat; he has four of them.   cat     cat
8                      I adore !!cats!!   cat     cat
9                         x is dogmatic  None     dog
10                 x is eating hotdogs.  None     dog
11  x likes dogs, he has three of them.   dog     dog
12   x likes dogs; he has four of them.   dog     dog
13                    x adores **dogs**   dog     dog

Note: the EXTRACT column gives the wrong answer; I need the result to look like the label column.

Answer 1

Erf*_*fan 6

We can use str.extract with a negative lookahead, (?!...). It checks that the match is not followed by two or more letters, which rules out a word such as dogmatic.
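
To see the lookahead in isolation, here is a small sketch using Python's re module on plain strings rather than a DataFrame. The sample strings come from the question; the helper name first_match is only for illustration.

```python
import re

# Same pattern the answer builds: match cat/dog (case-insensitive)
# unless the match is followed by two or more letters.
pattern = re.compile(r'(?i)(cat|dog)(?![A-Za-z]{2,})')

def first_match(text):
    """Return the first matched word, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(first_match('some dogs'))             # dog
print(first_match('x is dogmatic'))         # None
print(first_match('x is eating hotdogs.'))  # dog
```

The last call shows why the lookahead alone is not enough: hotdogs still yields dog, so a second check on the characters before the match is needed.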

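After that, we use np.where with a positive lookbehind, (?<=...). The pseudo-logic is as follows:

All rows in which "dog" or "cat" is directly preceded by a word character are replaced with NaN.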
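
The lookbehind check can also be tried on plain strings first. This is a minimal sketch with the re module; the helper name is_embedded is hypothetical. Note that \w matches letters, digits, and the underscore, not only letters.

```python
import re

# A position matches when it sits right after a word character
# followed by "dog" or "cat" (case-sensitive, as in the answer).
glued = re.compile(r'(?<=\wdog)|(?<=\wcat)')

def is_embedded(text):
    """True if "dog" or "cat" has a word character directly before it."""
    return glued.search(text) is not None

print(is_embedded('x is eating hotdogs.'))  # True
print(is_embedded('x likes dogs'))          # False
print(is_embedded('E is educated'))         # True
```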

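import numpy as np

words = ['cat', 'dog']

# Extract cat/dog (case-insensitive) unless it is followed by two or more letters
df['label'] = df['comment'].str.extract(r'(?i)(' + '|'.join(words) + r')(?![A-Za-z]{2,})', expand=False)
# Set rows to NaN where the word is glued to a preceding word character (e.g. hotdogs)
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])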

Output

                                comment label
0                           A likes cat   cat
1                          B likes Cats   Cat
2                         C likes cats.   cat
3                          D likes cat!   cat
4                         E is educated   NaN
5                         F is catholic   NaN
6    G likes cat, he has three of them.   cat
7     H likes cat; he has four of them.   cat
8                      I adore !!cats!!   cat
9                         x is dogmatic   NaN
10                 x is eating hotdogs.   NaN
11  x likes dogs, he has three of them.   dog
12   x likes dogs; he has four of them.   dog
13                    x adores **dogs**   dog
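
As an alternative sketch that is not part of the answer above: if only the bare word and a plain "s" plural need to be recognized (an assumption made here), word boundaries can express the same intent in a single pattern, and str.lower() normalizes "Cat" to "cat" as in the requested label column. The shortened DataFrame below reuses a few rows from the question.

```python
import pandas as pd

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats', 'E is educated',
                               'x is dogmatic', 'x is eating hotdogs.',
                               'x adores **dogs**']})
words = ['cat', 'dog']

# \b on both sides means the word must stand alone; s? allows a simple plural.
pattern = r'(?i)\b(' + '|'.join(words) + r')s?\b'

# expand=False returns a Series; rows without a match stay NaN.
df['label'] = df['comment'].str.extract(pattern, expand=False).str.lower()
print(df)
```

This trades the two-step lookahead and lookbehind logic for one regex, at the cost of not handling other suffixes unless they are added to the pattern explicitly.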
