How to extract a single word (not part of a larger word) in a pandas DataFrame?

Mil*_*001 1 python regex pandas

I want to extract words like this:

a dog ==> dog
some dogs ==> dog
dogmatic ==> None

There is a similar question: "Extract substring from text in a pandas DataFrame as new column".

However, it does not meet my requirements.

Given this DataFrame:

import pandas as pd

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                               'C likes cats.', 'D likes cat!',
                               'E is educated',
                               'F is catholic',
                               'G likes cat, he has three of them.',
                               'H likes cat; he has four of them.',
                               'I adore !!cats!!',
                               'x is dogmatic',
                               'x is eating hotdogs.',
                               'x likes dogs, he has three of them.',
                               'x likes dogs; he has four of them.',
                               'x adores **dogs**'
                               ]})

How do I get the correct output?

                            comment      label EXTRACT
0                           A likes cat   cat     cat
1                          B likes Cats   cat     cat
2                         C likes cats.   cat     cat
3                          D likes cat!   cat     cat
4                         E is educated  None     cat
5                         F is catholic  None     cat
6    G likes cat, he has three of them.   cat     cat
7     H likes cat; he has four of them.   cat     cat
8                      I adore !!cats!!   cat     cat
9                         x is dogmatic  None     dog
10                 x is eating hotdogs.  None     dog
11  x likes dogs, he has three of them.   dog     dog
12   x likes dogs; he has four of them.   dog     dog
13                    x adores **dogs**   dog     dog

Note: the EXTRACT column gives the wrong answer; I need the result to look like the label column.


Erf*_*fan 6

We can use str.extract with a negative lookahead, (?!...). It checks that the match is not followed by two or more letters, which rejects words such as dogmatic.
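To see the lookahead on its own, here is a minimal sketch using the standard re module on plain strings. The helper name first_keyword is hypothetical and only serves this illustration; the pattern is the one used in this answer with the two keywords written out.

```python
import re

# A keyword that is not followed by two or more letters.
LOOKAHEAD = re.compile(r'(?i)(cat|dog)(?![A-Za-z]{2,})')

def first_keyword(text):
    """Return the first keyword the lookahead pattern accepts, or None."""
    match = LOOKAHEAD.search(text)
    return match.group(1) if match else None

print(first_keyword('x is dogmatic'))         # None: "dog" is followed by "matic"
print(first_keyword('x likes dogs'))          # dog: only one letter follows
print(first_keyword('x is eating hotdogs.'))  # dog: the lookahead ignores what comes before
```

The last call shows that the lookahead alone still accepts hotdogs, because it only inspects the characters after the keyword.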

After that, we use np.where with a positive lookbehind, (?<=...). The logic is as follows:

Every row in which "dog" or "cat" is preceded by a word character (\w), as in hotdogs, is replaced with NaN.

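import numpy as np

words = ['cat', 'dog']

# Capture a keyword unless two or more letters follow it (e.g. "dogmatic").
df['label'] = df['comment'].str.extract(r'(?i)(' + '|'.join(words) + r')(?![A-Za-z]{2,})', expand=False)
# Set the label to NaN where a keyword is preceded by a word character (e.g. "hotdogs").
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])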

Output

                                comment label
0                           A likes cat   cat
1                          B likes Cats   Cat
2                         C likes cats.   cat
3                          D likes cat!   cat
4                         E is educated   NaN
5                         F is catholic   NaN
6    G likes cat, he has three of them.   cat
7     H likes cat; he has four of them.   cat
8                      I adore !!cats!!   cat
9                         x is dogmatic   NaN
10                 x is eating hotdogs.   NaN
11  x likes dogs, he has three of them.   dog
12   x likes dogs; he has four of them.   dog
13                    x adores **dogs**   dog
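As a possible alternative, not part of the answer above, the same labels might be obtained in one pass with word boundaries. This is a sketch under two assumptions: the only inflection to tolerate is a trailing plural "s", and the label should be lowercase as in the question's label column. It is stricter than the lookahead approach, which accepts any single trailing letter.

```python
import pandas as pd

# A small subset of the question's data, for illustration only.
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats', 'F is catholic',
                               'x is eating hotdogs.', 'x adores **dogs**']})

words = ['cat', 'dog']
# \b requires the keyword to start a word; s?\b allows an optional
# plural "s" and then requires the word to end.
pattern = r'(?i)\b(' + '|'.join(words) + r')s?\b'

# expand=False returns a Series; str.lower() normalizes "Cat" to "cat".
df['label'] = df['comment'].str.extract(pattern, expand=False).str.lower()
print(df)
```

Rows such as "F is catholic" and "x is eating hotdogs." get NaN because the keyword either does not end at a word boundary or does not start at one.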