Mil*_*001 1 python regex pandas
I want to extract words like this:
a dog ==> dog
some dogs ==> dog
dogmatic ==> None
There is a similar question: Extract substring from text in a pandas DataFrame as new column
But it does not meet my requirements.
Starting from this DataFrame:
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
'C likes cats.', 'D likes cat!',
'E is educated',
'F is catholic',
'G likes cat, he has three of them.',
'H likes cat; he has four of them.',
'I adore !!cats!!',
'x is dogmatic',
'x is eating hotdogs.',
'x likes dogs, he has three of them.',
'x likes dogs; he has four of them.',
'x adores **dogs**'
]})
How can I get the correct output (the label column below)?
    comment                              label  EXTRACT
0   A likes cat                          cat    cat
1   B likes Cats                         cat    cat
2   C likes cats.                        cat    cat
3   D likes cat!                         cat    cat
4   E is educated                        None   cat
5   F is catholic                        None   cat
6   G likes cat, he has three of them.   cat    cat
7   H likes cat; he has four of them.    cat    cat
8   I adore !!cats!!                     cat    cat
9   x is dogmatic                        None   dog
10  x is eating hotdogs.                 None   dog
11  x likes dogs, he has three of them.  dog    dog
12  x likes dogs; he has four of them.   dog    dog
13  x adores **dogs**                    dog    dog
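We can use str.extract with a negative lookahead, (?!...), to check that the match is not followed by two or more letters. For example, dogmatic is rejected because "dog" is followed by "matic", while dogs is accepted because only a single "s" follows.
After that, we use np.where with a positive lookbehind, (?<=...). The logic is as follows:
Every row in which "dog" or "cat" is directly preceded by a word character (\w), as in hotdogs, has its label replaced with NaN.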
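To see how the two lookarounds behave before applying them to a DataFrame, here is a minimal sketch that runs the same two patterns on single strings with Python's re module. The helper name label and the sample strings are assumptions made for illustration; they are not part of the original answer.

```python
import re

# The same two patterns as in the answer, applied to single strings.
# Negative lookahead: the word must not be followed by two or more letters.
extract_re = re.compile(r'(?i)(cat|dog)(?![A-Za-z]{2,})')
# Positive lookbehind: true where "dog"/"cat" is preceded by a word character.
preceded_re = re.compile(r'(?<=\wdog)|(?<=\wcat)')

def label(text):
    """Hypothetical helper: return the matched word, or None if rejected."""
    m = extract_re.search(text)
    if m is None or preceded_re.search(text):
        return None
    return m.group(1)

for text in ['some dogs', 'x is dogmatic', 'x is eating hotdogs.', 'B likes Cats']:
    print(text, '->', label(text))
```

Note that the match keeps its original case, so "B likes Cats" yields "Cat" rather than "cat".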
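words = ['cat', 'dog']
# Case-insensitive match that must not be followed by two or more letters
pattern = r'(?i)(' + '|'.join(words) + r')(?![A-Za-z]{2,})'
df['label'] = df['comment'].str.extract(pattern, expand=False)
# Set the label to NaN where "dog"/"cat" is preceded by a word character (e.g. "hotdogs")
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])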
Output:
    comment                              label
0   A likes cat                          cat
1   B likes Cats                         Cat
2   C likes cats.                        cat
3   D likes cat!                         cat
4   E is educated                        NaN
5   F is catholic                        NaN
6   G likes cat, he has three of them.   cat
7   H likes cat; he has four of them.    cat
8   I adore !!cats!!                     cat
9   x is dogmatic                        NaN
10  x is eating hotdogs.                 NaN
11  x likes dogs, he has three of them.  dog
12  x likes dogs; he has four of them.   dog
13  x adores **dogs**                    dog
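Row 1 comes out as "Cat" because str.extract keeps the original case, whereas the question asks for "cat". One possible variation, not taken from the original answer, is to fold both checks into a single pattern and lowercase the result. The sketch below assumes that "preceded by a letter" is the intended rejection rule (the answer's \w also covers digits and underscores) and uses a shortened sample of the question's data.

```python
import pandas as pd

# Shortened sample of the question's data (assumption: enough to show each case)
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats', 'I adore !!cats!!',
                               'E is educated', 'x is dogmatic',
                               'x is eating hotdogs.', 'x adores **dogs**']})

words = ['cat', 'dog']
# (?<![A-Za-z])   : no letter directly before the word
# (?![A-Za-z]{2,}): not followed by two or more letters (a plural "s" is allowed)
pattern = r'(?i)(?<![A-Za-z])(' + '|'.join(words) + r')(?![A-Za-z]{2,})'

# expand=False returns a Series; str.lower() turns "Cat" into "cat", NaN stays NaN
df['label'] = df['comment'].str.extract(pattern, expand=False).str.lower()
print(df)
```

Unlike the two-step version, this variant still labels a row that contains both a rejected word and a valid one (for example "hotdog" and "dog" in the same comment), which may or may not be what you want.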