How to add a tag to negated words in strings that follow "not", "no" and "never"

sen*_*on5 2 python regex python-2.7 sentiment-analysis

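How can I add the tag NEG_ to every word that follows not, no and never, up to the next punctuation mark in the string (for use in sentiment analysis)? I assume regular expressions could be used, but I'm not sure how.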

Input:
It was never going to work, he thought. He did not play so well, so he had to practice some more.

Desired output:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.

Any idea how to solve this?

Rob*_*bin 7

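To make up for some Perl features that Python's re regex engine lacks, you can pass a lambda expression to the re.sub function to build the replacement dynamically: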

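import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
# Outer pattern: a negation keyword, then words and whitespace up to the next punctuation mark.
# The lambda prefixes NEG_ to every whitespace-preceded word inside each such match.
transformed = re.sub(
    r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
    lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
    string,
    flags=re.IGNORECASE)
print(transformed)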

The result, transformed, is:

It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
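To see the two stages separately, the following sketch runs the outer pattern with re.findall and then applies the inner substitution to each span it found. The variable names (text, scope, spans, tagged) are only illustrative and are not part of the answer's code.

```python
import re

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
scope = r'\b(?:not|never|no)\b[\w\s]+[^\w\s]'

# Stage 1: the spans the outer pattern would hand to the replacement function.
# The pattern has no capturing group, so findall returns the full matches.
spans = re.findall(scope, text, flags=re.IGNORECASE)
print(spans)   # ['never going to work,', 'not play so well,']

# Stage 2: what the inner substitution returns for each of those spans.
tagged = [re.sub(r'(\s+)(\w+)', r'\1NEG_\2', span) for span in spans]
print(tagged)  # ['never NEG_going NEG_to NEG_work,', 'not NEG_play NEG_so NEG_well,']
```

The negation keyword itself is left untagged because it sits at the start of each span and so has no whitespace in front of it.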

Explanation

  • The first step is to select the part of the string you are interested in. This is done with

    \b(?:not|never|no)\b[\w\s]+[^\w\s]
    

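    Your negation keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and whitespace (\w is [0-9a-zA-Z_], \s is any kind of whitespace), up to a character that is neither alphanumeric nor whitespace (which acts as the punctuation).

    Note that the punctuation is mandatory here, but you can safely remove [^\w\s] so that a match may also run to the end of the string.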

  • Now you are dealing with strings such as never going to work, and the like. Just select each word together with the whitespace in front of it

    (\s+)(\w+)
    

    and replace it with what you want

    \1NEG_\2
    
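The variant mentioned in the first step, with the trailing [^\w\s] removed, can be sketched as a small helper. The function name mark_negation and the constants NEGATION_SCOPE and SCOPED_WORD are hypothetical names chosen for this illustration. The sketch assumes that any character other than a word character or whitespace ends the negation scope.

```python
import re

# Outer pattern without the mandatory punctuation: a negation keyword followed
# by words and whitespace. The scope ends at the first other character or at
# the end of the string, whichever comes first.
NEGATION_SCOPE = re.compile(r'\b(?:not|never|no)\b[\w\s]+', re.IGNORECASE)

# Inner pattern: a word together with the whitespace in front of it.
SCOPED_WORD = re.compile(r'(\s+)(\w+)')


def mark_negation(text):
    """Prefix NEG_ to each word between a negation keyword and the next punctuation."""
    def tag_scope(match):
        return SCOPED_WORD.sub(r'\1NEG_\2', match.group(0))
    return NEGATION_SCOPE.sub(tag_scope, text)


# No closing punctuation: the original pattern would leave this string unchanged.
print(mark_negation("He did not play so well"))
# -> He did not NEG_play NEG_so NEG_well

print(mark_negation("It was never going to work, he thought."))
# -> It was never NEG_going NEG_to NEG_work, he thought.
```

Compiling both patterns once is only a convenience for repeated calls; the substitution logic is the same as in the one-liner above.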