Named Entity Recognition with Regular Expressions: NLTK

pg2*_*455 10 regex nlp named-entity-recognition nltk

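I have been playing with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but found none. So I am posting my question here.

Many times, the NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.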

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

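Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.

So I think that if nltk.ne_chunk is run first, and two consecutive trees then turn out to be NNPs, there is a high chance that both refer to one entity.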

Any suggestions would be much appreciated. I am looking for flaws in my approach.

Thanks.

alv*_*vas 16

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

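def get_continuous_chunks(text):
    # Tokenize, POS-tag, then run NLTK's default NE chunker.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged entities, in order of first appearance
    current_chunk = []     # adjacent NE subtrees collected so far

    for i in chunked:
        if isinstance(i, Tree):
            # An NE subtree: join its tokens and extend the current run.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A plain (token, tag) pair ends the run: flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even when the entity is a duplicate

    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))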

[OUT]:

['Barack Obama']

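But do note that if consecutive chunks are not supposed to be a single NE, you will end up combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it can happen. If the chunks are not adjacent, though, the script above works fine: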

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
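
For comparison, the regex-style idea from the question (treat any run of consecutive NNP tags as one candidate entity) can be sketched in plain Python without the NE chunker. This is only a rough sketch under assumptions: `group_nnp_runs` is a made-up helper name, and the `(token, tag)` pairs are typed in by hand rather than produced by `pos_tag`.

```python
from itertools import groupby

def group_nnp_runs(tagged):
    """Join each run of consecutive NNP-tagged tokens into one string."""
    spans = []
    # groupby splits the sequence into maximal runs sharing the same key.
    for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            spans.append(" ".join(token for token, _tag in run))
    return spans

# Hand-written (token, tag) pairs standing in for pos_tag() output.
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("person", "NN"), (".", ".")]
print(group_nnp_runs(tagged))  # ['Barack Obama']
```

Note that this over-merges titles with names: on the second sentence in the question, Vice/NNP President/NNP Dick/NNP Cheney/NNP would come out as one span, which is one reason to merge NE subtrees rather than raw NNP tags.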


小智 5

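There is a bug in @alvas's answer: a fencepost error. Make sure the elif check also runs once outside the loop, so that an NE at the very end of the sentence is not dropped. So: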

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even if the entity was already seen
    # Same check once more after the loop, for an NE that ends the input.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print(get_continuous_chunks(txt))
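
One way to avoid the fencepost problem altogether is to let `itertools.groupby` find the runs, since it always yields the final run without a separate check after the loop. The sketch below is an assumption-laden stand-in rather than NLTK code: `merge_adjacent_chunks`, `is_chunk` and `chunk_text` are made-up names, and plain lists play the role of NE subtrees. With NLTK you would presumably pass `lambda x: isinstance(x, Tree)` and a function that joins the tokens of `x.leaves()`, but that wiring is not tested here.

```python
from itertools import groupby

def merge_adjacent_chunks(items, is_chunk, chunk_text):
    """Merge each run of adjacent chunk items into one entity string."""
    entities = []
    for in_run, run in groupby(items, key=is_chunk):
        if not in_run:
            continue  # a run of ordinary tokens, nothing to merge
        entity = " ".join(chunk_text(item) for item in run)
        if entity not in entities:  # keep the first occurrence only
            entities.append(entity)
    return entities

# Stand-in data: lists act as NE subtrees, tuples are plain (token, tag) pairs.
# The last entity ends the input, with no trailing token to trigger a flush.
items = [["Barack"], ["Obama"], ("is", "VBZ"), ("married", "VBN"),
         ("to", "TO"), ["Michelle"], ["Obama"]]
print(merge_adjacent_chunks(items, lambda x: isinstance(x, list), " ".join))
# ['Barack Obama', 'Michelle Obama']
```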