Cos*_*ang 11 python nlp named-entity-recognition nltk stanford-nlp
I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
Is it possible to chunk things together using this output? What I want is something like this:
u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'
Thanks!
It looks long, but it does the job:
ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
# prev_tag remembers the tag of the previous token; None means "no previous token"
chunked, prev_tag = [], None
for word_pos in ner_output:
    word, pos = word_pos
    # Extend the last chunk when this token continues the same entity type
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos
# Merged chunks look like (w1, tag, w2, tag, ...): join the words, keep one tag
clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]]) if len(wordpos)!=2 else wordpos for wordpos in chunked]
print(clean_chunked)
[OUT]:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]
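For comparison, the same merging of adjacent same-tag tokens can be sketched with `itertools.groupby` from the standard library. This is only an alternative sketch, not the approach used in the answer; the names `merge_entities` and `ENTITY_TAGS` are made up for illustration.

```python
from itertools import groupby

# Hypothetical constant: the entity tags that should be merged
ENTITY_TAGS = {'PERSON', 'ORGANIZATION', 'LOCATION'}

def merge_entities(tagged):
    """Merge runs of adjacent (word, tag) pairs that share an entity tag."""
    merged = []
    # groupby yields one group per run of consecutive pairs with the same tag
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in ENTITY_TAGS:
            merged.append((" ".join(words), tag))
        else:
            # Non-entity tokens (tag 'O') stay as separate items
            merged.extend((word, tag) for word in words)
    return merged

ner_output = [('Remaking', 'O'), ('The', 'O'),
              ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(merge_entities(ner_output))
# [('Remaking', 'O'), ('The', 'O'), ('Republican Party', 'ORGANIZATION')]
```

Like the loop version, this treats two different entities of the same type that happen to be adjacent as a single entity, because the plain tags carry no boundary information.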
In more detail:
The first for-loop "with memory" produces something like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]
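You'll notice that every named entity now has more than two items in its tuple, and what you want is the words joined into a single element, i.e. 'Republican Party' from (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'), so you do something like this to get the elements at even indices: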
>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]
Then you also realize that the last element in the NE tuple is the tag you want, so you do:
>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'
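Putting the two slicing steps together on one merged tuple gives the final (words, tag) pair. A minimal sketch; the variable names are only illustrative:

```python
entity = ('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')

words = entity[::2]   # items at even indices are the words
tag = entity[-1]      # the last item is the tag shared by all the words
merged = (" ".join(words), tag)

print(merged)
# ('Republican Party', 'ORGANIZATION')
```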
It's a little ad hoc and verbose, but I hope it helps. And here it is as a function. Blessed Christmas:
ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

def rechunk(ner_output):
    # prev_tag remembers the tag of the previous token; None means "no previous token"
    chunked, prev_tag = [], None
    for word_pos in ner_output:
        word, pos = word_pos
        # Extend the last chunk when this token continues the same entity type
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos
    # Merged chunks look like (w1, tag, w2, tag, ...): join the words, keep one tag
    clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]])
                     if len(wordpos)!=2 else wordpos for wordpos in chunked]
    return clean_chunked

print(rechunk(ner_output))
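You can use the standard NLTK way of representing chunks, which is nltk.Tree. This might mean that you have to change your representation a bit.
What I usually do is represent NER-tagged sentences as lists of triples: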
sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
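If the POS tags and the NER tags come from two separate taggers, triples of this shape can be assembled by zipping the two outputs. The following is a sketch under the assumption that both taggers tokenized the sentence identically; `to_triples` and the short sample sentence are made up for illustration.

```python
def to_triples(pos_tagged, ner_tagged):
    """Combine (word, POS) pairs and (word, NER) pairs into (word, POS, NER) triples."""
    # Assumption: both inputs cover the same tokens in the same order
    if len(pos_tagged) != len(ner_tagged):
        raise ValueError("the two taggings differ in length")
    triples = []
    for (word, pos), (ner_word, ner) in zip(pos_tagged, ner_tagged):
        if word != ner_word:
            raise ValueError("token mismatch: %r vs %r" % (word, ner_word))
        triples.append((word, pos, ner))
    return triples

# Illustrative input only
pos_tagged = [('Andrew', 'NNP'), ('is', 'VBZ'), ('in', 'IN'), ('Dallas', 'NNP')]
ner_tagged = [('Andrew', 'PERSON'), ('is', 'O'), ('in', 'O'), ('Dallas', 'LOCATION')]
print(to_triples(pos_tagged, ner_tagged))
# [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
```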
I do this when I use an external tool to NER-tag a sentence. Now you can convert this sentence into the NLTK representation:
from nltk import Tree

def IOB_to_tree(iob_tagged):
    root = Tree('S', [])
    for token in iob_tagged:
        if token[2] == 'O':
            # Non-entity tokens become (word, POS) leaves directly under the root
            root.append((token[0], token[1]))
        else:
            try:
                if root[-1].label() == token[2]:
                    # Same entity type as the previous subtree: extend it
                    root[-1].append((token[0], token[1]))
                else:
                    root.append(Tree(token[2], [(token[0], token[1])]))
            except (IndexError, AttributeError):
                # root is empty, or the previous child is a plain (word, POS) tuple
                root.append(Tree(token[2], [(token[0], token[1])]))
    return root

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))
The change in representation makes sense, because you certainly need POS tags for NER tagging.
The end result should look like this:
(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
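Despite the function name, the plain PERSON / ORGANIZATION / LOCATION labels in the triples are not strictly IOB tags, since they have no B- or I- prefix. If a tool expects prefixed tags, one way to derive them is sketched below. The helper name `add_iob_prefixes` is hypothetical, and the sketch assumes that a change of tag always starts a new entity.

```python
def add_iob_prefixes(triples):
    """Turn (word, POS, TAG) triples into (word, POS, B-TAG / I-TAG / O) triples."""
    prefixed = []
    prev = 'O'
    for word, pos, ner in triples:
        if ner == 'O':
            tag = 'O'
        elif ner == prev:
            # Same entity type as the previous token: continue that entity
            tag = 'I-' + ner
        else:
            # First token of a new entity
            tag = 'B-' + ner
        prefixed.append((word, pos, tag))
        prev = ner
    return prefixed

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'),
            ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'),
            ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
for triple in add_iob_prefixes(sentence):
    print(triple)
```

As far as I know, triples of this prefixed shape are what `nltk.chunk.conlltags2tree` accepts, but check that against the NLTK version you have installed. Note that two adjacent entities of the same type still end up merged, because the original tags do not mark where one entity ends and the next begins.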