Detecting country names with NLTK does not work on form text

vin*_*ita 5 python nlp machine-learning nltk

I am parsing a form that contains this text:

'1a. Country United States'

Here "United States" is not detected as a GPE:

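import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

# The form text being parsed (see above)
cioms_ = '1a. Country United States'

tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(cioms_)  # split on single spaces only
pos = pos_tag(toks)                # part-of-speech tag each token
chunked_nes = ne_chunk(pos)        # group tagged tokens into named-entity chunks

# Join the tokens of every named-entity subtree into a single string
nes = [' '.join(leaf[0] for leaf in ne.leaves()) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
chunked_nes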

Out[83]: Tree('S', [(u'1a.', 'CD'), Tree('ORGANIZATION', [(u'Country', 'NNP'), (u'United', 'NNP'), (u'States', 'NNPS')])])


But when I trim the text to 'Country United States', it is detected:

Out[81]: Tree('S', [Tree('PERSON', [(u'Country', 'NNP')]), Tree('GPE', [(u'United', 'NNP'), (u'States', 'NNPS')])])
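
To make the difference between the two results concrete, here is a minimal sketch that rebuilds both trees from the outputs shown in this question and keeps only the chunks carrying a given label. The helper name `entities_with_label` is hypothetical (it is not part of NLTK); it relies only on `Tree.subtrees`, `Tree.label`, and `Tree.leaves`, and it needs no tagger or chunker models because the trees are typed in by hand.

```python
from nltk.tree import Tree


def entities_with_label(tree, label):
    """Return the text of every chunk in `tree` whose label equals `label`.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    return [
        ' '.join(token for token, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


# Tree copied from the output for the full form line.
full_form = Tree('S', [
    ('1a.', 'CD'),
    Tree('ORGANIZATION', [('Country', 'NNP'), ('United', 'NNP'), ('States', 'NNPS')]),
])

# Tree copied from the output for the trimmed text.
trimmed = Tree('S', [
    Tree('PERSON', [('Country', 'NNP')]),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
])

print(entities_with_label(full_form, 'GPE'))           # []
print(entities_with_label(trimmed, 'GPE'))             # ['United States']
print(entities_with_label(full_form, 'ORGANIZATION'))  # ['Country United States']
```

In the full form line the chunker merges 'Country United States' into a single ORGANIZATION chunk, so filtering for GPE returns nothing, while the trimmed text yields a separate GPE chunk for 'United States'.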


Why does this happen?
