vin*_*ita 5 python nlp machine-learning nltk
I am parsing a form that contains this text:
'1a. Country United States'
Here "United States" is not detected as a GPE:
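from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from nltk.tree import Tree  # needed for the isinstance check below

cioms_ = '1a. Country United States'  # the form text quoted above
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(cioms_)  # split on single spaces only
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)
# Join the tokens of every named-entity subtree into one string
nes = [' '.join(token for token, _tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, Tree)]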
chunked_nes
Out[83]: Tree('S', [(u'1a.', 'CD'), Tree('ORGANIZATION', [(u'Country', 'NNP'), (u'United', 'NNP'), (u'States', 'NNPS')])])
But when I trim the text to 'Country United States', "United States" is detected as a GPE:
Out[81]: Tree('S', [Tree('PERSON', [(u'Country', 'NNP')]), Tree('GPE', [(u'United', 'NNP'), (u'States', 'NNPS')])])
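To compare the two results side by side, the label and text of each chunk can be pulled out with a small helper. This is only a sketch: `Chunk` and `labelled_entities` are hypothetical names, and `Chunk` is a hand-written stand-in for `nltk.tree.Tree` so the snippet runs without the NLTK models. It assumes the NLTK 3 style `label()` and `leaves()` methods.

```python
class Chunk(list):
    """Hypothetical stand-in for nltk.tree.Tree: a labelled list of children."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        # Only used on entity chunks here, whose children are (token, tag) tuples.
        return list(self)


def labelled_entities(chunked):
    """Return (label, text) for every labelled subtree directly under the root."""
    return [
        (node.label(), ' '.join(token for token, _tag in node.leaves()))
        for node in chunked
        if hasattr(node, 'label')  # plain (token, tag) tuples have no label()
    ]


# The two results shown in the question, rebuilt by hand.
full_text = Chunk('S', [
    ('1a.', 'CD'),
    Chunk('ORGANIZATION', [('Country', 'NNP'), ('United', 'NNP'), ('States', 'NNPS')]),
])
trimmed = Chunk('S', [
    Chunk('PERSON', [('Country', 'NNP')]),
    Chunk('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
])

print(labelled_entities(full_text))
print(labelled_entities(trimmed))
```

With the real library, the value returned by `ne_chunk(pos)` should be usable in place of the hand-built trees, since it also exposes `label()` and `leaves()`. In the first result the three tokens form a single ORGANIZATION chunk, so no separate GPE chunk exists for "United States".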
Why does this happen?