Oll*_*ass 25 python tagging nltk
Chapter 5 of the Python NLTK book gives this example of tagging the words in a sentence:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
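nltk.pos_tag calls the default tagger, which uses the full tagset. Later in the chapter, a simplified tagset is introduced.
How can I tag a sentence using this simplified set of part-of-speech tags?
Also, have I understood the tagger correctly: can I change the tagset the tagger uses, should I map the tags it returns onto the simplified set, or should I create a new tagger from a corpus tagged with the simplified set?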
小智 29
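An update, in case anyone runs into the same problem: NLTK has since moved to a "universal" tagset (source here). After tagging the text, use map_tag to simplify the tags.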
import nltk
from nltk.tag import pos_tag, map_tag
text = nltk.word_tokenize("And now for something completely different")
posTagged = pos_tag(text)
simplifiedTags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged]
print(simplifiedTags)
# [('And', u'CONJ'), ('now', u'ADV'), ('for', u'ADP'), ('something', u'NOUN'), ('completely', u'ADV'), ('different', u'ADJ')]
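To make the mapping step concrete, here is a minimal sketch of the same idea without NLTK: a lookup table from fine-grained tags to coarse ones. The names PTB_TO_UNIVERSAL and simplify_tags are hypothetical, the table covers only the tags in the example sentence, and falling back to "X" for unmapped tags is an assumption of this sketch rather than a description of what map_tag does internally.

```python
# Hypothetical, partial table from Penn Treebank tags to universal tags.
# It covers only the tags that occur in the example sentence above.
PTB_TO_UNIVERSAL = {
    "CC": "CONJ",
    "RB": "ADV",
    "IN": "ADP",
    "NN": "NOUN",
    "JJ": "ADJ",
}


def simplify_tags(tagged_sent, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Replace each fine-grained tag with its coarse tag.

    Tags missing from `mapping` become `unknown` (an assumption of this sketch).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]


# The tagged sentence from the question, written out by hand.
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("completely", "RB"), ("different", "JJ")]
simplified = simplify_tags(tagged)
print(simplified)
```

In practice map_tag is the better choice, since it ships with a complete table; the sketch only shows that simplifying is a post-processing step on the tagger's output, not a change to the tagger itself.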
Jac*_*cob 22
To simplify the tags from the default tagger, you can use nltk.tag.simplify.simplify_wsj_tag, like so:
>>> import nltk
>>> from nltk.tag.simplify import simplify_wsj_tag
>>> tokens = nltk.word_tokenize("And now for something completely different")
>>> tagged_sent = nltk.pos_tag(tokens)
>>> simplified = [(word, simplify_wsj_tag(tag)) for word, tag in tagged_sent]
You can simply set the tagset argument of pos_tag to "universal".
In [39]: from nltk import word_tokenize, pos_tag
...:
...: text = word_tokenize("Here is a simple way of doing this")
...: tags = pos_tag(text, tagset='universal')
...: print(tags)
...:
[('Here', 'ADV'), ('is', 'VERB'), ('a', 'DET'), ('simple', 'ADJ'), ('way', 'NOUN'), ('of', 'ADP'), ('doing', 'VERB'), ('this', 'DET')]