How do I add compound words to the tokenizer in NLTK?

Question

I'd like to know whether anyone knows how to combine multiple terms into a single term in the NLTK tokenizer.

For example, when I run:

nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))

it gives me:

[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]

How can I get 'Apple' and 'Incorporated' combined into a single token, i.e. ('Apple Incorporated', 'NNP')?

Answer 1

小智 1

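You could take a look at nltk.RegexpParser. It lets you chunk tagged tokens using regular expressions over their part-of-speech tags. In your example, you could do something like: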

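import nltk

# Chunk every run of consecutive noun tags into a single NP
pattern = "NP: {<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
# repr() shows the nested Tree structure
print(repr(t))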

This gives you:

Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
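
The result above is a tree of NP chunks, not yet a flat list containing ('Apple Incorporated', 'NNP'). As a minimal sketch of one way to reach the form asked for in the question, the snippet below works directly on the pos_tag output instead of the tree and joins adjacent proper-noun tokens. The helper `merge_runs` and its `merge_tags` parameter are hypothetical names chosen for illustration; they are not part of NLTK.

```python
from itertools import groupby


def merge_runs(tagged, merge_tags=("NNP", "NNPS")):
    """Join consecutive (word, tag) pairs whose tag is in merge_tags.

    Hypothetical helper, not an NLTK function.
    """
    merged = []
    # groupby splits the list into runs of mergeable / non-mergeable tokens
    for mergeable, group in groupby(tagged, key=lambda wt: wt[1] in merge_tags):
        group = list(group)
        if mergeable:
            words = " ".join(word for word, _ in group)
            # Assumption: the merged token keeps the tag of the first word
            merged.append((words, group[0][1]))
        else:
            merged.extend(group)
    return merged


# The tagged list from the question, written out by hand
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
print(merge_runs(tagged))
# [('Apple Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
```

Note that this joins every run of adjacent proper nouns, so two distinct names that happen to sit next to each other would also be merged. If that matters, the NP chunks from the parser are likely the safer starting point.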
