NLTK: turning a subtree into a list in Python / RSS feed chunking

Question


Eng*_*rad 1 tree parsing list nltk chunks

Using the code below, I chunk an RSS feed that has already been tokenized and POS-tagged. "print subtree.leaves()" outputs:

[('Prime', 'NNP'), ('Minister', 'NNP'), ('Stephen', 'NNP'), ('Harper', 'NNP')]
[('US', 'NNP'), ('President', 'NNP'), ('Barack', 'NNP'), ('Obama', 'NNP')]
[('what\', 'NNP')]
[('Keystone', 'NNP'), ('XL', 'NNP')]
[('CBC', 'NNP'), ('News', 'NNP')]

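This looks like a Python list, but I don't know how to access it directly or iterate over it. I think it is subtree output.

I want to be able to turn this subtree into a list that I can manipulate. Is there a shortcut? This is the first time I have run into trees in Python, and I'm lost. I want to end up with this list:

docs = ["Prime Minister Stephen Harper", "US President Barack Obama", "what\", "Keystone XL", "CBC News"]

Is there a simple way to do this?

Thanks, as always, for the help!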

import nltk

grammar = r""" Proper: {<NNP>+} """

cp = nltk.RegexpParser(grammar)
result = cp.parse(posDocuments)
nounPhraseDocs.append(result)

for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
print " "


Answer 1

Dea*_*oke 5

node has now been replaced by label(). So, modifying Viktor's answer:

docs = []

for subtree in result.subtrees(filter=lambda t: t.label() == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))


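This gives you a list of only those tokens that belong to a Proper chunk. You can remove the filter argument from the subtrees() method, and you will then get a list of all the tokens belonging to each particular parent in the tree.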
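To make the difference between the filtered and unfiltered calls concrete, here is a minimal, self-contained sketch for NLTK 3 and Python 3. The tagged sentence is made up for illustration and stands in for the asker's posDocuments, which is not shown in the question.

```python
import nltk

# Hypothetical pre-tagged input, standing in for the asker's posDocuments.
tagged = [
    ("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP"),
    ("met", "VBD"),
    ("US", "NNP"), ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP"),
    ("today", "NN"),
]

# Same grammar as in the question: one or more consecutive NNP tags.
cp = nltk.RegexpParser(r"Proper: {<NNP>+}")
result = cp.parse(tagged)

# With the filter: one joined string per Proper chunk.
docs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees(filter=lambda t: t.label() == "Proper")
]
print(docs)

# Without the filter: the root tree is visited too, so the first entry
# is the whole sentence, followed by the individual chunks.
all_subtrees = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees()
]
print(all_subtrees)
```

Since leaves() returns an ordinary list of (word, tag) tuples, the join works the same way on any subtree; only the set of subtrees visited changes.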

Archived:	12 years, 2 months ago
Views:	1732
Last recorded:	8 years, 11 months ago