sta*_*kit 11 python algorithm tree nltk stanford-nlp
Typically the head of a noun phrase is the rightmost noun of the NP, as shown below, where tree is the head of the parent NP:
                     ROOT
                      |
                      S
              ________|_________
              NP               |
       _______|_______         |
       |             PP        VP
       |          ___|___    __|___
       NP         |     NP   |   PRT
 ______|______    |     |    |    |
DT  JJ  NN  NN   IN    NNP  VBD  RP
 |   |   |   |    |     |    |    |
The old oak tree from India fell down
Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])
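The following code, based on a Java implementation, uses a simplistic rule to find the head of an NP, but I need it to follow head-finding rules instead: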
from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # Leaves are plain strings with no label(), so stop descending.
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # Naive rule: the last leaf is taken as the head, whatever its tag.
        print('NPhead:' + str(t.leaves()[-1]))
    for child in t:
        traverse(child)

tree = Tree.fromstring(parsestr)
traverse(tree)
The above code gives the output:
NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India
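Although it currently gives the correct output for the given sentence, I need to add a condition so that only the rightmost noun is extracted as the head. At present the code does not check whether the last leaf is a noun (NN):
print('NPhead:' + str(t.leaves()[-1]))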
So I am looking for something like the following for the NP head condition in the above code:
t.leaves().getrightmostnoun()
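Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore cover such cases as well.
For the following example, given in one of the answers:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.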
NLTK has a built-in way to turn a bracketed string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html); see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.
>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
...
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
...
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']
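To check the POS tag instead of simply taking the last leaf, one option is Tree.pos(), which returns (word, tag) pairs for the leaves. The following is only a minimal sketch: the helper name rightmost_noun is hypothetical (it is not part of NLTK), and treating every tag that starts with NN as a noun is an assumption.

```python
from nltk.tree import Tree

def rightmost_noun(np_tree):
    """Hypothetical helper: last leaf whose POS tag starts with 'NN', else None."""
    # Tree.pos() lists (word, tag) pairs for the leaves, left to right.
    for word, tag in reversed(np_tree.pos()):
        if tag.startswith('NN'):  # assumption: NN, NNS, NNP, NNPS count as nouns
            return word
    return None

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
tree = Tree.fromstring(parsestr)

# One result per NP subtree, in the order subtrees() visits them.
heads = [rightmost_noun(np) for np in tree.subtrees(lambda t: t.label() == 'NP')]
print(heads)
```

For the example sentence the outer NP still comes out as India rather than tree, because the scan runs over all leaves, including those inside the PP.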
Note that the rightmost noun is not always the head noun of an NP, e.g.
>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
...
Magnificent
talk
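Arguably, Magnificent can still be the head noun. Another example is when the NP includes a relative clause:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.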
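One way to handle cases like this is to inspect the NP's direct children rather than its leaves. The sketch below is a heavily reduced version of that idea, loosely in the spirit of the head rules from Collins's Appendix A mentioned in the question, and not a faithful implementation of them. The names np_head and NOUN_TAGS are hypothetical, and the bracketing of the relative-clause example is hand-written for illustration rather than parser output.

```python
from nltk.tree import Tree

# Assumption: these four Penn Treebank tags are the ones treated as nouns.
NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')

def np_head(np_tree):
    """Hypothetical, simplified head finder for an NP subtree."""
    children = [c for c in np_tree if isinstance(c, Tree)]
    # 1. Prefer the rightmost direct child that is a noun preterminal.
    for child in reversed(children):
        if child.label() in NOUN_TAGS:
            return child[0]
    # 2. Otherwise descend into the leftmost nested NP, if there is one.
    for child in children:
        if child.label() == 'NP':
            return np_head(child)
    # 3. Fall back to the last word of the phrase.
    return np_tree.leaves()[-1]

# Hand-written bracketing of "the person that gave a talk" (an assumption).
rel = Tree.fromstring(
    '(NP (NP (DT the) (NN person)) '
    '(SBAR (WHNP (WDT that)) (S (VP (VBD gave) (NP (DT a) (NN talk))))))')
oak = Tree.fromstring(
    '(NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India))))')

print(np_head(rel))  # person
print(np_head(oak))  # tree
```

Because only direct children are scanned, nouns buried inside a PP or a relative clause are never considered unless the search explicitly descends into them.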