Finding the head of a noun phrase in NLTK and Stanford parses according to the NP head-finding rules

sta*_*kit 11 python algorithm tree nltk stanford-nlp

Generally, the head of a noun phrase is the rightmost noun of the NP; in the tree shown below, tree is the head of the parent NP:

            ROOT                             
             |                                
             S                               
          ___|________________________        
         NP                           |      
      ___|_____________               |       
     |                 PP             VP     
     |             ____|____      ____|___    
     NP           |         NP   |       PRT 
  ___|_______     |         |    |        |   
 DT  JJ  NN  NN   IN       NNP  VBD       RP 
 |   |   |   |    |         |    |        |   
The old oak tree from     India fell     down

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])

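The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to follow the head-finding rules: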

from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # Leaves are plain strings with no label(), so stop here.
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # Simplistic rule: take the last leaf as the head, without checking its POS tag.
        print('NPhead:' + str(t.leaves()[-1]))
    # Recurse into the children whether or not this node is an NP.
    for child in t:
        traverse(child)


tree = Tree.fromstring(parsestr)
traverse(tree)

The code above gives this output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India

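It runs on the given sentence, but I need to add a condition so that only the rightmost noun is extracted as the head. At the moment this line does not check whether the last leaf is a noun (NN):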

print('NPhead:' + str(t.leaves()[-1]))

So I am looking for something like the following as the NP head condition in the code above:

t.leaves().getrightmostnoun() 

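Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore cover those cases as well.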

Consider the following example, given in one of the answers:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.

alv*_*vas 8

NLTK has a built-in way to turn a string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html); see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...             print(i)
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))


>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...             print(i.leaves())
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']
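Building on the loop above, a minimal sketch of the "rightmost noun" condition the question asks for could use Tree.pos(), which returns (word, tag) pairs for the leaves. The helper name rightmost_noun is hypothetical (it is not part of NLTK), and treating every tag that starts with 'NN' as a noun is an assumption about the Penn Treebank tagset:

```python
from nltk.tree import Tree

def rightmost_noun(np):
    """Return the rightmost leaf under np whose POS tag starts with 'NN', or None."""
    # Tree.pos() yields (word, tag) pairs in left-to-right order.
    for word, tag in reversed(np.pos()):
        if tag.startswith('NN'):  # NN, NNS, NNP, NNPS
            return word
    return None

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
tree = Tree.fromstring(parsestr)
for np in tree.subtrees(lambda t: t.label() == 'NP'):
    print(np.leaves(), '->', rightmost_noun(np))
```

For the outer NP this returns India rather than tree, because it scans all leaves under the NP and ignores the NP's internal structure.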

Note that the rightmost noun is not always the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...             print(i.leaves()[-1])
... 
Magnificent
talk

Arguably, Magnificent can still be the head noun. Another example is when the NP contains a relative clause:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.
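To handle cases like the relative clause, the head can be chosen from the NP's immediate children by priority rather than from its leaves. The sketch below is a simplified approximation of the NP rules as they are commonly summarised from Collins's Appendix A; the priority lists should be checked against the dissertation before relying on them. The function names, the fallback of taking the last child for non-NP nodes, and the bracketing of the relative-clause example are all assumptions made for illustration:

```python
from nltk.tree import Tree

# Priority lists approximating the NP head rules (assumed; verify against the source).
NOUN_LIKE = {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}

def _first(children, labels):
    """Return the first child in the given order whose label is in labels."""
    for child in children:
        if child.label() in labels:
            return child
    return None

def np_head_child(np):
    """Pick the head child of an NP that has at least one subtree child."""
    kids = [c for c in np if isinstance(c, Tree)]
    if kids[-1].label() == 'POS':  # simplification: only checks the last child
        return kids[-1]
    rev = kids[::-1]
    searches = [
        (rev, NOUN_LIKE),                  # right to left: noun-like children
        (kids, {'NP'}),                    # left to right: first embedded NP
        (rev, {'$', 'ADJP', 'PRN'}),
        (rev, {'CD'}),
        (rev, {'JJ', 'JJS', 'RB', 'QP'}),
    ]
    for children, labels in searches:
        hit = _first(children, labels)
        if hit is not None:
            return hit
    return kids[-1]

def np_head_word(np):
    """Descend through head children until a preterminal is reached."""
    node = np
    while isinstance(node, Tree):
        if node.height() == 2:  # preterminal such as (NN tree)
            return node[0]
        # Non-NP nodes would need their own rules; taking the last child is a placeholder.
        node = np_head_child(node) if node.label() == 'NP' else node[-1]
    return node

# Hypothetical bracketing of the relative-clause example.
rel = Tree.fromstring('(NP (NP (DT the) (NN person)) (SBAR (WHNP (WDT that)) (S (VP (VBD gave) (NP (DT the) (NN talk))))))')
print(np_head_word(rel))
```

With this bracketing the outer NP has no noun-like child, so the search falls through to the first embedded NP and descends into it, instead of taking the last leaf.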

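  • Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun (2 upvotes)
  • If you get stuck, ask politely on the NLTK GitHub issues for help implementing it. Better still, try implementing it yourself, open a pull request with your working code and ask for a code review; I'm sure the NLTK devs will help you sort it out. Or wait until someone else codes it =) (2 upvotes)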