Parsing multiple sentences with MaltParser using NLTK

alv*_*vas 11 python java parsing nlp nltk

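There have already been many questions about MaltParser and/or NLTK.

There is now a more stable version of the MaltParser API in NLTK: https://github.com/nltk/nltk/pull/944, but problems appear when parsing multiple sentences at the same time.

Parsing one sentence at a time seems fine: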

>>> from nltk.parse.malt import MaltParser
>>> _path_to_maltparser = '/home/alvas/maltparser-1.8/dist/maltparser-1.8/'
>>> _path_to_model = '/home/alvas/engmalt.linear-1.7.mco'
>>> mp = MaltParser(path_to_maltparser=_path_to_maltparser, model=_path_to_model)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> sent2 = 'Time flies like banana'.split()
>>> print(mp.parse_one(sent).tree())
(pajamas (shot I) an elephant in my)

But parsing a list of sentences does not return a DependencyGraph object:

>>> _path_to_maltparser = '/home/alvas/maltparser-1.8/dist/maltparser-1.8/'
>>> _path_to_model = '/home/alvas/engmalt.linear-1.7.mco'
>>> mp = MaltParser(path_to_maltparser=_path_to_maltparser, model=_path_to_model)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> sent2 = 'Time flies like banana'.split()
>>> print(mp.parse_one(sent).tree())
(pajamas (shot I) an elephant in my)
>>> print(next(mp.parse_sents([sent,sent2])))
<listiterator object at 0x7f0a2e4d3d90> 
>>> print(next(next(mp.parse_sents([sent,sent2]))))
[{u'address': 0,
  u'ctag': u'TOP',
  u'deps': [2],
  u'feats': None,
  u'lemma': None,
  u'rel': u'TOP',
  u'tag': u'TOP',
  u'word': None},
 {u'address': 1,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 2,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'I'},
 {u'address': 2,
  u'ctag': u'NN',
  u'deps': [1, 11],
  u'feats': u'_',
  u'head': 0,
  u'lemma': u'_',
  u'rel': u'null',
  u'tag': u'NN',
  u'word': u'shot'},
 {u'address': 3,
  u'ctag': u'AT',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'AT',
  u'word': u'an'},
 {u'address': 4,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'elephant'},
 {u'address': 5,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'in'},
 {u'address': 6,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'my'},
 {u'address': 7,
  u'ctag': u'NNS',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NNS',
  u'word': u'pajamas'},
 {u'address': 8,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'Time'},
 {u'address': 9,
  u'ctag': u'NNS',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NNS',
  u'word': u'flies'},
 {u'address': 10,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'like'},
 {u'address': 11,
  u'ctag': u'NN',
  u'deps': [3, 4, 5, 6, 7, 8, 9, 10],
  u'feats': u'_',
  u'head': 2,
  u'lemma': u'_',
  u'rel': u'dep',
  u'tag': u'NN',
  u'word': u'banana'}]

Why doesn't parse_sents() return an iterable of parse_one() results?

I could, however, just be lazy and do:

>>> _path_to_maltparser = '/home/alvas/maltparser-1.8/dist/maltparser-1.8/'
>>> _path_to_model = '/home/alvas/engmalt.linear-1.7.mco'
>>> mp = MaltParser(path_to_maltparser=_path_to_maltparser, model=_path_to_model)
>>> sent1 = 'I shot an elephant in my pajamas'.split()
>>> sent2 = 'Time flies like banana'.split()
>>> sentences = [sent1, sent2]
>>> for sent in sentences:
...     print(mp.parse_one(sent).tree())

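But this is not the solution I'm looking for. My question is why parse_sents() does not return an iterable of parse_one() results, and how this could be fixed in the NLTK code.

After @NikitaAstrakhantsev's answer, I tried it, and it now outputs a parse tree, but the tree looks confused: the two sentences appear to be merged into one before parsing.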

# Initialize a MaltParser object with a pre-trained model.
mp = MaltParser(path_to_maltparser=path_to_maltparser, model=path_to_model) 
sent = 'I shot an elephant in my pajamas'.split()
sent2 = 'Time flies like banana'.split()
# Parse a single sentence.
print(mp.parse_one(sent).tree())
print(next(next(mp.parse_sents([sent,sent2]))).tree())

[OUT]:

(pajamas (shot I) an elephant in my)
(shot I (banana an elephant in my pajamas Time flies like))

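From the code, it seems to be doing something odd: https://github.com/nltk/nltk/blob/develop/nltk/parse/api.py#L45

Why does the parser abstract class in NLTK lump the two sentences into one before parsing? Am I calling parse_sents() incorrectly? If so, what is the correct way to call parse_sents()?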

Nik*_*sev 5

As I see in your code samples, you don't call tree() in this line:

>>> print(next(next(mp.parse_sents([sent,sent2])))) 

whereas you do call tree() in every case that uses parse_one().
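The output in the question suggests that parse_sents() gives back an iterator whose items are themselves iterators, so tree() has to be called on the innermost objects. Here is a minimal sketch of consuming that shape without MaltParser installed. `FakeGraph` and `fake_parse_sents` are hypothetical stand-ins, not NLTK classes, and the "tree" is only a bracketed string.

```python
from itertools import chain


class FakeGraph:
    """Hypothetical stand-in for a parsed sentence; not NLTK's DependencyGraph."""

    def __init__(self, words):
        self.words = list(words)

    def tree(self):
        # Placeholder "tree": just a bracketed string of the words.
        return "(" + " ".join(self.words) + ")"


def fake_parse_sents(sentences):
    """Yield one iterator of graphs per input sentence (an iterator of iterators)."""
    for words in sentences:
        yield iter([FakeGraph(words)])


sent = "I shot an elephant in my pajamas".split()
sent2 = "Time flies like banana".split()

# Flatten the outer iterator, then call tree() on each inner graph.
nested = fake_parse_sents([sent, sent2])
trees = [graph.tree() for graph in chain.from_iterable(nested)]
for tree in trees:
    print(tree)
```

With the real parser, the same `chain.from_iterable(...)` pattern should apply, assuming each inner item exposes tree() as the parse_one() result does.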

Apart from that, I don't see why this would happen: the parse_one() method of ParserI is not overridden in MaltParser, and all it does is call MaltParser's parse_sents(); see the code.

UPDATE: The line you are talking about is never called, because parse_sents() is overridden in MaltParser and is called directly.
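To illustrate that point, here is a sketch with stand-in classes (`BaseParser` and `BatchParser` are hypothetical and simplified; this is not NLTK's source). It shows that once a subclass overrides parse_sents(), the base-class version is never reached, even when the call starts from parse_one().

```python
class BaseParser:
    """Hypothetical base class, loosely modelled on the delegation described above."""

    def parse_sents(self, sentences):
        # Default behaviour: parse each sentence independently.
        return (self.parse(sentence) for sentence in sentences)

    def parse(self, sentence):
        raise NotImplementedError

    def parse_one(self, sentence):
        # Not overridden by the subclass: take the first parse, or None.
        return next(self.parse(sentence), None)


class BatchParser(BaseParser):
    """Hypothetical subclass that overrides parse_sents()."""

    def __init__(self):
        self.calls = []

    def parse_sents(self, sentences):
        # This override runs instead of BaseParser.parse_sents().
        self.calls.append("BatchParser.parse_sents")
        return (iter([" ".join(sentence)]) for sentence in sentences)

    def parse(self, sentence):
        # A single sentence is routed through the batch method.
        return next(self.parse_sents([sentence]))


parser = BatchParser()
result = parser.parse_one("Time flies like banana".split())
print(result, parser.calls)
```

Here `parser.calls` records only the subclass method, which is the behaviour the update describes for MaltParser.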

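My only guess now is that the Java maltparser library does not work correctly with an input file containing several sentences (I mean this block, where Java is run). Maybe the original MaltParser has changed its format and the separator is no longer '\n\n'. Unfortunately, I can't run this code myself, because maltparser.org has been down for the second day. I checked that the input file has the expected format (sentences are separated by a double newline), so it is very unlikely that the Python wrapper merges the sentences.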
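One way to run the same check is to split the file contents on blank lines and count the blocks. The sketch below assumes a simplified two-column, tab-separated layout (index and word) rather than the full CoNLL columns, and `to_conll_block` and `split_blocks` are hypothetical helper names.

```python
def to_conll_block(words):
    """Format one tokenised sentence as simplified tab-separated rows (assumed layout)."""
    return "\n".join(f"{i}\t{word}" for i, word in enumerate(words, start=1))


def split_blocks(text):
    """Split parser input or output into per-sentence blocks on blank lines."""
    return [block for block in text.strip().split("\n\n") if block.strip()]


sent = "I shot an elephant in my pajamas".split()
sent2 = "Time flies like banana".split()

# Sentences are joined with a blank line, i.e. a double newline.
conll_text = "\n\n".join(to_conll_block(s) for s in (sent, sent2)) + "\n"

blocks = split_blocks(conll_text)
sizes = [len(block.split("\n")) for block in blocks]
print(len(blocks), sizes)
```

If the parser's output file yields a single block for two input sentences, the merge is happening on the Java side rather than in the wrapper's input.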