Absolute position of leaves in NLTK tree

Question

Cor*_*one 5 python tree nlp chunking nltk

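I am trying to find the span (start index, end index) of each noun phrase in a given sentence. Here is the code that extracts the noun phrases: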

import nltk

# `a` is the input sentence (given below)
sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # Each leaf is a (word, tag) tuple; join the words into one phrase.
    nounPhrases.append(' '.join(word for word, tag in subtree.leaves()))

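For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the extracted noun phrases are

['American Civil War', 'War', 'States', 'Civil War', 'civil war', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].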

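Now I need to find the span of each noun phrase (the token positions where the phrase starts and ends). For example, the spans of the noun phrases above would be

[(1, 3), (9, 9), (12, 12), (16, 17), (21, 23), ...].

I am fairly new to NLTK, and I have looked at http://www.nltk.org/_modules/nltk/tree.html. I tried Tree.treepositions(), but I could not work out the absolute positions from the indices it returns. Any help would be appreciated. Thank you!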

Answer 1

alv*_*vas 4

There is no built-in function that returns the offsets of strings/tokens, as highlighted in https://github.com/nltk/nltk/issues/1214
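The chunk tree does keep its leaves in sentence order, though, so the offsets can be recovered by counting leaves while walking the tree. Below is a minimal sketch of that idea, not something NLTK provides: `label_spans` is a hypothetical helper, and the small tree is built by hand (with assumed tags) so the example runs without any tagger models. It only relies on `Tree` being a list of children and on `Tree.label()`.

```python
from nltk import Tree


def label_spans(tree, label):
    """Return inclusive (start, end) leaf-index spans of subtrees with `label`."""
    spans = []

    def walk(node, start):
        # `start` is the absolute index of the first leaf under `node`.
        # Returns how many leaves `node` covers.
        if not isinstance(node, Tree):
            return 1  # a leaf, e.g. a (word, tag) tuple
        offset = start
        for child in node:
            offset += walk(child, offset)
        if node.label() == label and offset > start:
            spans.append((start, offset - 1))
        return offset - start

    walk(tree, 0)
    return sorted(spans)


# Hand-built stand-in for the output of cp.parse(sent_pos) on the first
# ten tokens of the sentence; the tags are illustrative assumptions.
chunked = Tree('S', [
    ('The', 'DT'),
    Tree('NP', [Tree('NBAR', [('American', 'NNP'), ('Civil', 'NNP'), ('War', 'NNP')])]),
    (',', ','),
    ('also', 'RB'),
    Tree('VP', [('known', 'VBN')]),
    ('as', 'IN'),
    ('the', 'DT'),
    Tree('NP', [Tree('NBAR', [('War', 'NNP')])]),
])

print(label_spans(chunked, 'NP'))  # [(1, 3), (9, 9)]
```

Because the spans come from the tree itself, repeated phrases such as "States" each get their own offsets, with no string search involved.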

One workaround is the n-gram searcher that the RIBES score uses, at https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43

(It returns the starting position of the query n-gram.)
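To turn a start position into a (start, end) span, the end index is just start + len(ngram) - 1. As far as I can tell, position_of_ngram stops at the first match, so a phrase that occurs more than once needs a search that keeps going. Here is a rough pure-Python sketch of that; `find_spans` is a hypothetical helper, not part of NLTK, and the tokens are split on whitespace by hand so the example runs without the tokenizer.

```python
def find_spans(phrase_tokens, sentence_tokens):
    """Return every inclusive (start, end) token span where the phrase occurs."""
    phrase = list(phrase_tokens)
    tokens = list(sentence_tokens)
    n = len(phrase)
    spans = []
    if n == 0:
        return spans
    # Slide a window of length n over the sentence and compare token by token.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            spans.append((start, start + n - 1))
    return spans


# First part of the sentence, pre-tokenized so punctuation is its own token.
tokens = ("The American Civil War , also known as the War between "
          "the States or simply the Civil War ,").split()

print(find_spans("American Civil War".split(), tokens))  # [(1, 3)]
print(find_spans("Civil War".split(), tokens))           # [(2, 3), (16, 17)]
```

Note that a plain search also matches "Civil War" inside "American Civil War", so the hits still have to be reconciled with the chunks if the same words appear in more than one phrase.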
