Recursion in nltk's RegexpParser

Leo*_*lán 4 python nlp nltk

Based on the grammar in chapter 7 of the NLTK Book:

grammar = r"""
      NP: {<DT|JJ|NN.*>+} # ...
"""

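I want to extend NP (noun phrase) to include multiple NPs joined by CC (a coordinating conjunction such as and) or , (a comma), so that it captures noun phrases like:

  • The house and tree
  • Apple, orange and mango
  • Car, house and plane

I cannot get my modified grammar to capture these as a single NP: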

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

The result is:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

I tried moving the NP to the beginning, NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}, but I get the same result:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

alv*_*vas 6

Let's start small and capture the NPs (noun phrases) correctly:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

Now let's try to catch that and/CC. Simply add a higher-level phrase that reuses the <NP> rule:

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC><NP>}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))

Now that we catch NP CC NP phrases, let's get a little fancier and see whether it also catches commas:

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC|,><NP>}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

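Now we see that it only catches the first left-bounded NP CC|, NP and leaves the last NP on its own.

Conjoined phrases in English have a conjunction on the left and an NP on the right, i.e. CC|, NP, as in "and the tree". Because this CC|, NP pattern repeats, we can use it as an intermediate representation.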
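
The partial match described here can be reproduced without a tagger model. The following is a minimal sketch that feeds hand-tagged tokens to the parser; the tags are an assumption about what `nltk.pos_tag` would return for this sentence, and `top_level` is a helper name introduced only for illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  CNP: {<NP><CC|,><NP>}
"""

# Hand-tagged tokens (assumed tagger output), so no tagger model is required.
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Summarize the top level: chunk labels for subtrees, POS tags for bare tokens.
top_level = [child.label() if isinstance(child, nltk.Tree) else child[1]
             for child in tree]
print(top_level)
```

The CNP rule consumes the first NP, the comma and the second NP in one match, so the remaining and/CC has no NP to its left and the last NP stays outside the CNP.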

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))

Ultimately, the CNP (conjunctive NP) grammar captures chained noun-phrase conjunctions in English, even complicated ones, e.g.

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)
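
One consequence of the `<NP><XNP>+` rule is that it needs at least one XNP, so a noun phrase with no conjunction should remain a plain NP rather than become a CNP. A minimal sketch with hand-tagged tokens (the tags are assumed, not produced by a tagger):

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

# A sentence with a single, unconjoined noun phrase (hand-tagged).
tagged = [('The', 'DT'), ('house', 'NN'), ('fell', 'VBD')]

tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Labels of all subtrees, root first.
labels = [subtree.label() for subtree in tree.subtrees()]
print(labels)
```

Code that collects only CNP subtrees would therefore skip phrases like this one; collecting NP subtrees that are not inside a CNP as well is one way to cover both cases.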

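And if you are only interested in extracting the noun phrases, here is a traversal based on "How to Traverse an NLTK Tree object?":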

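def traverse_tree(tree, noun_phrases=None):
    # Use a fresh list per top-level call so results do not pile up across calls.
    if noun_phrases is None:
        noun_phrases = []
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join(token for token, tag in tree.leaves()))
    for subtree in tree:
        # Leaves are (word, tag) tuples; only recurse into real subtrees.
        if isinstance(subtree, nltk.Tree):
            traverse_tree(subtree, noun_phrases)
    return noun_phrases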

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
traverse_tree(chunkParser.parse(tagged))

[out]:

['The house , the bear , the green house and a tree', 'the park or the river']
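
As an alternative to hand-written recursion, `Tree.subtrees()` accepts a `filter` callable, which allows the same extraction in a single expression. This is a sketch, not part of the original answer: `extract_phrases` and its `labels` parameter are names made up for illustration, and the tokens are hand-tagged rather than produced by `nltk.pos_tag`.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  XNP: {<CC|,><NP>}
  CNP: {<NP><XNP>+}
"""

def extract_phrases(tree, labels=('CNP',)):
    """Return the text of every subtree whose label is in `labels`."""
    return [' '.join(token for token, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() in labels)]

# Hand-tagged tokens (assumed tags), so no tagger model is required.
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','), ('the', 'DT'),
          ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN'), ('went', 'VBD'),
          ('to', 'TO'), ('the', 'DT'), ('park', 'NN'), ('or', 'CC'),
          ('the', 'DT'), ('river', 'NN')]

tree = nltk.RegexpParser(grammar).parse(tagged)
phrases = extract_phrases(tree)
print(phrases)
```

Passing `labels=('CNP', 'NP')` would also return every NP, including those nested inside a CNP, so the result would then contain overlapping phrases.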

Also, see "Python (NLTK) - more efficient way to extract noun phrases?"