How do I pretty-print an nltk Tree object?

Ler*_*ang 3 python tree nltk pprint

I want to check visually whether the result below is what I need:

import nltk 
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

Source: https://stackoverflow.com/a/31937278/3552975

I don't know why I can't pretty_print result.

result.pretty_print()

The error reads TypeError: not all arguments converted during string formatting. I am using Python 3.5 and nltk 3.3.

alv*_*vas 9

If you are looking for the bracketed parse output, you can use Tree.pprint():

>>> import nltk 
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> 
>>> pattern = """NP: {<DT>?<JJ>*<NN>}
... VBD: {<VBD>}
... IN: {<IN>}"""
>>> NPChunker = nltk.RegexpParser(pattern) 
>>> result = NPChunker.parse(sentence)
>>> result.pprint()
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))

But most likely you are looking for

                             S                                      
            _________________|_____________________________          
           NP                        VBD       IN          NP       
   ________|_________________         |        |      _____|____     
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN

Let's dig into the code of Tree.pretty_print() at https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L692:

def pretty_print(self, sentence=None, highlight=(), stream=None, **kwargs):
    """
    Pretty-print this tree as ASCII or Unicode art.
    For explanation of the arguments, see the documentation for
    `nltk.treeprettyprinter.TreePrettyPrinter`.
    """
    from nltk.treeprettyprinter import TreePrettyPrinter
    print(TreePrettyPrinter(self, sentence, highlight).text(**kwargs),
          file=stream)
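As an aside, the stream argument in that signature is simply handed to print(file=...), so the drawing can be captured in a string instead of going to stdout. A minimal sketch, assuming an nltk install; the example tree and the names tree, buffer and drawing are made up for illustration:

```python
import io

from nltk import Tree

# A small tree whose leaves are already plain strings, so pretty_print()
# has nothing to trip over.
tree = Tree.fromstring("(S (NP the/DT cat/NN) (VBD barked/VBD))")

# pretty_print() forwards `stream` to print(file=...), so any file-like
# object works; StringIO lets us keep the ASCII art as a string.
buffer = io.StringIO()
tree.pretty_print(stream=buffer)
drawing = buffer.getvalue()

print(drawing)
```

This can be handy for logging the tree or comparing it in a test.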

It creates a TreePrettyPrinter object, https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L50

class TreePrettyPrinter(object):
    def __init__(self, tree, sentence=None, highlight=()):
        if sentence is None:
            leaves = tree.leaves()
            if (leaves and not any(len(a) == 0 for a in tree.subtrees())
                    and all(isinstance(a, int) for a in leaves)):
                sentence = [str(a) for a in leaves]
            else:
                # this deals with empty nodes (frontier non-terminals)
                # and multiple/mixed terminals under non-terminals.
                tree = tree.copy(True)
                sentence = []
                for a in tree.subtrees():
                    if len(a) == 0:
                        a.append(len(sentence))
                        sentence.append(None)
                    elif any(not isinstance(b, Tree) for b in a):
                        for n, b in enumerate(a):
                            if not isinstance(b, Tree):
                                a[n] = len(sentence)
                                sentence.append('%s' % b)
        self.nodes, self.coords, self.edges, self.highlight = self.nodecoords(
                tree, sentence, highlight)

It looks like the line raising the error is sentence.append('%s' % b), https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L97

The question is why it raises a TypeError:

TypeError: not all arguments converted during string formatting

If we look closely, '%s' % b works fine with print() for most basic Python types:

# String
>>> x = 'abc'
>>> type(x)
<class 'str'>
>>> print('%s' % x)
abc

# Integer
>>> x = 123
>>> type(x)
<class 'int'>
>>> print('%s' % x)
123

# Float 
>>> x = 1.23
>>> type(x)
<class 'float'>
>>> print('%s' % x)
1.23

# Boolean
>>> x = True
>>> type(x)
<class 'bool'>
>>> print('%s' % x)
True

Surprisingly, it even works on a list!

>>> x = ['abc', 'def']
>>> type(x)
<class 'list'>
>>> print('%s' % x)
['abc', 'def']

But it chokes on a tuple!!

>>> x = ('abc', 'def')
>>> type(x)
<class 'tuple'>
>>> print('%s' % x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not all arguments converted during string formatting
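The reason a tuple is special: the % operator treats a tuple on its right-hand side as the list of arguments, so a two-item tuple supplies two arguments to a format string with a single %s. A short sketch of that behaviour in plain Python (format_leaf and pair are hypothetical names used only here, not part of nltk):

```python
def format_leaf(value):
    """Format any single value with '%s', even when it is a tuple."""
    # Wrapping the value in a one-element tuple makes it one argument.
    return "%s" % (value,)


pair = ("abc", "def")

error_message = None
try:
    "%s" % pair  # two arguments for one placeholder
except TypeError as err:
    error_message = str(err)

# With one placeholder per item, the same tuple formats fine.
two_fields = "%s/%s" % pair

# Wrapped in a one-element tuple, the whole tuple is formatted via str().
whole_tuple = format_leaf(pair)

print(error_message)
print(two_fields)
print(whole_tuple)
```

So the failure comes from how % unpacks tuples, not from anything specific to nltk.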

So if we go back to the code at https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L95

if not isinstance(b, Tree):
    a[n] = len(sentence)
    sentence.append('%s' % b)

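Since we know that sentence.append('%s' % b) cannot handle a tuple, adding a check for the tuple type, joining the items of the tuple in some way, and converting the result to a str will produce a nice pretty_print: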

if not isinstance(b, Tree):
    a[n] = len(sentence)
    if isinstance(b, tuple):
        b = '/'.join(b)
    sentence.append('%s' % b)

[out]:

                             S                                      
            _________________|_____________________________          
           NP                        VBD       IN          NP       
   ________|_________________         |        |      _____|____     
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN

Without changing the nltk code, is it still possible to get the pretty print?

Let's look at what result, i.e. a Tree object, looks like:

Tree('S', [Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]), Tree('VBD', [('barked', 'VBD')]), Tree('IN', [('at', 'IN')]), Tree('NP', [('the', 'DT'), ('cat', 'NN')])])

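It looks like the leaves are kept as a list of tuples of strings, e.g. [('the', 'DT'), ('cat', 'NN')], so we could do some hack to turn them into a list of strings, e.g. ['the/DT', 'cat/NN'], so that Tree.pretty_print() plays nicely.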
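One way to sketch that hack directly is to rebuild the tree with every (word, tag) leaf joined into a single string. This is only an illustration, assuming an nltk install; join_tagged_leaves, chunked and flat are hypothetical names, and the small tree is built by hand rather than by the chunker:

```python
from nltk import Tree


def join_tagged_leaves(tree, sep="/"):
    """Return a new Tree whose tuple leaves are joined into 'word/tag' strings."""
    children = []
    for child in tree:
        if isinstance(child, Tree):
            # Recurse into subtrees so nested chunks are converted too.
            children.append(join_tagged_leaves(child, sep))
        elif isinstance(child, tuple):
            children.append(sep.join(str(part) for part in child))
        else:
            children.append(str(child))
    return Tree(tree.label(), children)


# Hand-built stand-in for a chunker result with (word, tag) leaves.
chunked = Tree("S", [Tree("NP", [("the", "DT"), ("cat", "NN")]),
                     Tree("VBD", [("barked", "VBD")])])

flat = join_tagged_leaves(chunked)
flat.pretty_print()
```

The original tree is left untouched, so the (word, tag) tuples remain available for further processing.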

Since we know that Tree.pprint() already joins the tuples of strings into the form we want, i.e.

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))

we can simply output the bracketed parse string and then read it back into a Tree object with Tree.fromstring():

from nltk import Tree
Tree.fromstring(str(result)).pretty_print()

To conclude:

import nltk
from nltk import Tree
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

Tree.fromstring(str(result)).pretty_print()

[out]:

                             S                                      
            _________________|_____________________________          
           NP                        VBD       IN          NP       
   ________|_________________         |        |      _____|____     
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN

  • This answer is underrated; it is one of the best answers I have found on StackOverflow so far. Thanks, alvas! (2 upvotes)