Ale*_*man 2 python environment-variables ghostscript ipython nltk
I was playing with NLTK when I tried to use the chunk module:
import nltk as nk
Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged)
The code runs fine, but when I type
>>> entities
I get the following error message:
Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:
===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================
According to this post, the solution is to install Ghostscript, because the chunker tries to use it to display a parse tree and looks for one of the following three binaries to use:
file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']
However, even though I installed Ghostscript and can now find the binaries with a Windows search, I still get the same error.
What do I need to fix or update?
Additional path information:
import os; print(os.environ['PATH'])
returns:
C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;
In short:
Instead of >>> entities, do this:
>>> print(repr(entities))
Or:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
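Long:
The problem is that you are trying to display the output of ne_chunk, which fires up ghostscript to produce both the string and the drawn representation of the NE-tagged sentence, an nltk.tree.Tree object. Ghostscript is required so that the tree can be visualized with the widget.
Let's walk through this step by step.
First, when you use ne_chunk, you can import it directly at the top level like this: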
from nltk import ne_chunk
It's also advisable to import the specific names you need, i.e.:
from nltk import word_tokenize, pos_tag, ne_chunk
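When you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py
It's unclear what kind of object the pickle loads, but after some inspection we find that there is only one built-in NE chunker that is not rule-based, and since the name of the pickle binary says maxent, we can assume it is a statistical chunker. So it most probably comes from the NEChunkParser object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py. There are ACE data API functions there too, matching the name of the pickle binary.
Now, whenever you use the ne_chunk function, it is actually calling NEChunkParser.parse(), which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118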
class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)
    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree
    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]
        self._tagger = NEChunkParserTagger(train=corpus)
    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])
        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
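To make the B-/I-/O grouping in _tagged_to_parse easier to follow, here is a minimal stand-alone sketch of the same idea. It is not NLTK code: the Chunk namedtuple is a hypothetical stand-in for nltk.tree.Tree, and the tags in the example are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for nltk.tree.Tree: a label plus its child tokens.
Chunk = namedtuple('Chunk', ['label', 'tokens'])


def group_iob(tagged_tokens):
    """Group (token, tag) pairs into chunks using B-/I-/O tags."""
    sent = []
    for tok, tag in tagged_tokens:
        if tag == 'O':
            # Outside any entity: keep the bare token.
            sent.append(tok)
        elif tag.startswith('B-'):
            # Beginning of an entity: open a new chunk.
            sent.append(Chunk(tag[2:], [tok]))
        elif tag.startswith('I-'):
            last = sent[-1] if sent else None
            if isinstance(last, Chunk) and last.label == tag[2:]:
                # Continuation of the chunk that was just opened.
                last.tokens.append(tok)
            else:
                # An I- tag with no matching open chunk starts a new one.
                sent.append(Chunk(tag[2:], [tok]))
    return sent


# Made-up tags, only to show the grouping.
example = [('Betty', 'B-PERSON'), ('Botter', 'I-PERSON'),
           ('bought', 'O'), ('butter', 'O')]
chunks = group_iob(example)
```

The result is a flat list in which entity tokens are wrapped in a labelled chunk and everything else stays a bare token, which mirrors the shape of the Tree that ne_chunk returns.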
If we take a look at the nltk.tree.Tree object, this is where the ghostscript problem appears, when it tries to call the _repr_png_ function: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702:
def _repr_png_(self):
    """
    Draws and outputs in PNG for ipython.
    PNG is used instead of PDF, since it can be displayed in the qt console and
    has wider browser support.
    """
    import os
    import base64
    import subprocess
    import tempfile
    from nltk.draw.tree import tree_to_treesegment
    from nltk.draw.util import CanvasFrame
    from nltk.internals import find_binary
    _canvas_frame = CanvasFrame()
    widget = tree_to_treesegment(_canvas_frame.canvas(), self)
    _canvas_frame.add_widget(widget)
    x, y, w, h = widget.bbox()
    # print_to_file uses scrollregion to set the width and height of the pdf.
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
    with tempfile.NamedTemporaryFile() as file:
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()
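Note, though, that it is strange for the Python interpreter to fire _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see "Purpose of Python's __repr__"). That cannot be how the native CPython interpreter works when printing the representation of an object, so we take a look at IPython.core.formatters, and we see that it allows _repr_png_ to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725: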
class PNGFormatter(BaseFormatter):
    """A PNG formatter.
    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.
    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')
    print_method = ObjectName('_repr_png_')
    _return_type = (bytes, unicode_type)
We also see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66
def _formatters_default(self):
    """Activate the default formatters."""
    formatter_classes = [
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d
Note that outside of IPython, in the native CPython interpreter, only __repr__ is called, not _repr_png_:
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
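The difference between the two prompts can be mimicked with a toy example. This is only a sketch under stated assumptions: FakeTree, plain_display and rich_display are hypothetical names, and IPython's real formatter machinery is considerably more involved than this loop.

```python
class FakeTree(object):
    """Hypothetical stand-in for an object with a text and a PNG repr."""

    def __repr__(self):
        return "FakeTree('S', [])"

    def _repr_png_(self):
        # Simulates a renderer whose external binary cannot be found.
        raise LookupError('renderer binary not found')


def plain_display(obj):
    """What a plain interpreter prompt needs: only repr()."""
    return repr(obj)


def rich_display(obj, methods=('__repr__', '_repr_png_')):
    """Toy rich front end: try every representation method it knows."""
    results = {}
    for name in methods:
        method = getattr(obj, name, None)
        if method is None:
            continue
        try:
            results[name] = method()
        except Exception as exc:
            # Record the failure instead of stopping at the first error.
            results[name] = 'failed: {0}'.format(exc)
    return results


tree = FakeTree()
text_only = plain_display(tree)   # never touches _repr_png_
everything = rich_display(tree)   # also attempts the PNG representation
```

The text representation is available either way; the error only surfaces because the rich front end additionally asks for a PNG.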
So now, the solutions:
Solution 1:
When printing out the string output of ne_chunk, you can use
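>>> print(repr(entities))
instead of >>> entities. That way, IPython should explicitly call only __repr__ instead of calling all possible formatters.
Solution 2:
If you really need to use _repr_png_ to visualize the Tree object, then we need to figure out how to make the ghostscript binary visible to NLTK through its environment variables.
In your case, it seems that by default nltk.internals is unable to find the binary. More specifically, we're referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599
If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it searches the environment variables listed in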
env_vars=['PATH']
When NLTK tries to initialize its environment variables, it looks at os.environ; see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495
Note that find_binary calls find_binary_iter, which calls find_file_iter, which in turn looks up the env_vars by reading os.environ.
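To illustrate what a lookup through PATH amounts to, here is a simplified sketch. It is not NLTK's implementation: search_path_for is a hypothetical helper, and the demo uses a throwaway directory with an empty file named like the binary rather than a real Ghostscript install.

```python
import os
import shutil
import tempfile


def search_path_for(names, path_value=None):
    """Return the first existing file among `names` in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get('PATH', '')
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing separator
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


# Demo: a temporary directory containing an empty file named like the binary.
demo_dir = tempfile.mkdtemp()
fake_binary = os.path.join(demo_dir, 'gswin64c.exe')
open(fake_binary, 'w').close()

missing_dir = os.path.join(demo_dir, 'missing')
demo_path = os.pathsep.join([missing_dir, demo_dir])

found = search_path_for(['gs', 'gswin32c.exe', 'gswin64c.exe'], demo_path)
not_found = search_path_for(['gs'], missing_dir)

shutil.rmtree(demo_dir)
```

Calling search_path_for(['gs', 'gswin32c.exe', 'gswin64c.exe']) without a second argument scans the PATH of the current process, which is a quick way to see whether the directory containing the Ghostscript binaries is actually listed there.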
So if we add the Ghostscript directory to the path:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
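One detail is worth checking whenever a Windows directory is written as a Python string literal: in a normal (non-raw) literal, \b is the escape sequence for a backspace character, so a path ending in \bin may not contain what it appears to. The short check below uses only the tail of the path to show the difference.

```python
plain = "gs9.19\bin"       # "\b" is parsed as a single backspace character
raw = r"gs9.19\bin"        # raw string: the backslash is kept literally
escaped = "gs9.19\\bin"    # doubled backslash: same text as the raw string

has_backspace = "\x08" in plain
same_as_raw = (escaped == raw)
```

Here has_backspace is True and same_as_raw is True, so either a raw string or doubled backslashes gives a directory name that can actually match a folder on disk.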
With the Ghostscript directory on PATH, this should now work in IPython:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
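Since the PATH printed in the question already contains several repeated entries, it may be worth appending the directory only when it is not listed yet. The helper below is a hypothetical sketch (add_to_path is not part of NLTK or the standard library), and the demo runs on a plain dict so the real environment is left alone.

```python
import os


def add_to_path(directory, environ=None):
    """Append `directory` to PATH unless it is already listed.

    Returns True if PATH was changed. Empty PATH entries are dropped
    when the value is rebuilt.
    """
    if environ is None:
        environ = os.environ
    entries = [p for p in environ.get('PATH', '').split(os.pathsep) if p]
    # normcase makes the comparison case-insensitive on Windows.
    known = set(os.path.normcase(p) for p in entries)
    if os.path.normcase(directory) in known:
        return False
    entries.append(directory)
    environ['PATH'] = os.pathsep.join(entries)
    return True


# Demo on a plain dict instead of os.environ.
demo_env = {'PATH': os.pathsep.join(['first', 'second'])}
changed_first = add_to_path('third', demo_env)
changed_again = add_to_path('third', demo_env)
```

In a real session the call would be add_to_path(path_to_gs), which modifies os.environ for the current process only, just like the += line above.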