TypeError: must be unicode, not str in NLTK

Question

I am using Python 2.7, NLTK 3.2.1, and python-crfsuite 0.8.4. I am following this page for the nltk.tag.crf module: http://www.nltk.org/api/nltk.tag.html?hilight=stanford#nltk.tag.stanford.NERTagger

First, I simply ran this:

from nltk.tag import CRFTagger
ct = CRFTagger()
train_data = [[('dfd','dfd')]]
ct.train(train_data,"abc")

I also tried this:

f = open("abc","wb")
ct.train(train_data,f)

But I get the following error:

  File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 129, in <genexpr>
    if all (unicodedata.category(x) in punc_cat for x in token):
TypeError: must be unicode, not str

Answer 1

tri*_*eee 14

In Python 2, regular quotes '...' or "..." create byte strings. To get a Unicode string, put a u prefix in front of the string, for example u'dfd'.
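As a rough sketch of why the prefix matters, the snippet below repeats the kind of check shown in the traceback on tokens written as Unicode literals. The helper name all_punctuation and the exact contents of punc_cat are assumptions made for illustration, not a copy of NLTK's code.

```python
import unicodedata

# Unicode literals: with the u prefix these are unicode objects in Python 2
# (Python 3.3+ accepts the prefix and treats it as a no-op).
train_data = [[(u'dfd', u'dfd')]]

# Illustrative set of Unicode punctuation categories (an assumption here).
punc_cat = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}


def all_punctuation(token):
    """Return True if every character of the token is punctuation.

    unicodedata.category() only accepts unicode characters, which is why
    a byte-string token triggers the TypeError from the question.
    """
    return all(unicodedata.category(x) in punc_cat for x in token)


for sentence in train_data:
    for token, tag in sentence:
        print("%s -> %s" % (token, all_punctuation(token)))
```

With the training data written this way, ct.train(train_data, "abc") from the question should receive unicode tokens and no longer hit this particular check with a byte string.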

To read from a file, you need to specify an encoding. For the options, see "Backporting Python 3 open(encoding="utf-8") to Python 2"; the most straightforward is to replace open() with io.open().
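A minimal sketch of the io.open() approach follows. The file name train.txt and its tab-separated token/tag layout are hypothetical, chosen only so the example is self-contained; UTF-8 is assumed as the encoding.

```python
import io
import os
import tempfile

# Hypothetical training file, written here only so the example can run.
path = os.path.join(tempfile.mkdtemp(), "train.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"dfd\tdfd\n")

# io.open() decodes while reading, so every line is already unicode
# in Python 2 (and a normal str in Python 3).
with io.open(path, encoding="utf-8") as f:
    sentence = [tuple(line.rstrip(u"\n").split(u"\t")) for line in f]

train_data = [sentence]  # one sentence of (token, tag) pairs
```

Because decoding happens at the file boundary, the rest of the program never has to handle byte strings.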

To convert an existing string, use the unicode() function; usually, though, you will want to call decode() and provide an encoding.
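Converting data that already exists as byte strings could look roughly like this. The helper to_unicode is a hypothetical name, and UTF-8 is an assumed encoding; use whatever encoding the bytes were actually written in.

```python
def to_unicode(value, encoding="utf-8"):
    """Decode byte strings; return values that are already unicode unchanged."""
    # In Python 2, bytes is an alias for str, so this catches plain 'dfd'.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte-string training data, as in the question.
byte_data = [[(b'dfd', b'dfd')]]

train_data = [
    [(to_unicode(token), to_unicode(tag)) for token, tag in sentence]
    for sentence in byte_data
]
```

Calling decode() with an explicit encoding avoids relying on the default ASCII codec, which fails on any non-ASCII byte.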

For (much) more detail, Ned Batchelder's "Pragmatic Unicode" slides are recommended, if not outright mandatory, reading: http://nedbatchelder.com/text/unipain.html
