I have the following code:
import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    return to_translate.translate(translate_table)
This works fine for non-Unicode strings:
>>> translate_non_alphanumerics('<foo>!')
'_foo__'
But it fails for Unicode strings:
>>> translate_non_alphanumerics(u'<foo>!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in translate_non_alphanumerics
TypeError: character mapping must return integer, None or unicode
I can't make sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.
How do I make this work for Unicode strings?
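One possible direction, offered as a sketch rather than a definitive fix: the table built by string.maketrans is a 256-character byte string, which only suits str.translate. The Unicode version of translate expects a mapping from code points (integers) to a replacement, which matches the error message above ("must return integer, None or unicode"). The version below builds such a mapping as a plain dict. It assumes the same punctuation list as the original, with the backslash written as `\\` so the literal is unambiguous, and it is written so that it also runs on Python 3, where every str is already Unicode.

```python
def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    # Same character list as in the question; the backslash is escaped explicitly.
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Unicode translate() looks up each character by its code point (an int),
    # so the table is a dict of {ord(char): replacement} instead of a
    # maketrans byte table.
    translate_table = dict((ord(char), translate_to)
                           for char in not_letters_or_digits)
    return to_translate.translate(translate_table)


print(translate_non_alphanumerics(u'<foo>!'))  # prints _foo__
```

Mapping a code point to None instead of a replacement string deletes that character, which may be useful if the punctuation should be dropped rather than replaced.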
I have the following code:
import nltk, os, json, csv, string, cPickle
from scipy.stats import scoreatpercentile
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
def sanitize(wordList):
    answer = [word.translate(None, string.punctuation) for word in wordList]
    answer = [lmtzr.lemmatize(word.lower()) for word in answer]
    return answer
words = []
for filename in json_list:
    words.extend([sanitize(nltk.word_tokenize(' '.join([tweet['text']
                  for tweet in json.load(open(filename,READ))])))])
When I wrote it, I tested lines 2-4 in a separate testing.py file:
import nltk, os, json, csv, string, cPickle
from scipy.stats import scoreatpercentile
wordList= ['\'the', 'the', '"the']
print wordList
wordList2 = [word.translate(None, string.punctuation) for word in wordList]
print wordList2
answer = [lmtzr.lemmatize(word.lower()) for word …