I am using NLTK in combination with scikit-learn's CountVectorizer to stem and tokenize words.
Below is a simple example of using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which prints:
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
Now, let's say I want to remove stop words and stem the words. One option is to do this:
from nltk import word_tokenize
from nltk.stem.porter …

See below: why does += blow away a key from my original counter?
>>> c = Counter({'a': 0, 'b': 0, 'c': 0})
>>> c.items()
[('a', 0), ('c', 0), ('b', 0)]
>>> c += Counter('abba')
>>> c.items()
[('a', 2), ('b', 2)]
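I think that is impolite, to say the least: there is a big difference between "X was counted 0 times" and "we aren't even counting Xs". It seems collections.Counter is not a counter at all; it is more like a multiset.
But counters are a subclass of dict, and we are allowed to construct them with zero or negative values: Counter(a=0, b=-1). If it really is a "bag of things", wouldn't this be prohibited, with init restricted to accepting an iterable of hashable items?
To confuse matters further, Counter has update and subtract methods that behave differently from the + and - operators. It seems this class is having an identity crisis!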
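To make the difference between the methods and the operators concrete, here is a minimal sketch using only the standard library. The variable names c, d and e are illustrative, and the expected results in the comments assume standard CPython behaviour; sorted() is used so the output does not depend on dict ordering.

```python
from collections import Counter

# update() adds counts in place and keeps keys whose count stays at zero.
c = Counter({'a': 0, 'b': 0, 'c': 0})
c.update('abba')
print(sorted(c.items()))  # [('a', 2), ('b', 2), ('c', 0)]

# += adds counts, then discards every key with a non-positive count.
d = Counter({'a': 0, 'b': 0, 'c': 0})
d += Counter('abba')
print(sorted(d.items()))  # [('a', 2), ('b', 2)]

# subtract() can leave zero and negative counts behind...
e = Counter('ab')
e.subtract('aab')
print(sorted(e.items()))  # [('a', -1), ('b', 0)]

# ...whereas the - operator drops them from the result.
print(sorted((Counter('ab') - Counter('aab')).items()))  # []
```

So the methods treat the object like a dict of integers, while the operators treat it like a multiset.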
Is a Counter a dict or a bag?