CountVectorizer gives wrong word counts?

Question


Lod*_*e66 2 python nlp nltk scikit-learn countvectorizer

Suppose my text file contains the following text:

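The quick brown fox jumped over the lazy dogs. A stitch in time saves nine. The quick brown stitch jumped over the lazy time. The fox in time saves a dog.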

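I want to use scikit-learn's CountVectorizer to get the word counts for all the words in the file. (I know there are other ways to do this, but for various reasons I would like to use CountVectorizer.) Here is my code: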

from nltk.corpus import stopwords  # imported, but not used below
from sklearn.feature_extraction.text import CountVectorizer

text = input('Please enter the filepath for the text: ')
text = open(text, 'r', encoding='utf-8')
tokens = CountVectorizer(analyzer='word', stop_words='english')

# Iterating over the file object yields one document per line
X = tokens.fit_transform(text)
dictionary = tokens.vocabulary_


However, when I inspect dictionary, it gives me the wrong counts:

>>> dictionary
{'time': 9, 'dog': 1, 'stitch': 8, 'quick': 6, 'lazy': 5, 'brown': 0, 'saves': 7, 'jumped': 4, 'fox': 3, 'dogs': 2}


Can someone point out the (no doubt obvious) mistake I am making here?

Answer 1

Mos*_*oye 6

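vocabulary_ is a dictionary/mapping from each term to its column index in the document-term matrix, not to its count:

vocabulary_ : A mapping of terms to feature indices.

X is what actually gives you the matrix of feature indices and their corresponding counts.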

>>> for i in X:
...    print(i)
... 
  (0, 1)    1
  (0, 7)    2
  (0, 9)    3
  (0, 8)    2
  (0, 2)    1
  (0, 5)    2
  (0, 4)    2
  (0, 3)    2
  (0, 0)    2
  (0, 6)    2


For example, 9 -> 'time', which has a count of 3.
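
To get an actual word-to-count dictionary, you can combine the two: sum the columns of X and look each total up through vocabulary_. The following is only a minimal sketch. It assumes the sample text is held in memory as a single document rather than read from a file, and the names docs, totals and word_counts are mine, not from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the sample text from the question, as one in-memory document.
docs = [
    "The quick brown fox jumped over the lazy dogs. "
    "A stitch in time saves nine. "
    "The quick brown stitch jumped over the lazy time. "
    "The fox in time saves a dog."
]

vectorizer = CountVectorizer(analyzer="word", stop_words="english")
X = vectorizer.fit_transform(docs)

# Sum every column over all documents: one total per feature index.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index, so use it to look up each total.
word_counts = {
    term: int(totals[index])
    for term, index in vectorizer.vocabulary_.items()
}

print(word_counts["time"])  # 3, matching the (0, 9) entry shown above
```

Note that the indices in vocabulary_ are just the positions of the terms in alphabetical order ('brown' is 0, 'time' is 9), which is why they looked like plausible but wrong counts.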
