AttributeError when using scikit-learn

Ani*_*dey 4 python nltk feature-extraction scikit-learn

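I am trying to find similar questions using cosine similarity with scikit-learn. I am trying out this example code that I found on the internet. Link1 Link2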

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

I always get this error:

Traceback (most recent call last):
  File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
  File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 740, in fit_transform
    raise ValueError("empty vocabulary; training set may have"
ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

I also tried the code from this link. There I got the error AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'.

How can I fix this?

I am using Python 2.7.3 on Windows 7 32-bit with scikit_learn 0.13.1.

Fre*_*Foo 6

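Because I'm running the development (pre-0.14) version, in which the feature_extraction.text module has been overhauled, I don't get the same error message. But I suspect you can fix this with: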

vectorizer = CountVectorizer(stop_words=stopWords, min_df=1)

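The min_df parameter causes CountVectorizer to throw away any term that occurs in too few documents (since such a term would have no predictive value). By default it is set to 2, which means that all of your terms get thrown away, so you end up with an empty vocabulary.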
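
To make the effect of min_df concrete, here is a rough pure-Python sketch of document-frequency filtering. It is not scikit-learn's implementation: `build_vocabulary` is a hypothetical helper, the three-word stop list is a stand-in for NLTK's list, and the token regex only approximates CountVectorizer's default tokenizer.

```python
import re
from collections import Counter


def build_vocabulary(documents, stop_words, min_df):
    """Return the sorted terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        # A set, so each term is counted at most once per document.
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        doc_freq.update(tokens - stop_words)
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


train_set = ["The sky is blue.", "The sun is bright."]
stop_words = {"the", "is", "in"}  # tiny illustrative list, NOT NLTK's

vocab_min_df_2 = build_vocabulary(train_set, stop_words, min_df=2)
vocab_min_df_1 = build_vocabulary(train_set, stop_words, min_df=1)

print(vocab_min_df_2)  # [] : every remaining term occurs in only one document
print(vocab_min_df_1)  # ['blue', 'bright', 'sky', 'sun']
```

The only words that occur in both training sentences are "the" and "is", and those are stop words, so with a threshold of 2 nothing survives.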

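  • The vocabulary is extracted when the fit method is called and is stored as `vocabulary_`, with a trailing `_` (unless the user supplied it as a constructor argument). See the [documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). (2 upvotes)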
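
A small sketch of the point in the comment above, assuming a reasonably recent scikit-learn release under Python 3 (not the 0.13.1 / Python 2.7 setup from the question). It swaps in scikit-learn's built-in `"english"` stop word list for the NLTK one so that it runs without extra downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# Assumption: built-in English stop words instead of NLTK's list.
vectorizer = CountVectorizer(stop_words="english", min_df=1)

# Before fitting, the learned attribute `vocabulary_` does not exist yet.
has_vocab_before_fit = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(train_set)

# After fitting, `vocabulary_` maps each term to its column index.
learned_terms = sorted(vectorizer.vocabulary_)

print(has_vocab_before_fit)
print(learned_terms)
```

Code that reads `vectorizer.vocabulary` (no underscore) is looking at the constructor argument, not the fitted result, which is probably why the linked example fails.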