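I was following a tutorial that is available in Part 1 and Part 2. Unfortunately, the author did not have time for the final section, which would have used cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. The code mentioned in that link is included here (just to make life easier):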
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."] # Documents
test_set = ["The sun in the sky is bright."] # Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
# …
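
For reference, the missing step (cosine similarity between the query and each document) can be sketched without any third-party libraries. This is only a minimal illustration, not the tutorial's or scikit-learn's implementation: it works on raw term counts rather than tf-idf weights, `STOP_WORDS` is a tiny hand-picked stand-in for the NLTK English list, and the helper names `term_counts` and `cosine_similarity` are made up for this example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list.
STOP_WORDS = {"the", "is", "in"}


def term_counts(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as Counters."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)


train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query

query = term_counts(test_set[0])
similarities = [cosine_similarity(term_counts(doc), query) for doc in train_set]

for doc, sim in zip(train_set, similarities):
    print(doc, round(sim, 3))
# The sky is blue. 0.408
# The sun is bright. 0.816
```

The second document scores higher because it shares two terms with the query (sun, bright) while the first shares only one (sky). With tf-idf weights instead of raw counts the numbers would differ, but the calculation (dot product divided by the product of the vector norms) stays the same.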