add*_*ons 84 python information-retrieval machine-learning nltk tf-idf
I was following the tutorial available in Part 1 and Part 2. Unfortunately, the author didn't have time for the final section, which involves using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow; included is the code mentioned in that link (just to make life easier).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."] # Documents
test_set = ["The sun in the sky is bright."] # Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
As a result of the above code, I have the following matrices:
Fit Vectorizer to train set [[1 0 1 0]
[0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.70710678 0. 0.70710678]]
[[ 0. 0.57735027 0.57735027 0.57735027]]
I am not sure how to use this output in order to calculate cosine similarity. I know how to implement cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.
ogr*_*sel 152
First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it all in one operation with TfidfVectorizer:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
with 1787553 stored elements in Compressed Sparse Row format>
Now, to find the cosine distances of one document (e.g. the first in the dataset) to all of the others, you just need to compute the dot products of the first vector with all of the others, since the tfidf vectors are already row-normalized. The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:
>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
with 89 stored elements in Compressed Sparse Row format>
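As a quick sanity check, the same similarities can be computed with a plain sparse dot product (a minimal sketch, relying on the rows already being L2-normalized; sims is just a throwaway name for the result):

>>> import numpy as np
>>> sims = np.asarray(tfidf[0:1].dot(tfidf.T).todense()).ravel()
>>> sims[:3]
array([ 1.        ,  0.04405952,  0.11016969])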
scikit-learn already provides pairwise metrics (a.k.a. kernels in machine-learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Hence, to find the most related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, hence they end up at the end of the sorted indices array; the slice below keeps the top 4):
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
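Note that the query document itself comes back as the top hit; to keep only its neighbours, one option (a small sketch using the same arrays) is to take one extra index and drop the first one:

>>> neighbours = cosine_similarities.argsort()[:-6:-1][1:]   # drop index 0, the query itself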
The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1, and with the following text:
>>> print twenty.data[0]
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
The second most similar document is a reply that quotes the original message, and hence has many words in common:
>>> print twenty.data[958]
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26
In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
> I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.
Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.
> ---- brought to you by your neighborhood Lerxst ----
Rush fan?
--
Robert Seymour rseymour@reed.edu
Physics and Philosophy, Reed College (NeXTmail accepted)
Artificial Life Project Reed College
Reed Solar Energy Project (SolTrain) Portland, OR
add*_*ons 20
With the help of @excray's comment I managed to figure out the answer. What we need to do is actually write a simple for loop to iterate over the two arrays that represent the train data and the test data.
First, implement a simple lambda function to hold the formula for the cosine calculation:
cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
Then just write a simple for loop to iterate over the two vectors; the logic is basically: "for each vector in trainVectorizerArray, you have to find the cosine similarity with the vector in testVectorizerArray."
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine
transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
Here is the output:
Fit Vectorizer to train set [[1 0 1 0]
[0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816
[[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.70710678 0. 0.70710678]]
[[ 0. 0.57735027 0.57735027 0.57735027]]
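The same two values can also be cross-checked with scikit-learn's built-in helper applied to the raw count arrays (a small sketch reusing the trainVectorizerArray and testVectorizerArray variables from the code above):

from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity of each raw count vector in the train set against the query vector;
# this should reproduce the 0.408 and 0.816 printed by the loop above
print cosine_similarity(trainVectorizerArray, testVectorizerArray)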
Gun*_*jan 19
I know it's an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate the cosine similarity with this package, and here is my code for that:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")
train_set = ["president of India",doc1, doc2, doc3]
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set) #finds the tfidf score with normalization
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train) #here the first element of tfidf_matrix_train is matched with other three elements
Here, suppose the query is the first element of train_set and doc1, doc2 and doc3 are the documents which I want to rank with the help of cosine similarity; then I can use this code.
Also, the tutorials provided in the question were very useful. Here are all the parts: part-I, part-II, part-III.
The output will be as follows:
[[ 1. 0.07102631 0.02731343 0.06348799]]
Here, the 1 represents that the query is matched with itself, and the other three values are the scores for matching the query with the respective documents.
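To turn these scores into an actual ranking of doc1, doc2 and doc3, a small sketch on top of the answer's code (assuming the same tfidf_matrix_train variable) could use numpy's argsort:

import numpy as np

cosine_scores = cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train).flatten()
# skip index 0 (the query itself) and sort the documents by descending similarity
ranking = np.argsort(cosine_scores[1:])[::-1] + 1
print "documents ranked by similarity to the query:", ranking
# with the scores shown above this should give [1 3 2], i.e. doc1 is the best match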
Sal*_*ali 17
Let me give you another tutorial, written by me. It answers your question, but it also explains why we are doing some of the things. I have also tried to keep it concise.
So you have a list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document from list_of_documents that is most similar to document.
Let's combine them together: documents = list_of_documents + [document]
Let's start with the dependencies. It will become clear why we use each of them.
from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
One of the approaches that can be used is a bag-of-words approach, where we treat each word in the document independently of the others and just throw all of them together into one big bag. From one point of view this loses a lot of information (such as how the words are connected), but from another point of view it makes the model simple.
In English, and in any other human language, there are a lot of "useless" words like 'a', 'the', 'in', which are so common that they do not carry much meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like 'analyze', 'analyzer' and 'analysis' are really similar. They have a common root and can all be converted to just one word. This process is called stemming, and there exist different stemmers which differ in speed, aggressiveness and so on. So we transform each document into a list of word stems without stop words. We also discard all punctuation.
porter = PorterStemmer()
stop_words = set(stopwords.words('english'))
modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(None, string.punctuation)) if i.lower() not in stop_words] for d in documents]
So how will this bag of words help us? Imagine we have three bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d], ending up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same thing happens with our documents (only the vectors will be much longer). Now we can see that removing a lot of words and stemming others has reduced the dimensionality of the vectors. There is also an interesting observation here: longer documents will have far more positive elements than shorter ones, which is why it is nice to normalize the vector. This is called term frequency (TF); people also use additional information about how often the word is used in other documents, the inverse document frequency (IDF). Together we get the TF-IDF metric, which comes in a couple of flavors. This can be achieved with one line in sklearn :-)
modified_doc = [' '.join(i) for i in modified_arr] # this is only to convert our list of lists to list of strings that vectorizer uses.
tf_idf = TfidfVectorizer().fit_transform(modified_doc)
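As an aside, the toy count-vector example above can be reproduced directly with CountVectorizer (a minimal illustrative sketch; the word-like tokens stand in for a, b, c, d, since the default tokenizer drops single-character tokens):

from sklearn.feature_extraction.text import CountVectorizer

bags = ["apple banana cherry", "apple cherry apple", "banana cherry date"]
counts = CountVectorizer().fit_transform(bags)
print counts.toarray()
# the columns are [apple, banana, cherry, date], so the rows come out as
# [1 1 1 0], [2 0 1 0] and [0 1 1 1], matching the vectors described above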
Actually, the vectorizer allows you to do a lot of things, such as removing stop words and lowercasing. I have done them in a separate step only because sklearn does not have non-English stop words, but nltk does.
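For example, a hedged sketch of pushing the nltk stop-word list straight into the vectorizer instead of doing it in the preprocessing step (the stemming would still need a separate step; documents is the combined list defined above):

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# lowercasing is on by default; stop_words accepts any list of words
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'), lowercase=True)
tf_idf = vectorizer.fit_transform(documents)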
So we have computed the vectors of all the documents. The last step is to find which one is most similar to the last one. There are various ways to achieve that; one of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine similarity between each document and the last one:
l = len(documents) - 1            # index of the query document (the last one)
minimum = (1, None)               # smallest cosine *distance* seen so far, and its document index
for i in xrange(l):
    minimum = min((cosine(tf_idf[i].todense(), tf_idf[l].todense()), i), minimum)
print minimum
Now minimum will contain information about the best document and its score.
Sam*_*Sam 11
This should help you.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# train_set is assumed to hold the documents with the query appended as the last element, e.g.
# train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
# length = len(train_set)

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print tfidf_matrix
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print cosine
And the output will be:
[[ 0.34949812 0.81649658 1. ]]