nad*_*dre 4 python tf-idf cosine-similarity scikit-learn
I'm working with a corpus of roughly 100,000 research papers, considering three fields per paper: plaintext, title, and abstract.
I used TfidfVectorizer to get a TF-IDF representation of the plaintext field, and fed the resulting vocabulary into the vectorizers for the titles and abstracts to make sure all three representations work over the same vocabulary. My reasoning was that since the plaintext field is much larger than the other two, its vocabulary will most likely cover all the words appearing in the other fields. But if that is not the case, how does TfidfVectorizer handle the new words/tokens?
Here's a sample of my code:
vectorizer = TfidfVectorizer(min_df=2)
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_
# later, in another script, after loading the vocab from disk
vectorizer = TfidfVectorizer(min_df=2, vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
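Since the vocabulary is built in one script and reused in another, it has to be serialized in between. A minimal sketch of that round trip, assuming pickle and a placeholder `vocab.pkl` path (the toy documents are made up for illustration):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["sparse matrix factorization", "cosine similarity of tfidf vectors"]

vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(plaintexts)

# persist the learned vocabulary (a plain dict of term -> column index)
with open("vocab.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)

# ...later, in the other script...
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

restored = TfidfVectorizer(min_df=1, vocabulary=vocab)
titles_tfidf = restored.fit_transform(["matrix similarity"])
print(titles_tfidf.shape)  # column count matches the saved vocabulary
```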
The vocabulary has ~900k words.
I had no problems during vectorization, but later, when I tried to compare the similarity between the vectorized titles using sklearn.metrics.pairwise.cosine_similarity, I ran into this error:
>> titles_sim = cosine_similarity(titles_tfidf)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-237-5aa86fe892da> in <module>()
----> 1 titles_sim = cosine_similarity(titles)
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
916 Y_normalized = normalize(Y, copy=True)
917
--> 918 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
919
920 return K
/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
184 ret = a * b
185 if dense_output and hasattr(ret, "toarray"):
--> 186 ret = ret.toarray()
187 return ret
188 else:
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py in toarray(self, order, out)
918 def toarray(self, order=None, out=None):
919 """See the docstring for `spmatrix.toarray`."""
--> 920 return self.tocoo(copy=False).toarray(order=order, out=out)
921
922 ##############################################################
/usr/local/lib/python3.5/dist-packages/scipy/sparse/coo.py in toarray(self, order, out)
256 M,N = self.shape
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
260
ValueError: could not convert integer scalar
I'm not sure whether it's related, but I can't really see what's going wrong here, especially since I didn't hit the error when computing similarities on the plaintext vectors.
Am I missing something? Is there a better way to use the vectorizer?
Edit:
The shapes of the sparse csr_matrices are equal:
>> titles_tfidf.shape
(96582, 852885)
>> plaintexts_tfidf.shape
(96582, 852885)
I'm afraid the matrix is simply too large: it would have 96582 * 96582 = 9,328,082,724 cells. Try slicing titles_tfidf and checking whether the error persists.
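At that scale, one way to sidestep allocating a dense 96582 × 96582 array is to keep the result sparse, or to compute it one block of rows at a time. A sketch with the sizes shrunk so it runs quickly (the random matrix is a stand-in for titles_tfidf):

```python
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for titles_tfidf, shrunk for illustration
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# Option 1: keep the similarity matrix sparse instead of densifying it
sim_sparse = cosine_similarity(X, dense_output=False)

# Option 2: compute one block of rows at a time against the full matrix
block = cosine_similarity(X[:100], X)

print(sim_sparse.shape)  # full n x n result, but stored sparse
print(block.shape)       # one 100-row slice of the result
```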
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
EDIT: If you are on an older SciPy/NumPy version, you may want to update: https://github.com/scipy/scipy/pull/4678
EDIT2: If you are on 32-bit Python, switching to 64-bit might help (I think).
EDIT3: To answer your original question: when you reuse the vocabulary from plaintexts, new words occurring in titles will simply be ignored, but this won't affect the tf-idf values of the known words. Hopefully this snippet makes it easier to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
plaintexts =["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]
vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_
vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())
print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())
The result is:
values using vocabulary
(0, 2) 1.0
(3, 3) 0.78528827571
(3, 2) 0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
(0, 0) 0.78528827571
(0, 4) 0.61913029649
(1, 6) 1.0
(2, 7) 0.57735026919
(2, 2) 0.57735026919
(2, 3) 0.57735026919
(3, 4) 0.486934264074
(3, 1) 0.617614370976
(3, 5) 0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
Note that this is not the same as removing the words that don't occur in plaintexts from the titles before vectorizing.
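To make that difference concrete, here is a sketch (reusing the same toy data) that contrasts the fixed-vocabulary transform with manually stripping unknown words and refitting a fresh vectorizer; the list-comprehension filter is just an illustrative stand-in for such preprocessing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

vectorizer = TfidfVectorizer()
vectorizer.fit(plaintexts)
vocab = vectorizer.vocabulary_

# Fixed vocabulary: the matrix keeps every plaintext column, even all-zero ones
fixed = TfidfVectorizer(vocabulary=vocab).fit_transform(titles)

# Stripping unknown words and refitting shrinks the feature space to the
# words that actually survive in the titles
stripped = [" ".join(w for w in t.split() if w.lower() in vocab) for t in titles]
refit = TfidfVectorizer().fit_transform(stripped)

print(fixed.shape)  # columns = full plaintext vocabulary
print(refit.shape)  # fewer columns: only surviving words
```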
Viewed 7071 times