Oul*_*der 6 python tf-idf cosine-similarity scikit-learn
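My goal is to take 3 queries and find out which of them is most similar to a set of 5 documents.
So far, I have computed the tf-idf of the documents as follows: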
from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)
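The problem I have now is this: I have the tf-idf of the documents, but what do I need to do with the query so that I can compute its cosine similarity to the documents?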
Ven*_*lam 10
Here is my suggestion:
Plug your text-cleaning function directly into TfidfVectorizer through its preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# allDocs is the list of raw document strings; the preprocessor does the cleaning
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: fitted TfidfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc
    return: cosine similarity between query and all docs
    """
    # transform (not fit_transform) reuses the vocabulary and idf weights learned from the docs
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
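To tie this back to the original goal (3 queries against 5 documents), here is a minimal self-contained sketch. The documents and queries are made-up toy data, and it uses the default TfidfVectorizer preprocessing instead of nlp.clean_tf_idf_text, which is not shown in the question. Averaging each query's similarities over the documents is only one possible way to define "most similar to the set"; the question does not say which aggregate is wanted, so treat that step as an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data, not taken from the question.
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "the central bank raised interest rates",
    "a cat and a dog can be friends",
]
queries = [
    "cat on a mat",
    "interest rates and markets",
    "quantum physics lecture",
]

# Fit the vocabulary and idf weights on the documents only.
vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(documents)

# Project all queries into the same tf-idf space in one call.
queries_tfidf = vectorizer.transform(queries)

# Row i holds the cosine similarity of query i to each of the 5 documents.
similarities = cosine_similarity(queries_tfidf, docs_tfidf)

# Assumption: score each query by its mean similarity over all documents.
mean_similarity = similarities.mean(axis=1)
best_query_index = int(np.argmax(mean_similarity))

# For each query, the index of its single closest document.
best_doc_per_query = similarities.argmax(axis=1)

print(similarities.round(3))
print("query most similar to the set:", queries[best_query_index])
```

A query whose words never occur in the documents ends up as an all-zero vector, so its similarity to every document is 0. If you care about the single best match rather than the average, replace mean(axis=1) with max(axis=1).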