Fre*_*Foo 263
The common way of doing this is to transform the documents into tf-idf vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this, in particular Introduction to Information Retrieval, which is free and available online.
Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter, computing cosine similarity is as easy as:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
Or, if the documents are plain strings:
>>> corpus = ["I'd like an apple",
... "An apple a day keeps the doctor away",
... "Never compare an apple to an orange",
... "I prefer scikit-learn to Orange",
... "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T
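For reference, a minimal sketch (not part of the original answer) of how the resulting sparse matrix might be inspected, assuming NumPy is available; pairwise_similarity and corpus continue from the snippet above:

import numpy as np

# densify the sparse result for easy inspection
arr = pairwise_similarity.toarray()

# every document is maximally similar to itself, so mask the diagonal
np.fill_diagonal(arr, np.nan)

# index of the document most similar to the first one
most_similar_idx = np.nanargmax(arr[0])
print(corpus[most_similar_idx])  # closest match to "I'd like an apple"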
Gensim may offer more options for this kind of task, though.
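As an illustration of that route (my own sketch, not from the original answer), the classic Gensim tf-idf similarity workflow might look roughly like this, reusing the corpus of strings from above with naive whitespace tokenization:

from gensim import corpora, models, similarities

texts = [doc.lower().split() for doc in corpus]          # naive tokenization

dictionary = corpora.Dictionary(texts)                   # map tokens to ids
bow_corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words vectors
tfidf = models.TfidfModel(bow_corpus)                    # tf-idf transformation

# build an index over the tf-idf corpus and query it with the first document
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
sims = index[tfidf[bow_corpus[0]]]                       # cosine similarities to doc 0
print(list(enumerate(sims)))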
See also this question.
[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]
Ren*_*aud 84
Same as @larsman, but with some preprocessing:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')  # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))
Kou*_*nha 37
This is an old question, but I found it can be done easily with spaCy. Once a document has been parsed, the simple similarity API can be used to find the cosine similarity between the document vectors.
import spacy
nlp = spacy.load('en')  # in recent spaCy versions, load a model with vectors such as 'en_core_web_md' instead

doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716
Pul*_*yal 17
Generally, the cosine similarity between two documents is used as a similarity measure for documents. In Java you can use Lucene (if your collection is fairly large) or LingPipe to do this. The basic concept is to count the terms in every document and compute the dot product of the term vectors. These libraries provide several improvements over this general approach, e.g. using inverse document frequencies and computing tf-idf vectors. If you want to do something more complex, LingPipe also provides methods to compute LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.
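To make the basic idea concrete, here is a minimal sketch (in Python rather than Java, and not taken from any of the libraries mentioned) of counting terms and computing cosine similarity from the dot product of raw term-count vectors:

from collections import Counter
import math

def cosine_similarity(text1, text2):
    # count terms in each document (naive whitespace tokenization)
    vec1, vec2 = Counter(text1.lower().split()), Counter(text2.lower().split())

    # dot product over the terms common to both documents
    dot = sum(vec1[t] * vec2[t] for t in vec1.keys() & vec2.keys())

    # norms of the term-count vectors
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity("the quick brown fox", "the quick red fox"))  # 0.75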
If you are looking for something very accurate, you need a better tool than tf-idf. The Universal Sentence Encoder is one of the most accurate options for finding the similarity between any two pieces of text. Google provides pretrained models that you can use in your own application without needing to train anything from scratch. First, you have to install tensorflow and tensorflow-hub:
pip install tensorflow
pip install tensorflow_hub
The code below converts each text to a fixed-length vector representation; the inner product of two such vectors then gives their similarity, which is plotted as a heatmap:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# sample text
messages = [
    # Smartphones
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(
        similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    # heatmap() is a small seaborn/matplotlib plotting helper (not shown here)
    heatmap(messages, messages, corr)
As you can see, each text is most similar to itself, followed by the texts that are closest to it in meaning.
Important: the first time you run the code it will be slow, because it needs to download the model. If you want to prevent it from downloading the model again and use a local copy, create a folder for the cache, add it to the environment variable, and then point to that path after the first run:
import os

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir + "/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)
More information: https://tfhub.dev/google/universal-sentence-encoder/2
For syntactic similarity, there are three simple ways of detecting it.
For semantic similarity, you can use BERT embeddings and try different word pooling strategies to get a document embedding, and then apply cosine similarity on the document embeddings.
Research paper link: https://arxiv.org/abs/1904.09675
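As a rough sketch of that idea (my own illustration, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint, neither of which is named in the answer), mean pooling over the token embeddings followed by cosine similarity could look like this:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # tokenize a batch of documents, padding them to the same length
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**enc).last_hidden_state  # (batch, tokens, hidden)
    # mean pooling: average the token vectors, ignoring padding via the attention mask
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["An apple a day keeps the doctor away",
        "Eating fruit regularly is good for your health"]
emb = embed(docs)
print(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item())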
小智 5
Here is a little app to get you started...
import difflib as dl

a = open('file').read()
b = open('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))
To find sentence similarity with a very small dataset and get high accuracy, you can use the Python package below, which uses pretrained BERT models:
pip install similar-sentences