Sha*_*kol 6 python dictionary similarity
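I know that the similarity of two strings can be measured with the following function:
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float between 0 and 1; 1.0 means the strings are identical
    output = SequenceMatcher(None, a, b).ratio()
    return output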
In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0
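But is it possible to measure the similarity of two dictionaries based on their keys and the corresponding values? I don't mean the number of keys they have in common, or which keys those are, but a score from 0 to 1, as in the string example above.
I am trying to find a similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary: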
ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}
I am using Python 2.7.10.
import math

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

def cosine_similarity(vec1, vec2):
    # Pairs the values by position, so both lists are assumed to follow the
    # same key order; only the first len(vec1) entries of vec2 are used.
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x = vec1[i]; y = vec2[i]
        sum11 += x*x
        sum22 += y*y
        sum12 += x*y
    return sum12/math.sqrt(sum11*sum22)

list1 = list(ratings['Shane'].values())
list2 = list(ratings['Joe'].values())
sim = cosine_similarity(list1, list2)
print(sim)
Output:
0.9205746178983233
Update: when I use these ratings instead:
ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}
Output: 0.9574271077563381
Update 2: normalize the vector lengths and take the keys into account
from math import sqrt

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
         'Bob': {'Panic Room':5.0,'Nonstop':5.0}}

def square_rooted(x):
    # Euclidean norm of the vector, rounded to 3 decimal places.
    return round(sqrt(sum([a*a for a in x])),3)

def cosine_similarity(x, y):
    # Use the dictionary with more ratings as the reference vector.
    if len(x) > len(y):
        input1, input2 = x, y
    else:
        input1, input2 = y, x
    vector1 = list(input1.values())
    vector2 = []
    for k in input1.keys():  # Align input2 with the keys of input1.
        if k in input2:
            vector2.append(float(input2[k]))  # The other user's rating for the shared key.
        else:
            vector2.append(float(0))  # A missing rating counts as 0.
    numerator = sum(a*b for a,b in zip(vector2,vector1))
    denominator = square_rooted(vector1)*square_rooted(vector2)
    return round(numerator/float(denominator),3)

print("Similarity between Shane and Joe")
print (cosine_similarity(ratings['Shane'],ratings['Joe']))
print("Similarity between Joe and Bob")
print (cosine_similarity(ratings['Joe'],ratings['Bob']))
print("Similarity between Shane and Bob")
print (cosine_similarity(ratings['Shane'],ratings['Bob']))
Output:
Similarity between Shane and Joe
0.853
Similarity between Joe and Bob
0.346
Similarity between Shane and Bob
0.615
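The function above only walks the keys of the dictionary with more ratings, so a title rated only by the other user (Bob's 'Panic Room', for example) never enters the calculation. As a rough sketch of one alternative, the version below builds both vectors over the union of the keys. The name `cosine_similarity_union` and its `missing` parameter are made up for this illustration, and filling gaps with 0 is an assumption carried over from the code above.

```python
from math import sqrt

def cosine_similarity_union(x, y, missing=0.0):
    """Cosine similarity of two rating dicts over the union of their keys.

    `missing` is the value assumed for a title that one user has not rated.
    """
    keys = sorted(set(x) | set(y))  # fixed order so both vectors line up
    v1 = [float(x.get(k, missing)) for k in keys]
    v2 = [float(y.get(k, missing)) for k in keys]
    numerator = sum(a * b for a, b in zip(v1, v2))
    denominator = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    if denominator == 0:
        return 0.0  # at least one vector is all zeros; similarity is undefined
    return numerator / denominator

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

print(round(cosine_similarity_union(ratings['Shane'], ratings['Joe']), 3))  # 0.853
print(round(cosine_similarity_union(ratings['Joe'], ratings['Bob']), 3))    # 0.245
print(round(cosine_similarity_union(ratings['Shane'], ratings['Bob']), 3))  # 0.435
```

Because both vectors cover every title, the result does not depend on which dictionary is longer, and the score drops when either user has rated titles the other has not.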
A good explanation of the difference between Jaccard and cosine similarity: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity
I am using Python 3.4.
Note: I assigned 0 to missing values, but you could assign some other suitable value instead. See: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
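For comparison, here is a minimal sketch of Jaccard similarity applied to the same kind of rating dictionaries. It looks only at which titles were rated and ignores the rating values, so on its own it does not give the value-aware score the question asks for. The function name `jaccard_similarity` and the choice to return 1.0 for two empty dictionaries are assumptions made for this example.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity of the key sets: |intersection| / |union|."""
    keys_x, keys_y = set(x), set(y)
    union = keys_x | keys_y
    if not union:
        return 1.0  # convention chosen here: two empty dicts count as identical
    # float() keeps true division on Python 2 as well as Python 3
    return len(keys_x & keys_y) / float(len(union))

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

print(jaccard_similarity(ratings['Shane'], ratings['Joe']))  # 0.75
print(jaccard_similarity(ratings['Joe'], ratings['Bob']))    # 0.2
print(jaccard_similarity(ratings['Shane'], ratings['Bob']))  # 0.25
```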
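As one illustration of a different fill value, the sketch below replaces each missing rating with that user's own mean rating instead of 0. This is only one possible choice, not a recommendation from the linked article. The function name `cosine_with_mean_fill` is hypothetical, and both dictionaries are assumed to be non-empty.

```python
from math import sqrt

def cosine_with_mean_fill(x, y):
    """Cosine similarity over the union of keys, filling gaps with each user's mean."""
    keys = sorted(set(x) | set(y))  # fixed order so both vectors line up
    mean_x = sum(x.values()) / float(len(x))  # assumes x is non-empty
    mean_y = sum(y.values()) / float(len(y))  # assumes y is non-empty
    v1 = [float(x.get(k, mean_x)) for k in keys]
    v2 = [float(y.get(k, mean_y)) for k in keys]
    numerator = sum(a * b for a, b in zip(v1, v2))
    denominator = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return numerator / denominator if denominator else 0.0

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

print(cosine_with_mean_fill(ratings['Shane'], ratings['Joe']))
```

With a mean fill, an unrated title no longer counts as strong disagreement, so scores tend to come out higher than with a 0 fill. Which behaviour is appropriate depends on what a missing rating means in the data.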