Return a "similarity score" based on the similarity of two dictionaries in Python?

Sha*_*kol 6 python dictionary similarity

I know that the similarity between two strings can be obtained with the following function:

from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the sequences are identical.
    return SequenceMatcher(None, a, b).ratio()

In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0

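But is it possible to get a similarity score for two dictionaries, based on how similar their keys and the corresponding values are? I don't mean the number of common keys, or the common keys themselves, but a score from 0 to 1, just like in the string example above.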

I am trying to find the similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary:

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

I am using Python 2.7.10.

bac*_*ack 5

import math

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

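def cosine_similarity(vec1, vec2):
    # Pairs elements by position and reads only the first len(vec1) items of vec2.
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x = vec1[i]
        y = vec2[i]
        sum11 += x*x
        sum22 += y*y
        sum12 += x*y
    return sum12/math.sqrt(sum11*sum22)

# values() drops the keys, so the pairing below depends on dict iteration order.
list1 = list(ratings['Shane'].values())
list2 = list(ratings['Joe'].values())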

sim = cosine_similarity(list1,list2)
print(sim)

Output

o/p : 0.9205746178983233

Update: when I use

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

Output: 0.9574271077563381
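One caveat with the positional version above: values() carries no key information, so which ratings get paired depends on dict iteration order, and the extra movie in the longer dict can shift the pairing. A minimal sketch of one way around this, assuming only movies rated by both users should count, is to align on the shared keys first. The name cosine_on_shared_keys is made up for illustration.

```python
import math

def cosine_on_shared_keys(a, b):
    """Cosine similarity of two {key: number} dicts, using only keys present in both."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0  # assumption: no overlap is scored as 0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# The ratings from the question (Shane gave '127 Hours' a 3.0 there).
ratings = {'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

print(cosine_on_shared_keys(ratings['Shane'], ratings['Joe']))
```

For the question's ratings this gives about 0.9206, the same as the first output above, and the result no longer depends on the order in which the dicts were built.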

Update 2: normalizing the vector lengths and taking the keys into account

from math import sqrt


ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
         'Bob': {'Panic Room':5.0,'Nonstop':5.0}}


def square_rooted(x):
    # Euclidean norm of x, rounded to 3 decimal places.
    return round(sqrt(sum([a*a for a in x])),3)

def cosine_similarity(x,y):
    # Use the larger dict as the reference key set; keys that occur
    # only in the smaller dict are ignored.
    if len(x) > len(y):
        input1 = x
        input2 = y
    else:
        input1 = y
        input2 = x

    vector1 = list(input1.values())
    vector2 = []

    for k in input1.keys():    # Align input2 with input1's keys.
        if k in input2:
            vector2.append(float(input2[k]))   # rating from the smaller dict
        else:
            vector2.append(float(0))           # a missing rating counts as 0

    numerator = sum(a*b for a,b in zip(vector2,vector1))
    denominator = square_rooted(vector1)*square_rooted(vector2)
    if denominator == 0:       # no shared keys (or all-zero ratings)
        return 0.0
    return round(numerator/float(denominator),3)


print("Similarity between Shane and Joe")
print(cosine_similarity(ratings['Shane'],ratings['Joe']))

print("Similarity between Joe and Bob")
print(cosine_similarity(ratings['Joe'],ratings['Bob']))

print("Similarity between Shane and Bob")
print(cosine_similarity(ratings['Shane'],ratings['Bob']))

Output:

Similarity between Shane and Joe
0.853
Similarity between Joe and Bob
0.346
Similarity between Shane and Bob
0.615
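The function above takes its keys from the larger dict only, so a movie rated only by the user with fewer ratings (Bob's 'Panic Room', for example) never enters the calculation. A hedged alternative, assuming every movie rated by either user should count and a missing rating is 0, is to build both vectors over the union of the keys. The name cosine_similarity_by_key is hypothetical.

```python
from math import sqrt

def cosine_similarity_by_key(a, b):
    """Cosine similarity of two {key: number} dicts over the union of their keys."""
    keys = sorted(set(a) | set(b))
    va = [float(a.get(k, 0.0)) for k in keys]  # assumption: missing rating -> 0
    vb = [float(b.get(k, 0.0)) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

for u, v in [('Shane', 'Joe'), ('Joe', 'Bob'), ('Shane', 'Bob')]:
    print(u, v, round(cosine_similarity_by_key(ratings[u], ratings[v]), 3))
# Shane Joe 0.853
# Joe Bob 0.245
# Shane Bob 0.435
```

The scores involving Bob drop because his unshared 'Panic Room' rating now contributes to his vector's length. The result is also symmetric in its two arguments and does not depend on which dict is larger.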

A good explanation of the difference between Jaccard and cosine similarity: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity
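For comparison, here is a minimal sketch of Jaccard similarity applied to the same data. It assumes only the keys matter (which movies each user rated), so the rating values are ignored entirely. The function name and the convention for two empty dicts are choices made for this example.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the key sets of two dicts: |A & B| / |A | B|."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 1.0  # convention chosen here: two empty dicts count as identical
    return len(keys_a & keys_b) / float(len(union))

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

print(jaccard_similarity(ratings['Shane'], ratings['Joe']))  # 0.75
print(jaccard_similarity(ratings['Joe'], ratings['Bob']))    # 0.2
print(jaccard_similarity(ratings['Shane'], ratings['Bob']))  # 0.25
```

Jaccard answers "how much do the rated movies overlap?", while cosine also uses the ratings themselves, so the two can rank the same pairs differently.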

I am using Python 3.4.

Note: I assigned 0 to missing values, but you could also assign some other appropriate value. See: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
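To illustrate how much the choice of fill value matters, here is a rough sketch that compares zero-filling with filling each user's missing ratings with that user's own mean rating. Mean imputation is just one possible choice, not a recommendation, and the names cosine_filled and mean_rating are invented for this example.

```python
from math import sqrt

def mean_rating(d):
    """Average of a user's ratings (assumes d is non-empty)."""
    return sum(d.values()) / float(len(d))

def cosine_filled(a, b, fill_a=0.0, fill_b=0.0):
    """Cosine similarity over the union of keys; missing entries use fill_a / fill_b."""
    keys = sorted(set(a) | set(b))
    va = [float(a.get(k, fill_a)) for k in keys]
    vb = [float(b.get(k, fill_b)) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

shane = {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}
bob = {'Panic Room': 5.0, 'Nonstop': 5.0}

zero_fill = cosine_filled(shane, bob)
mean_fill = cosine_filled(shane, bob, mean_rating(shane), mean_rating(bob))
print(round(zero_fill, 3), round(mean_fill, 3))
```

In this example zero-filling gives about 0.44, while mean-filling gives a value above 0.99. Because all ratings here are positive, filling with a mean makes the two vectors point in nearly the same direction, so the fill value should be chosen with the intended meaning of "similar" in mind.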