Python: Jaccard distance using word intersection rather than character intersection

add*_*ons 8 python intersection set

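I didn't realize that Python's set function actually splits a string into individual characters. I wrote a Python function for Jaccard that uses the set intersection method. The function takes two sets, and before passing them in I call set on each string.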

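For example, say I have the string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg. I call set("NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg"), which splits the string into characters. So when I send it to the jaccard function, the intersection compares characters rather than words. How can I do a word-to-word intersection?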

#implementing jaccard
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

If I don't call set on my string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg, I get the following error:

    c = a.intersection(b)
AttributeError: 'str' object has no attribute 'intersection'

Instead of a character-to-character intersection, I want a word-to-word intersection so that I get the Jaccard similarity over words.

Amb*_*ber 9

Try splitting the string into words first:

word_set = set(your_string.split())

Example:

>>> word_set = set("NEW Fujifilm 16MP 5x".split())
>>> character_set = set("NEW Fujifilm 16MP 5x")
>>> word_set
set(['NEW', '16MP', '5x', 'Fujifilm'])
>>> character_set
set([' ', 'f', 'E', 'F', 'i', 'M', 'j', 'm', 'l', 'N', '1', 'P', 'u', 'x', 'W', '6', '5'])
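To connect the split-first suggestion back to the question's jaccard function, here is a minimal sketch. The wrapper name word_jaccard and the two sample titles are hypothetical and used only for illustration; the jaccard function is the one from the question.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (function from the question)."""
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def word_jaccard(s1, s2):
    """Hypothetical wrapper: split each string on whitespace, then compare word sets."""
    return jaccard(set(s1.split()), set(s2.split()))


# Illustrative titles (assumed, not from the original post).
title_a = "NEW Fujifilm 16MP 5x Optical Zoom"
title_b = "Fujifilm 16MP Optical Zoom CAMERA"

# 4 shared words out of 7 distinct words overall.
score = word_jaccard(title_a, title_b)
print(score)
```

Because split() with no arguments splits on any run of whitespace, repeated spaces do not produce empty "words" in the sets.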


小智 7

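My function for computing the Jaccard measure over words (as written it returns the similarity; the Jaccard distance would be 1 minus this value):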

def DistJaccard(str1, str2):
    str1 = set(str1.split())
    str2 = set(str2.split())
    return float(len(str1 & str2)) / len(str1 | str2)

>>> DistJaccard("hola amigo", "chao amigo")
0.333333333333
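A possible variant of the function above, offered only as a sketch: it separates similarity from distance and avoids a division by zero when both strings contain no words. Lowercasing the input and treating two empty strings as identical are assumptions on my part, not something the question asks for, and the function names are hypothetical.

```python
def jaccard_similarity(str1, str2):
    """Word-level Jaccard similarity of two strings."""
    # Assumption: comparison is case-insensitive, so "NEW" matches "new".
    words1 = set(str1.lower().split())
    words2 = set(str2.lower().split())
    union = words1 | words2
    if not union:
        # Assumption: two strings with no words count as identical.
        return 1.0
    return len(words1 & words2) / float(len(union))


def jaccard_distance(str1, str2):
    """Jaccard distance is defined as 1 minus the similarity."""
    return 1.0 - jaccard_similarity(str1, str2)


print(jaccard_similarity("hola amigo", "chao amigo"))
print(jaccard_distance("hola amigo", "chao amigo"))
```

Drop the lower() calls if case should be significant, as it is in the original function.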