String distance matrix in Python using pdist

Mar*_*k W 4 python string jaro-winkler pdist

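How can I compute a Jaro-Winkler distance matrix for strings in Python?

I have a large number of hand-entered strings (names and record numbers), and I am trying to find duplicates in the list, including duplicates with slight spelling variations. An answer to a similar question suggested using SciPy's pdist function with a custom distance function. I tried to implement that solution with the jaro_winkler function from the Levenshtein package. The problem is that jaro_winkler requires string inputs, whereas pdist appears to require a 2D array as input.

Example: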

import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler

fname = np.array(['Bob','Carl','Kristen','Calr', 'Doug']).reshape(-1,1)
dm = pdist(fname, jaro_winkler)
dm = squareform(dm)

Expected output, something like this:

          Bob  Carl   Kristen  Calr  Doug
Bob       1.0   -        -       -     -
Carl      0.0   1.0      -       -     -
Kristen   0.0   0.46    1.0      -     -
Calr      0.0   0.93    0.46    1.0    -
Doug      0.53  0.0     0.0     0.0   1.0

Actual error:

jaro_winkler expected two Strings or two Unicodes

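I assume this is because jaro_winkler is being handed an ndarray rather than a string, and I don't know how to convert the function's inputs to strings in the context of pdist.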
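A quick way to check that assumption is to pass pdist a metric that only records what it receives. This is a minimal sketch; `recording_metric` and `seen` are hypothetical names introduced for illustration, and the behaviour shown relies on pdist accepting a 2D array of strings, as in the example above.

```python
import numpy as np
from scipy.spatial.distance import pdist

seen = []  # one entry per pair that pdist evaluates


def recording_metric(u, v):
    # Record the container type, its shape, and whether the element is a str.
    seen.append((type(u).__name__, u.shape, isinstance(u[0], str)))
    return 0.0  # dummy distance; only the arguments matter here


names = np.array(['Bob', 'Carl', 'Kristen']).reshape(-1, 1)
pdist(names, recording_metric)
print(seen[0])
```

If each argument turns out to be a one-element ndarray whose single item is a string, then indexing with `[0]` inside the metric should be enough to recover the string.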

Does anyone have a suggestion for how to make this work? Thanks in advance!

Zep*_*hro 13

You need to wrap the distance function, as I demonstrate in the example below using the Levenshtein distance:

import numpy as np    
from Levenshtein import distance
from scipy.spatial.distance import pdist, squareform

# my list of strings
strings = ["hello","hallo","choco"]

# prepare a 2-dimensional M x N array (M = 3 entries, N = 1 dimension each)
transformed_strings = np.array(strings).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the Levenshtein distance function
distance_matrix = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))

# get square matrix
print(squareform(distance_matrix))

Output:
array([[ 0.,  1.,  4.],
       [ 1.,  0.,  4.],
       [ 4.,  4.,  0.]])
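The same wrapping idea can be applied to the Jaro-Winkler case from the question. The following is a rough sketch rather than a tested recipe. jaro_winkler returns a similarity (1.0 for identical strings), while squareform produces a zero diagonal, so the sketch converts to a distance for pdist and back to a similarity afterwards. `string_similarity_matrix` is a hypothetical helper name, and the difflib fallback is my own addition so that the snippet still runs without the Levenshtein package; its scores are not Jaro-Winkler scores.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

try:
    from Levenshtein import jaro_winkler as similarity
except ImportError:
    # Stand-in (assumption): NOT Jaro-Winkler, only keeps the sketch runnable.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()


def string_similarity_matrix(strings):
    """Return a square matrix of pairwise similarities with 1.0 on the diagonal."""
    column = np.array(strings).reshape(-1, 1)
    # pdist hands the metric one-element rows, so unwrap them to plain strings.
    # 1 - similarity turns the score into a distance for the condensed matrix.
    condensed = pdist(
        column, lambda u, v: 1.0 - similarity(str(u[0]), str(v[0]))
    )
    # squareform fills the diagonal with 0, so 1 - distance gives 1.0 there.
    return 1.0 - squareform(condensed)


names = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']
sim = string_similarity_matrix(names)
print(np.round(sim, 2))
```

The result is a full symmetric matrix rather than the lower triangle shown in the question. Pairs such as 'Carl' and 'Calr' should score noticeably higher than unrelated names, which is what makes a threshold on this matrix usable for flagging likely duplicates.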