python numpy pairwise edit-distance

Vah*_*ili 5 python lambda numpy scipy pdist

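So, I have a numpy array of strings, and I want to calculate the pairwise edit distance between each pair of elements using this function: scipy.spatial.distance.pdist from http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html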

A sample of my array looks like this:

 >>> d[0:10]
 array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
   'GATTT', 'TCTTT', 'ACTTT'], 
  dtype='|S5')

However, since it has no 'editdistance' option, I want to supply a custom distance function. I tried this and ran into the following error:

 >>> import editdist
 >>> import scipy
 >>> import scipy.spatial
 >>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT

per*_*iae 4

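If you really must use pdist, you first need to convert your strings into a numeric format. If you know that all the strings have the same length, you can do this fairly easily: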

numeric_d = d.view(np.uint8).reshape((len(d),-1))

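This simply views your array of strings as one long array of uint8 bytes, then reshapes it so that each original string sits on its own row. In your example, this looks like: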

In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
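The view trick above relies on each character occupying exactly one byte, which holds for the '|S5' array in the question. On Python 3, NumPy builds string arrays as unicode by default (4 bytes per character), so the same view would yield 20 columns instead of 5. A minimal sketch of one way to handle both cases, assuming ASCII-only strings of equal length (the helper name `strings_to_uint8` is made up for illustration):

```python
import numpy as np

def strings_to_uint8(strings):
    """Return a 2-D uint8 array: one row per string, one column per character.

    Assumes every string is ASCII and all strings have the same length.
    """
    arr = np.asarray(strings)
    if arr.dtype.kind == "U":
        # Unicode arrays store 4 bytes per character; encode to 1-byte ASCII first.
        arr = np.char.encode(arr, "ascii")
    arr = np.ascontiguousarray(arr)
    return arr.view(np.uint8).reshape(len(arr), -1)

d = np.array(["TTTTT", "ATTTT", "CTTTT"])
numeric_d = strings_to_uint8(d)
print(numeric_d)
```

If the strings differ in length, the bytes dtype pads shorter ones with zero bytes, so the rows would need trimming before computing an edit distance.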

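You can then use pdist as you normally would. Just make sure your editdist function expects arrays of integers rather than strings. You can quickly convert the new inputs back by calling .tostring() (spelled .tobytes() in current NumPy):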

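def editdist(x, y):
  # pdist may pass rows as floats (see _convert_to_double in the traceback),
  # so cast back to uint8; .tobytes() replaces the deprecated .tostring()
  s1 = np.asarray(x).astype(np.uint8).tobytes()
  s2 = np.asarray(y).astype(np.uint8).tobytes()
  ... rest of function as before ...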
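Putting the pieces together, the whole approach might look like the sketch below. It is not the original poster's code: the third-party editdist module is replaced by a small pure-Python Levenshtein function so the example is self-contained, and the names `levenshtein` and `row_edit_distance` are illustrative. The cast to uint8 is there because the traceback in the question shows pdist converting its input to double in that SciPy version.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def row_edit_distance(u, v):
    # Rows may arrive as floats, so cast back to uint8 before rebuilding the bytes.
    s1 = np.asarray(u).astype(np.uint8).tobytes()
    s2 = np.asarray(v).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)

d = np.array([b"TTTTT", b"ATTTT", b"CTTTT", b"TATTT", b"AATTT"])
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# Condensed vector of distances for the pairs (0,1), (0,2), ..., (3,4).
condensed = pdist(numeric_d, row_edit_distance)
print(squareform(condensed).astype(int))
```

squareform expands the condensed vector into the full symmetric matrix, which is usually easier to read for a small sample.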

  • ...or just compute the edit distance directly on the `uint8`s. (2 upvotes)
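As a rough illustration of what the comment suggests, the distance can be computed on the integer rows themselves, with no round trip through strings. This is only a sketch: `edit_distance_codes` is a hypothetical name, and the function is a plain Levenshtein table rather than anything provided by SciPy or the editdist module.

```python
import numpy as np

def edit_distance_codes(u, v):
    """Levenshtein distance between two 1-D arrays of character codes."""
    u = np.asarray(u)
    v = np.asarray(v)
    # dist[i, j] holds the edit distance between u[:i] and v[:j].
    dist = np.zeros((len(u) + 1, len(v) + 1), dtype=np.int64)
    dist[:, 0] = np.arange(len(u) + 1)
    dist[0, :] = np.arange(len(v) + 1)
    for i in range(1, len(u) + 1):
        for j in range(1, len(v) + 1):
            substitution = dist[i - 1, j - 1] + (u[i - 1] != v[j - 1])
            dist[i, j] = min(dist[i - 1, j] + 1, dist[i, j - 1] + 1, substitution)
    return int(dist[-1, -1])

a = np.array([65, 84, 84, 84, 84], dtype=np.uint8)  # "ATTTT"
b = np.array([84, 65, 84, 84, 84], dtype=np.uint8)  # "TATTT"
print(edit_distance_codes(a, b))
```

Because it only compares element values, a function like this should be usable as the metric argument of pdist whether the rows arrive as uint8 or as floats.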