Deb*_*tra 7 python binary numpy xor hamming-distance
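Let a and b be vectors of the same size holding 8-bit integers (0-255). I want to compute the number of bits in which these vectors differ, i.e. the Hamming distance between the vectors formed by concatenating the binary representations of those numbers. For example: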
a = [127,255]
b = [127,240]
Using the numpy library:
np.bitwise_xor(a,b)
# Output: array([ 0, 15])
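What I need now is to take the binary representation of each element of the array above and count the number of 1s across all elements. The example above gives a Hamming distance of 0 + 4 = 4. Is there a fast and elegant solution in Python?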
Approach #1: We can broadcast the inputs against the eight single-bit masks and count the number of differing bits, like so:
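import numpy as np

def hamming_distance(a, b):
    # Column of the 8 single-bit masks (1, 2, ..., 128); broadcasts against a and b.
    r = (1 << np.arange(8))[:,None]
    return np.count_nonzero( (a & r) != (b & r) )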
Sample run:
In [144]: a = [127,255]
...: b = [127,240]
...:
In [145]: hamming_distance(a, b)
Out[145]: 4
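To make the broadcasting step in Approach #1 more concrete, here is a minimal sketch that spells out the intermediate arrays. The names `bits_a`, `bits_b`, `diff`, `per_element`, and `total` are introduced only for illustration and are not part of the answer's code.

```python
import numpy as np

a = np.array([127, 255])
b = np.array([127, 240])

# Column vector of the eight single-bit masks: 1, 2, 4, ..., 128.
r = (1 << np.arange(8))[:, None]      # shape (8, 1)

# Broadcasting (8, 1) against (2,) yields (8, 2) arrays:
# row i says whether bit i is set in each element.
bits_a = (a & r) != 0
bits_b = (b & r) != 0

# True wherever the two inputs disagree on a bit.
diff = bits_a != bits_b

per_element = diff.sum(axis=0)        # differing bits per vector element
total = int(diff.sum())               # overall Hamming distance
```

For the example inputs the elementwise counts are 0 and 4, which add up to the distance of 4 reported in the sample run.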
Approach #2: Using the bitwise XOR operation, we can find the number of differing bits between a and b:
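def hamming_distance_v2(a, b):
    # Column of the 8 single-bit masks (1, 2, ..., 128).
    r = (1 << np.arange(8))[:,None]
    # A set bit in a ^ b marks a position where a and b differ.
    return np.count_nonzero((np.bitwise_xor(a,b) & r) != 0)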
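One way to gain confidence in the XOR-based version is to compare it against a slow pure-Python reference. The sketch below does that; `hamming_reference` is a hypothetical helper written for this check, and the random seed and vector length are arbitrary choices.

```python
import numpy as np

def hamming_distance_v2(a, b):
    r = (1 << np.arange(8))[:, None]
    return np.count_nonzero((np.bitwise_xor(a, b) & r) != 0)

def hamming_reference(a, b):
    # Hypothetical reference: XOR each pair of ints and count the 1 bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Random 8-bit test vectors (seed and length chosen arbitrarily).
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=32).tolist()
b = rng.integers(0, 256, size=32).tolist()

# Both implementations should agree on any pair of inputs.
assert hamming_distance_v2(a, b) == hamming_reference(a, b)
assert hamming_distance_v2([127, 255], [127, 240]) == 4
```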
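If you are going to call the distance function many times during one execution of a program, you can gain some speed by using a precomputed table of bit counts. Here's (yet another) version of the Hamming distance function: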
# _nbits[k] is the number of 1s in the binary representation of k for 0 <= k < 256.
_nbits = np.array(
[0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3,
4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4,
4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5,
4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4,
5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3,
3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2,
3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5,
6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5,
5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6,
7, 7, 8], dtype=np.uint8)

def hamming_distance1(a, b):
    c = np.bitwise_xor(a, b)
    n = _nbits[c].sum()
    return n
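In the following, a and b are the Python lists of length 32 given in a comment to the question. divakar_hamming_distance() and divakar_hamming_distance_v2() are from @Divakar's answer.
Here are the timings of @Divakar's functions: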
In [116]: %timeit divakar_hamming_distance(a, b)
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [117]: %timeit divakar_hamming_distance_v2(a, b)
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
hamming_distance1(a, b) is a bit faster:
In [118]: %timeit hamming_distance1(a, b)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.42 µs per loop
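On my computer, initializing _nbits takes about 11 µs, so there is no advantage to using hamming_distance1 if you only call the function once. If you call it three or more times, there is a net gain in performance.
If the inputs are already numpy arrays, all the functions are significantly faster: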
In [119]: aa = np.array(a)
In [120]: bb = np.array(b)
In [121]: %timeit divakar_hamming_distance_v2(aa, bb)
The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.72 µs per loop
In [122]: %timeit hamming_distance1(aa, bb)
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.77 µs per loop
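Of course, if you always do that conversion immediately before computing the Hamming distance, the time to do the conversion must be included in the overall timing. However, if you write the code that generates a and b to take advantage of numpy earlier, you might already have them as numpy arrays by the time you compute the Hamming distance.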
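As a rough illustration of converting once up front, the sketch below uses `np.asarray` with an explicit `uint8` dtype. The dtype choice is an assumption based on the stated 0-255 value range; it is not something the timings above depend on.

```python
import numpy as np

# Convert the lists once; uint8 matches the 0-255 value range.
aa = np.asarray([127, 255], dtype=np.uint8)
bb = np.asarray([127, 240], dtype=np.uint8)

# np.asarray hands back the same object when the input is already
# an ndarray of the requested dtype, so repeated calls do not copy.
same_object = np.asarray(aa, dtype=np.uint8) is aa

# The XOR of two uint8 arrays is again uint8.
c = np.bitwise_xor(aa, bb)
```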
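(I also experimented a bit with a 2-d array of precomputed Hamming distances between 8-bit values, an array with shape (256, 256), but the initialization cost is higher and the performance gains are small.)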