Fastest way to convert a list of indices to 2D numpy array of ones

Spc*_*ond 7 python arrays performance numpy

I have a list of indices

a = [
  [1,2,4],
  [0,2,3],
  [1,3,4],
  [0,2]]

What's the fastest way to convert this to a numpy array of ones, where each index shows the position where 1 would occur?

I.e. what I want is:

output = array([
  [0,1,1,0,1],
  [1,0,1,1,0],
  [0,1,0,1,1],
  [1,0,1,0,0]])

I know the max size of the array beforehand. I know I could loop through each list and insert a 1 at each index position, but is there a faster/vectorized way to do this?

My use case could have thousands of rows/cols and I need to do this thousands of times, so the faster the better.

Pau*_*zer 10

How about this:

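import numpy as np

ncol = 5  # number of columns, known in advance
nrow = len(a)
out = np.zeros((nrow, ncol), int)
# row indices: row i repeated len(a[i]) times; column indices: the sublists flattened
out[np.arange(nrow).repeat([*map(len,a)]), np.concatenate(a)] = 1
out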
# array([[0, 1, 1, 0, 1],
#        [1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1],
#        [1, 0, 1, 0, 0]])
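To make the one-liner easier to follow, here is a minimal sketch that builds the two index arrays separately, using the example list from the question. The names `lengths`, `rows` and `cols` are introduced only for illustration.

```python
import numpy as np

a = [[1, 2, 4], [0, 2, 3], [1, 3, 4], [0, 2]]
ncol = 5  # assumed known in advance, as stated in the question
nrow = len(a)

# Row index of every 1: row i is repeated len(a[i]) times.
lengths = [len(row) for row in a]
rows = np.arange(nrow).repeat(lengths)

# Column index of every 1: the sublists flattened in order.
cols = np.concatenate(a)

out = np.zeros((nrow, ncol), dtype=int)
out[rows, cols] = 1  # sets out[rows[k], cols[k]] = 1 for every k
print(out)
```

One caveat, as far as I can tell: if any sublist is empty, `np.concatenate` may promote the result to float, which cannot be used as an index. Building the column array with an explicit integer dtype (as the `pp` function below does with `np.fromiter`) avoids that.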

Here are timings for a 1000x1000 binary array. Note that I use an optimized version of the above; see the function pp below:

pp 21.717635259992676 ms
ts 37.10938713003998 ms
u9 37.32933565042913 ms

Code to produce the timings:

import itertools as it
import numpy as np

def make_data(n,m):
    I,J = np.where(np.random.random((n,m))<np.random.random((n,1)))
    return [*map(np.ndarray.tolist, np.split(J, I.searchsorted(np.arange(1,n))))]

def pp():
    sz = np.fromiter(map(len,a),int,nrow)
    out = np.zeros((nrow,ncol),int)
    out[np.arange(nrow).repeat(sz),np.fromiter(it.chain.from_iterable(a),int,sz.sum())] = 1
    return out

def ts():
    out = np.zeros((nrow,ncol),int)
    for i, ix in enumerate(a):
        out[i][ix] = 1
    return out

def u9():
    out = np.zeros((nrow,ncol),int)
    for i, (x, y) in enumerate(zip(a, out)):
        y[x] = 1
        out[i] = y
    return out

nrow,ncol = 1000,1000
a = make_data(nrow,ncol)

from timeit import timeit
assert (pp()==ts()).all()
assert (pp()==u9()).all()

print("pp", timeit(pp,number=100)*10, "ms")
print("ts", timeit(ts,number=100)*10, "ms")
print("u9", timeit(u9,number=100)*10, "ms")

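  • From the looks of it, using several `numpy` functions plus `map` would also be slower (can't confirm without trying it) (2 upvotes)
  • @TeshanShanukaJ Does that mean your solution is faster? Do you have timings to back that up? Performance depends on the data, and IMO this scales well (which is also why I upvoted). (2 upvotes)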
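Since the comments point out that performance depends on the data, one way to check is to time both approaches at several fill densities. The following is only a sketch: `make_rows`, `vectorized` and `looped` are hypothetical names, the 200x200 size and the densities are arbitrary choices, and the resulting numbers will vary by machine.

```python
import numpy as np
from timeit import timeit

def make_rows(nrow, ncol, density, rng):
    # Hypothetical helper: each cell is set with probability `density`.
    mask = rng.random((nrow, ncol)) < density
    return [np.flatnonzero(r).tolist() for r in mask]

def vectorized(a, ncol):
    # Same idea as pp() in the answer above, with arguments instead of globals.
    nrow = len(a)
    sz = np.fromiter(map(len, a), int, nrow)
    out = np.zeros((nrow, ncol), int)
    cols = np.fromiter((j for row in a for j in row), int, sz.sum())
    out[np.arange(nrow).repeat(sz), cols] = 1
    return out

def looped(a, ncol):
    # Same idea as ts() in the answer above.
    out = np.zeros((len(a), ncol), int)
    for i, ix in enumerate(a):
        out[i][ix] = 1
    return out

rng = np.random.default_rng(0)
for density in (0.05, 0.2, 0.5):
    a = make_rows(200, 200, density, rng)
    assert (vectorized(a, 200) == looped(a, 200)).all()
    t_vec = timeit(lambda: vectorized(a, 200), number=20)
    t_loop = timeit(lambda: looped(a, 200), number=20)
    # total seconds over 20 calls -> milliseconds per call
    print(f"density={density}: vectorized {t_vec * 50:.3f} ms, loop {t_loop * 50:.3f} ms")
```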

Tes*_*a J 6

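This might not be the fastest way. You will need to compare the execution times of these answers on large arrays to find out which is fastest. Here is my solution: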

output = np.zeros((4,5), int)  # integer dtype so the result matches the output shown below
for i, ix in enumerate(a):
    output[i][ix] = 1

# output ->
# array([[0, 1, 1, 0, 1],
#        [1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1],
#        [1, 0, 1, 0, 0]])