Spc*_*ond 7 python arrays performance numpy
I have a list of indices
a = [
    [1, 2, 4],
    [0, 2, 3],
    [1, 3, 4],
    [0, 2]]
What's the fastest way to convert this to a numpy array of ones, where each index shows the position where 1 would occur?
I.e. what I want is:
output = array([
[0,1,1,0,1],
[1,0,1,1,0],
[0,1,0,1,1],
[1,0,1,0,0]])
I know the maximum size of the array beforehand. I know I could loop through each list and insert a 1 at each index position, but is there a faster, vectorized way to do this?
My use case could have thousands of rows/cols and I need to do this thousands of times, so the faster the better.
Pau*_*zer 10
How about this:
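import numpy as np

ncol = 5
nrow = len(a)
out = np.zeros((nrow, ncol), int)
# Pair every column index with its row index, then set all positions in one assignment
out[np.arange(nrow).repeat([*map(len, a)]), np.concatenate(a)] = 1
out
# array([[0, 1, 1, 0, 1],
#        [1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1],
#        [1, 0, 1, 0, 0]])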
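To see what the one-line assignment does, it can help to build the two index arrays separately. The following is only an illustrative sketch on the question's example data; the names `lengths`, `rows` and `cols` are not part of the original answer.

```python
import numpy as np

a = [[1, 2, 4], [0, 2, 3], [1, 3, 4], [0, 2]]
ncol = 5
nrow = len(a)

# Number of indices in each row of `a`
lengths = [len(row) for row in a]

# Row index repeated once per entry of that row
rows = np.arange(nrow).repeat(lengths)

# All column indices flattened into one array, in the same order
cols = np.concatenate(a)

# Advanced indexing with two equal-length arrays sets out[rows[k], cols[k]] for every k
out = np.zeros((nrow, ncol), dtype=int)
out[rows, cols] = 1
```

Here `rows` and `cols` have one element per entry in `a`, so a single indexed assignment replaces the Python-level loop over rows.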
Here are timings for a 1000x1000 binary array. Note that I used an optimized version of the code above; see the function pp below:
pp 21.717635259992676 ms
ts 37.10938713003998 ms
u9 37.32933565042913 ms
Code that produced the timings:
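import itertools as it
import numpy as np

def make_data(n, m):
    # Random binary pattern in which each row has its own density
    I, J = np.where(np.random.random((n, m)) < np.random.random((n, 1)))
    # Split the column indices into one list per row
    return [*map(np.ndarray.tolist, np.split(J, I.searchsorted(np.arange(1, n))))]

def pp():
    # Vectorized: build row and column index arrays, then assign once
    sz = np.fromiter(map(len, a), int, nrow)
    out = np.zeros((nrow, ncol), int)
    out[np.arange(nrow).repeat(sz), np.fromiter(it.chain.from_iterable(a), int, sz.sum())] = 1
    return out

def ts():
    # Loop over rows, setting each row's columns with fancy indexing
    out = np.zeros((nrow, ncol), int)
    for i, ix in enumerate(a):
        out[i][ix] = 1
    return out

def u9():
    # Loop over rows, modifying each row view and writing it back
    out = np.zeros((nrow, ncol), int)
    for i, (x, y) in enumerate(zip(a, out)):
        y[x] = 1
        out[i] = y
    return out

nrow, ncol = 1000, 1000
a = make_data(nrow, ncol)

from timeit import timeit

# Check that all three implementations agree before timing them
assert (pp() == ts()).all()
assert (pp() == u9()).all()

# 100 runs each; total seconds * 10 gives milliseconds per run
print("pp", timeit(pp, number=100) * 10, "ms")
print("ts", timeit(ts, number=100) * 10, "ms")
print("u9", timeit(u9, number=100) * 10, "ms")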
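The `pp` function above reads `a`, `nrow` and `ncol` from globals. A minimal sketch of the same idea as a self-contained function could look like the following; the name `indices_to_binary` and its `dtype` parameter are hypothetical and only for illustration. Using `np.fromiter` with an explicit integer type keeps the column indices integral even when some rows are empty, a case where `np.concatenate` on plain lists may produce a float array that cannot be used as an index.

```python
import itertools as it
import numpy as np

def indices_to_binary(index_lists, ncol, dtype=int):
    """Hypothetical helper: turn a list of per-row column indices into a 0/1 array."""
    nrow = len(index_lists)
    # Length of each row's index list
    sizes = np.fromiter(map(len, index_lists), int, nrow)
    # Row index repeated once per entry in that row
    rows = np.arange(nrow).repeat(sizes)
    # All column indices flattened, read directly into an integer array
    cols = np.fromiter(it.chain.from_iterable(index_lists), int, sizes.sum())
    out = np.zeros((nrow, ncol), dtype=dtype)
    out[rows, cols] = 1
    return out

# Small example with an empty row in the middle
example = indices_to_binary([[1, 2, 4], [], [0, 2]], ncol=5)
```

Whether this is faster than the row loop for a given workload is something to measure with the timing code above rather than assume.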
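This might not be the fastest way. You will need to compare the execution times of these answers on large arrays to find out which is fastest. Here is my solution: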
output = np.zeros((4, 5), dtype=int)
for i, ix in enumerate(a):
    # Set all listed column positions in row i to 1
    output[i][ix] = 1
# output ->
# array([[0, 1, 1, 0, 1],
#        [1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1],
#        [1, 0, 1, 0, 0]])