如何利用两个大型列表之间的索引关系更快地获取列表?

Swa*_*ani 5 python performance numpy list python-3.x

问题给出了以下两个列表。

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]
Run Code Online (Sandbox Code Playgroud)

在两个尺寸巨大的列表之间(假设引用列表为),使用 list( ) 理解方法创建a一个指示相同元素位于相对 array( ) 中位置的列表。batob

atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b: {atob}")
Run Code Online (Sandbox Code Playgroud)

当列表较小时,不需要很长时间。然而,我意识到获取列表的过程atob相当耗时。

有没有办法更快地获取列表atob?还是目前没有办法?


(回答后编辑。此修订的目的是为了将来的读者。)非常感谢您的回复!根据答案进行代码分析。

  • 检查输出

结果的比较是用 进行的num = 20

import numpy as np
import random as rnd
import time

# set lists
num = 20
a = list(range(num))
# b = [rnd.randint(0, num) for r in range(num)] # Duplicate numbers occur among the elements in the list
b = rnd.sample(range(0, num), num)
print(f"list a: {a}")
print(f"list b: {b}\n")

# set array as same as lists
arr_a = np.array(range(num))
arr_b = np.array(rnd.sample(range(0, num), num))

# --------------------------------------------------------- #
# existing method
ck_time = time.time()
atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b (existed): {atob}, type: {type(atob)}")
print(f"running time (existed): {time.time() - ck_time}\n")
ck_time = time.time()

# dankal444 method
dk = {val: idx for idx, val in enumerate(b)}
atob_dk = [dk.get(n) for n in a] # same as atob_dk = [d.get(n) for n in range(num)]
print(f"index a to b (dankal): {atob_dk}, type: {type(atob_dk)}")
print(f"running time (dankal): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_dk)}\n")
ck_time = time.time()

# smp55 method
comb = np.array([a, b]).transpose()
atob_smp = comb[comb[:, 1].argsort()][:, 0]
print(f"index a to b (smp55): {atob_smp}, type: {type(atob_smp)}")
print(f"running time (smp55): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_smp)}\n")
ck_time = time.time()

# Roman method
from numba import jit
@jit(nopython=True)
def find_argmin(_a, _b):
    out = np.empty_like(_a)  # allocating result array
    for i in range(_a.shape[0]):
        out[i] = np.abs(np.subtract(_b, _a[i])).argmin()
    return out

atob_rom = find_argmin(arr_a, arr_b)
print(f"index a to b (Roman): {atob_rom}, type: {type(atob_rom)}")
print(f"running time (Roman): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_rom)}\n")
ck_time = time.time()

# Alain method
from bisect import bisect_left
ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b))}  # unique first pos
sb   = sorted(ub.items())                                 # sorted to bisect
ib   = (bisect_left(sb,(n,0)) for n in a)                 # index of >= val
rb   = ((sb[i-1],sb[min(i,len(sb)-1)]) for i in ib)       # low and high pairs
atob_ala = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j      # closest index
               for ((lo,i),(hi,j)),n in zip(rb,a) ]
print(f"index a to b (Alain): {atob_ala}, type: {type(atob_ala)}")
print(f"running time (Alain): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ala)}\n")
ck_time = time.time()

# ken method
b_sorted, b_sort_indices = np.unique(b, return_index=True)
def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]

atob_ken = [find_nearest(ai) for ai in a]
print(f"index a to b (ken): {atob_ken}, type: {type(atob_ken)}")
print(f"running time (ken): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ken)}\n")
ck_time = time.time()
Run Code Online (Sandbox Code Playgroud)

上面的代码结果是

list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
list b: [9, 12, 0, 2, 3, 15, 4, 16, 13, 6, 7, 18, 14, 10, 1, 8, 5, 17, 11, 19]

index a to b (existed): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (existed): 0.00024008750915527344

index a to b (dankal): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (dankal): 1.5497207641601562e-05
equal to exist method: True

index a to b (smp55): [ 2 14  3  4  6 16  9 10 15  0 13 18  1  8 12  5  7 17 11 19], type: <class 'numpy.ndarray'>
running time (smp55): 0.00020551681518554688
equal to exist method: True

index a to b (Roman): [17 11  1  6 16 14  9  4  8  3  5 12  7  2 19 15 18 13  0 10], type: <class 'numpy.ndarray'>
running time (Roman): 0.5710980892181396
equal to exist method: False

index a to b (Alain): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (Alain): 3.552436828613281e-05
equal to exist method: True

index a to b (ken): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (ken): 0.00011754035949707031
equal to exist method: True
Run Code Online (Sandbox Code Playgroud)
  • 运行时间增加列表大小

如果我运行代码num = 1000000

running time (dankal): 0.45094847679138184

running time (smp55): 0.36011743545532227

running time (Alain): 2.178112030029297

running time (ken): 2.663684368133545
Run Code Online (Sandbox Code Playgroud)

(使用Roman的方法,很难检查尺寸增加的时间。)

内存的角度也需要检查,但首先,@smp55方法是根据回复所需时间获取列表的最快方法。(我确信还有其他好方法。)

再次感谢大家的关注和回复!!!

(也欢迎后续回复和评论。如果大家有好的想法,也欢迎分享!)

smp*_*p55 3

基于您的第一个列表只是一个索引这一事实,我可以非常快速地给出具体答案。如果将它们组合在二维数组中,则可以按第二个列表排序,这会将第一个列表(第二个列表的索引)按照您想要的结果顺序放置:

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

comb = np.array([a, b]).transpose()
atob = comb[comb[:, 1].argsort()][:,0]
Run Code Online (Sandbox Code Playgroud)

大约需要 0.08 秒。现在,第一项是第一项出现的atob索引。in 的第二项是的第二项的索引,依此类推。baatobba