Efficient way to convert a dictionary of lists into lists of keys and values

tit*_*ata 5 python dictionary python-itertools

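I have a dictionary of lists like the one below (it can contain more than 1M elements; also assume the dictionary is sorted by key):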

import scipy.sparse as sp
d = {0: [0,1], 1: [1,2,3], 
     2: [3,4,5], 3: [4,5,6], 
     4: [5,6,7], 5: [7], 
     6: [7,8,9]}

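I would like to know the most efficient way (the fastest for a large dictionary) to convert it into lists of row and column indices, like: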

r_index = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6]
c_index = [0, 1, 1, 2, 3, 3, 4, 5, 4, 5, 6, 5, 6, 7, 7, 7, 8, 9]

Here are the solutions I have come up with so far:

  1. Using iteration

    row_ind = [k for k, v in d.iteritems() for _ in range(len(v))] # or d.items() in Python 3
    col_ind = [i for ids in d.values() for i in ids]
    
  2. Using the pandas library

    import pandas as pd
    df = pd.DataFrame.from_dict(d, orient='index')
    df = df.stack().reset_index()
    row_ind = list(df['level_0'])
    col_ind = list(df[0])
    
  3. Using itertools

    import itertools
    import numpy as np  # needed for np.array below
    indices = [(x,y) for x, y in itertools.chain.from_iterable([itertools.product((k,), v) for k, v in d.items()])]
    indices = np.array(indices)
    row_ind = indices[:, 0]
    col_ind = indices[:, 1]
    

I am not sure which of these is the fastest when the dictionary has many elements. Thanks!

Mar*_*kus 2

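The first rule of thumb for optimization in Python is to make sure the innermost loop is outsourced to a library function. This applies only to CPython; PyPy is a completely different story. In your case, using `list.extend` gives a significant speedup.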

import time
# Test data: 1000 keys, each mapped to a list of 10000 integers
l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])

def org(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]

def ext(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

def ext_both(d):
    row_ind = []
    for k, v in d.items():
        row_ind.extend([k] * len(v))
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

functions = [org, ext, ext_both]
for func in functions:
    begin = time.time()
    func(x)
    elapsed = time.time() - begin
    print(func.__name__ + ": "  + str(elapsed))

Output with Python 2:

org: 0.512559890747
ext: 0.340406894684
ext_both: 0.149670124054
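
Following the same rule of thumb, both inner loops can also be handed to `itertools`, which is implemented in C in CPython. The sketch below is not part of the benchmark above and has not been timed here; the function name `to_indices` is made up for illustration, and it assumes the dictionary iterates in the desired key order.

```python
from itertools import chain, repeat

def to_indices(d):
    """Flatten a dict of lists into parallel row/column index lists."""
    # repeat(k, len(v)) yields the key once per element of its list;
    # chain.from_iterable concatenates those runs without a Python-level inner loop.
    row_ind = list(chain.from_iterable(repeat(k, len(v)) for k, v in d.items()))
    # The column indices are simply all value lists concatenated in order.
    col_ind = list(chain.from_iterable(d.values()))
    return row_ind, col_ind

d = {0: [0, 1], 1: [1, 2, 3],
     2: [3, 4, 5], 3: [4, 5, 6],
     4: [5, 6, 7], 5: [7],
     6: [7, 8, 9]}
row_ind, col_ind = to_indices(d)
```

Whether this beats the `extend` variants depends on the data and the interpreter, so it is worth timing on your own dictionary before choosing.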