Setting indices in a numpy array using a pandas DataFrame

Question

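I have a pandas DataFrame that holds indices into a numpy array. At those indices, the array's values must be set to 1. I need to do this millions of times on a large numpy array. Is there a more efficient way than the approach shown below?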

from numpy import float32, uint
from numpy.random import choice
from pandas import DataFrame
from timeit import timeit

xy = 2000,300000
sz = 10000000
ind = DataFrame({"i":choice(range(xy[0]),sz),"j":choice(range(xy[1]),sz)}).drop_duplicates()
dtype = uint
repeats = 10

#original (~21s)
stmt = '''\
from numpy import zeros
a = zeros(xy, dtype=dtype)
a[ind.values[:,0],ind.values[:,1]] = 1'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

#suggested by @piRSquared (~13s)
stmt = '''\
from numpy import ones
from scipy.sparse import coo_matrix
i,j = ind.i.values,ind.j.values
# pass shape explicitly; otherwise it is inferred from the largest index present
a = coo_matrix((ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()
'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

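I have edited the post above to show the approach suggested by @piRSquared, rewritten so that the comparison is apples to apples. Regardless of dtype (I tried uint and float32), the suggested approach cuts the time by roughly 40% (from about 21 s to about 13 s).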
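
Since dtype comes up in the comparison, it may help to see how much memory the dense array needs for each candidate dtype. The following is only a rough sketch: it reuses the shape `xy` from the question, and the size of `uint` depends on the platform and NumPy version, so the last line of output is not fixed.

```python
import numpy as np

xy = (2000, 300000)          # array shape from the question
n_elements = xy[0] * xy[1]   # total number of cells in the dense array

# Bytes per element and approximate total size for each dtype.
# np.uint is platform dependent, so its itemsize is looked up rather than assumed.
for dt in (np.uint8, np.float32, np.uint):
    itemsize = np.dtype(dt).itemsize
    gib = n_elements * itemsize / 2**30
    print(f"{np.dtype(dt).name:>8}: {itemsize} bytes/element, about {gib:.2f} GiB")
```

A narrower dtype means less memory to allocate and zero on every repetition, which is worth keeping in mind when comparing timings taken with different dtypes.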

Answer 1

piRSquared (score: 5)

The OP's timing

56.56 s

I can only improve on this marginally with

i, j = ind.i.values, ind.j.values
a[i, j] = 1

New timing

52.19 s

However, you can speed this up considerably by using scipy.sparse.coo_matrix to instantiate a sparse matrix and then converting it to a numpy.array.

import timeit

stmt = '''\
import numpy, pandas
from scipy.sparse import coo_matrix

xy = 2000,300000

sz = 10000000
ind = pandas.DataFrame({"i":numpy.random.choice(range(xy[0]),sz),"j":numpy.random.choice(range(xy[1]),sz)}).drop_duplicates()

################################################
i, j = ind.i.values, ind.j.values
dtype = numpy.uint8
a = coo_matrix((numpy.ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()'''

timeit.timeit(stmt, number=10)

33.06471237000369
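
One caveat when swapping direct assignment for coo_matrix: converting a COO matrix to a dense array sums duplicate (i, j) entries, whereas `a[i, j] = 1` simply writes 1 again. The small sketch below uses made-up indices and a toy shape (both are assumptions for illustration only) to show that the two approaches agree on unique pairs and differ on duplicates, which is why the `drop_duplicates()` call matters here.

```python
import numpy as np
from scipy.sparse import coo_matrix

shape = (3, 4)  # toy shape, chosen only for illustration

def by_assignment(i, j):
    # Direct fancy-index assignment into a zeroed dense array.
    a = np.zeros(shape, dtype=np.uint8)
    a[i, j] = 1
    return a

def by_coo(i, j):
    # Build a sparse COO matrix of ones, then densify it.
    data = np.ones(i.size, dtype=np.uint8)
    return coo_matrix((data, (i, j)), shape=shape).toarray()

# Unique index pairs: both approaches produce the same array.
i = np.array([0, 1, 2])
j = np.array([1, 3, 0])
same_on_unique = np.array_equal(by_assignment(i, j), by_coo(i, j))

# A duplicated pair: assignment still yields 1, COO conversion sums to 2.
i_dup = np.array([0, 0])
j_dup = np.array([1, 1])
assigned_value = by_assignment(i_dup, j_dup)[0, 1]
coo_value = by_coo(i_dup, j_dup)[0, 1]

print(same_on_unique, assigned_value, coo_value)
```

Passing `shape` explicitly also guarantees the result has the intended dimensions even if the largest row or column index happens not to appear in the data.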
