I was wondering if anyone can offer any ideas or advice on the following coding problem. I'm specifically interested in a fast Python implementation (i.e. avoiding Pandas).

I have a set of (dummy example) data such as:
| User | Day | Place | Foo   | Bar   |
|------|-----|-------|-------|-------|
| 1    | 10  | 5     | True  | False |
| 1    | 11  | 8     | True  | False |
| 1    | 11  | 9     | True  | False |
| 2    | 11  | 9     | True  | False |
| 2    | 12  | 1     | False | True  |
| 1    | 12  | 2     | False | True  |
The data covers 2 users ("user1" and "user2") at given day/place combinations, with 2 booleans (here called foo and bar).

I'm only interested in the cases where data is recorded for both users on the same day and at the same place. For those relevant rows of data, I then want to create new columns for the day/place entries that capture each user's foo/bar booleans:
| Day | Place | User 1 Foo | User 1 Bar | User 2 Foo | User 2 Bar |
|-----|-------|------------|------------|------------|------------|
| 11  | 9     | True       | False      | True       | False      |
Each column of data is stored in a numpy array. I appreciate that this is achievable with pivot-table functionality; for example, a Pandas solution is:
import numpy as np
import pandas as pd

user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5,8,9,9,1,2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)
df = pd.DataFrame({
'user': user,
'day': day,
'place': place,
'foo': foo,
'bar': bar,
})
df2 = df.set_index(['day','place']).pivot(columns='user')
df2.columns = ["User1_foo", "User2_foo", "User1_bar", "User2_bar"]
df2 = df2.reset_index()
df2.dropna(inplace=True)
However, in my real-world usage I have many millions of rows of data, and profiling shows that the DataFrame construction and the pivot operation are the performance bottlenecks.

So how can I achieve the same output, i.e. numpy arrays for day, place and user1_foo, user1_bar, user2_foo, user2_bar, only for the cases where data exists for both users on the same day and at the same place in the original input arrays?

I wondered whether finding indices via np.unique and then inverting them might work, but I couldn't make that pan out. Any solutions (preferably fast to execute) are therefore much appreciated!
Approach #1

Here's one approach based on dimensionality reduction for memory efficiency, using np.searchsorted to trace back and look for matching entries between the two users' data -
# Extract array data for efficiency, as we will work with NumPy tools
a = df.to_numpy(copy=False) #Pandas >= 0.24, use df.values otherwise
i = a[:,:3].astype(int)
j = a[:,3:].astype(bool)
# Test out without astype(int),astype(bool) conversions and see how they perform
# Get grouped scalars for Day and place headers combined
# This assumes that Day and Place data are positive integers
g = i[:,2]*(i[:,1].max()+1) + i[:,1]
# Get groups for user1,2 for original and grouped-scalar items
m1 = i[:,0]==1
uj1,uj2 = j[m1],j[~m1]
ui1 = i[m1]
u1,u2 = g[m1],g[~m1]
# Use searchsorted to look for matching ones between user-1,2 grouped scalars
su1 = u1.argsort()
ssu1_idx = np.searchsorted(u1,u2,sorter=su1)
ssu1_idx[ssu1_idx==len(u1)] = 0
ssu1_idxc = su1[ssu1_idx]
match_mask = u1[ssu1_idxc]==u2
match_idx = ssu1_idxc[match_mask]
# Select matching items off original table
p1,p2 = uj1[match_idx],uj2[match_mask]
# Setup output arrays
day_place = ui1[match_idx,1:]
user1_bools = p1
user2_bools = p2
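To sanity-check the steps above, here's a minimal self-contained run of Approach #1 on the question's sample arrays (built here with np.column_stack instead of going through a DataFrame):

```python
import numpy as np

# Sample data from the question
user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5, 8, 9, 9, 1, 2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

i = np.column_stack([user, day, place])  # int table: user, day, place
j = np.column_stack([foo, bar])          # bool table: foo, bar

# Reduce (Day, Place) to one grouped scalar per row (assumes positive ints)
g = i[:, 2] * (i[:, 1].max() + 1) + i[:, 1]

# Split into user-1 and user-2 halves
m1 = i[:, 0] == 1
uj1, uj2 = j[m1], j[~m1]
ui1 = i[m1]
u1, u2 = g[m1], g[~m1]

# searchsorted-based matching between the two users' grouped scalars
su1 = u1.argsort()
ssu1_idx = np.searchsorted(u1, u2, sorter=su1)
ssu1_idx[ssu1_idx == len(u1)] = 0
ssu1_idxc = su1[ssu1_idx]
match_mask = u1[ssu1_idxc] == u2
match_idx = ssu1_idxc[match_mask]

day_place = ui1[match_idx, 1:]
user1_bools, user2_bools = uj1[match_idx], uj2[match_mask]
```

On this input the only shared (day, place) pair is (11, 9), so day_place ends up as a single row [11, 9] with both users' booleans [True, False].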
Approach #1 - Extended: generic Day and Place dtype data

We can extend to the generic case where the Day and Place data are not necessarily positive integers. In that case, we can make use of a dtype-combining, view-based method to perform the data reduction. The only change needed is to obtain g differently: it becomes a view-based array and is obtained like so -
# /sf/answers/3149930661/ @Divakar
def view1D(a): # a is array
a = np.ascontiguousarray(a)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel()
# Get grouped scalars for Day and place headers combined with dtype combined view
g = view1D(i[:,1:])
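As a quick check of the view-based reduction: equal (Day, Place) rows collapse to equal void scalars, which is all the searchsorted step needs. Here np.unique is used just to label the groups:

```python
import numpy as np

def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Rows 0 and 2 are identical (Day, Place) pairs; row 1 differs
x = np.array([[11, 9], [12, 1], [11, 9]])
g = view1D(x)

# Label each row by its group id: identical rows get identical ids
_, ids = np.unique(g, return_inverse=True)
```

So ids[0] == ids[2] while ids[1] is distinct, mirroring what the positive-integer grouped scalar did in Approach #1.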
Approach #2

We will use lex-sorting to group the data in such a way that looking for identical elements in consecutive rows tells us whether there are matching entries between the two users. We will re-use a, i, j from Approach #1. The implementation would be -
# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()
b = i[sidx]
# Get matching conditions on consecutive rows
m = (np.diff(b,axis=0)==[1,0,0]).all(1)
# Or m = (b[:-1,1] == b[1:,1]) & (b[:-1,2] == b[1:,2]) & (np.diff(b[:,0])==1)
# Trace back to original order by using sidx
match1_idx,match2_idx = sidx[:-1][m],sidx[1:][m]
# Index into relevant table and get desired array outputs
day_place,user1_bools,user2_bools = i[match1_idx,1:],j[match1_idx],j[match2_idx]
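A self-contained run of this lex-sort variant on the sample data (arrays built directly, as a quick sketch):

```python
import numpy as np

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

i = np.column_stack([user, day, place])
j = np.column_stack([foo, bar])

# Lexsort: place is the primary key, then day, then user,
# so rows sharing (day, place) become adjacent with user ascending
sidx = np.lexsort(i.T)
b = i[sidx]

# Consecutive rows matching on (day, place) with user going 1 -> 2
m = (np.diff(b, axis=0) == [1, 0, 0]).all(1)

# Trace back to the original order
match1_idx, match2_idx = sidx[:-1][m], sidx[1:][m]
day_place = i[match1_idx, 1:]
user1_bools, user2_bools = j[match1_idx], j[match2_idx]
```

This recovers the same single match as before: original rows 2 (user 1) and 3 (user 2), both at (11, 9).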
Alternatively, we could use an extended mask of m to index into sidx and generate match1_idx, match2_idx. The rest of the code stays the same. Hence, we could do -
from scipy.ndimage import binary_dilation
# Binary extend the mask to have the same length as the input.
# Index into sidx with it. Use one-off offset and stepsize of 2 to get
# user1,2 matching indices
m_ext = binary_dilation(np.r_[m,False],np.ones(2,dtype=bool),origin=-1)
match_idxs = sidx[m_ext]
match1_idx,match2_idx = match_idxs[::2],match_idxs[1::2]
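Rebuilding the same sidx and m on the sample data, the dilation-based indexing can be checked end to end; it should recover the same pair of matching row indices as the diff-based version:

```python
import numpy as np
from scipy.ndimage import binary_dilation

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])

i = np.column_stack([user, day, place])
sidx = np.lexsort(i.T)
m = (np.diff(i[sidx], axis=0) == [1, 0, 0]).all(1)

# Extend each True in m to cover both rows of its matching pair,
# then deinterleave into user-1 and user-2 indices
m_ext = binary_dilation(np.r_[m, False], np.ones(2, dtype=bool), origin=-1)
match_idxs = sidx[m_ext]
match1_idx, match2_idx = match_idxs[::2], match_idxs[1::2]
```

As before, the single match pairs original row 2 (user 1) with row 3 (user 2).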
Approach #3

Here's another one based on Approach #2, ported over to numba for memory and hence performance efficiency. We will re-use a, i, j from Approach #1 -
from numba import njit
@njit
def find_groups_numba(i_s,j_s,user_data,bools):
n = len(i_s)
found_iterID = 0
for iterID in range(n-1):
if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
bools[found_iterID,0] = j_s[iterID,0]
bools[found_iterID,1] = j_s[iterID,1]
bools[found_iterID,2] = j_s[iterID+1,0]
bools[found_iterID,3] = j_s[iterID+1,1]
user_data[found_iterID,0] = i_s[iterID,1]
user_data[found_iterID,1] = i_s[iterID,2]
found_iterID += 1
return found_iterID
# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()
i_s = i[sidx]
j_s = j[sidx]
n = len(i_s)
user_data = np.empty((n//2,2),dtype=i.dtype)
bools = np.empty((n//2,4),dtype=j.dtype)
found_iterID = find_groups_numba(i_s,j_s,user_data,bools)
out_bools = bools[:found_iterID] # Output bool
out_userd = user_data[:found_iterID] # Output user-Day, Place data
Append .copy() to the last two steps if the outputs must have their own memory space.

Alternatively, we can offload the indexing operations back to the NumPy side for a cleaner solution -
@njit
def find_consec_matching_group_indices(i_s,idx):
n = len(i_s)
found_iterID = 0
for iterID in range(n-1):
if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
idx[found_iterID] = iterID
found_iterID += 1
return found_iterID
# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()
i_s = i[sidx]
j_s = j[sidx]
idx = np.empty(len(i_s)//2,dtype=np.uint64)
found_iterID = find_consec_matching_group_indices(i_s,idx)
fidx = idx[:found_iterID]
day_place,user1_bools,user2_bools = i_s[fidx,1:],j_s[fidx],j_s[fidx+1]
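The numba-decorated loop is plain Python underneath, so its logic can be checked on the sample data even without numba installed, simply by dropping the @njit decorator:

```python
import numpy as np

def find_consec_matching_group_indices(i_s, idx):
    # Record positions where consecutive lex-sorted rows share (Day, Place)
    n = len(i_s)
    found_iterID = 0
    for iterID in range(n - 1):
        if i_s[iterID, 1] == i_s[iterID + 1, 1] and i_s[iterID, 2] == i_s[iterID + 1, 2]:
            idx[found_iterID] = iterID
            found_iterID += 1
    return found_iterID

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

i = np.column_stack([user, day, place])
j = np.column_stack([foo, bar])

sidx = np.lexsort(i.T)
i_s, j_s = i[sidx], j[sidx]

idx = np.empty(len(i_s) // 2, dtype=np.uint64)
found_iterID = find_consec_matching_group_indices(i_s, idx)
fidx = idx[:found_iterID]
day_place, user1_bools, user2_bools = i_s[fidx, 1:], j_s[fidx], j_s[fidx + 1]
```

On this input there is exactly one consecutive match in the sorted table, yielding the single (11, 9) row with both users' booleans.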