Lui*_*ier 9 python performance numpy pandas
I have a (very large) pandas DataFrame df:
country age gender
Brazil 10 F
USA 20 F
Brazil 10 F
USA 20 M
Brazil 10 M
USA 20 M
I have another pandas DataFrame, freq:
age gender counting
10 F 0
10 M 0
20 F 0
I want to count how often each pair of values in freq occurs in df:
age gender counting
10 F 2
10 M 1
20 F 1
I am using this code, but it takes too long:
for row in df.itertuples(index=False):
    freq.loc[np.all(freq[['age','gender']]==row[1:3],axis=1),'counting'] += 1
Is there a faster way to do this?
Please note:
Ben*_*n.T 10
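You can use an inner merge to filter out the combinations in df that you don't want, then groupby age and gender and count the rows in each group. Finally, reset_index to match your expected output.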
freq = (df.merge(freq, on=['age', 'gender'], how='inner')
          .groupby(['age','gender'])['counting'].size()
          .reset_index())
print(freq)
age gender counting
0 10 F 2
1 10 M 1
2 20 F 1
Depending on how many combinations you don't want, it can be faster to groupby on df before doing the merge, like:
freq = (df.groupby(['age','gender']).size()
.rename('counting').reset_index()
.merge(freq[['age','gender']])
)
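Both merge variants above use an inner join, so a pair listed in freq that never occurs in df drops out of the result instead of showing a count of 0. Here is a minimal sketch of one way to keep such pairs, assuming a left merge from freq is acceptable. The (23, 'M') row is a hypothetical addition to the question's data, and the names counts and result are illustrative only.

```python
import pandas as pd

# Sample data from the question; freq gets one hypothetical extra pair (23, 'M')
# that never occurs in df.
df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({"age": [10, 10, 20, 23], "gender": ["F", "M", "F", "M"]})

# Count every (age, gender) pair that occurs in df.
counts = df.groupby(["age", "gender"]).size().rename("counting").reset_index()

# A left merge keeps every row of freq, in freq's order; pairs without a match
# get NaN, which is then replaced by 0.
result = freq.merge(counts, on=["age", "gender"], how="left")
result["counting"] = result["counting"].fillna(0).astype(int)
print(result)
```

The groupby still runs once over df, so the cost is essentially the same as for the inner-merge version; only the join direction changes.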
Another approach is to use reindex to filter down to the pairs listed in freq:
df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
Output:
country
gender age
F 10 2
M 10 1
F 20 1
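The reindex idea can also write the counts straight back into freq. The following is a sketch under assumptions rather than the answer's exact code: it uses size() instead of count(), builds the target index with pd.MultiIndex.from_frame, and passes fill_value=0 so that a pair missing from df (the hypothetical (23, 'M') row here) gets 0 rather than NaN.

```python
import pandas as pd

# Sample data from the question; the (23, 'M') row in freq is a hypothetical
# pair that does not occur in df.
df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({"age": [10, 23, 20], "gender": ["F", "M", "F"]})

# Number of rows per (gender, age) pair in df, indexed by a MultiIndex.
sizes = df.groupby(["gender", "age"]).size()

# Look up the pairs listed in freq, in freq's row order; missing pairs get 0.
target = pd.MultiIndex.from_frame(freq[["gender", "age"]])
freq["counting"] = sizes.reindex(target, fill_value=0).to_numpy()
print(freq)
```

Using to_numpy() assigns by position, which sidesteps index alignment between the MultiIndex result and freq's default integer index.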
Mixing in NumPy for (hopefully!) some performance, with the idea of reducing the dimensionality to 1D so that we can bring in the efficient bincount:
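# Stack df and freq so that both share the same ID space
agec = np.r_[df.age,freq.age]
genderc = np.r_[df.gender,freq.gender]
# Map each distinct value to an integer ID
aIDs,aU = pd.factorize(agec)
gIDs,gU = pd.factorize(genderc)
# Collapse each (age, gender) pair into a single 1D ID
cIDs = aIDs*(gIDs.max()+1) + gIDs
# Count the IDs from the df part, then look up the IDs of the freq rows
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]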
Sample run:
In [44]: df
Out[44]:
country age gender
0 Brazil 10 F
1 USA 20 F
2 Brazil 10 F
3 USA 20 M
4 Brazil 10 M
5 USA 20 M
In [45]: freq # introduced a missing element as the second row for variety
Out[45]:
age gender counting
0 10 F 2
1 23 M 0
2 20 F 1
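To make the 1D encoding concrete, here is a self-contained sketch that reproduces the sample run above and exposes the intermediate IDs. The variable names (age_ids, gender_ids, pair_ids, counts) are illustrative and np.concatenate stands in for np.r_; otherwise it follows the same steps.

```python
import numpy as np
import pandas as pd

# Data from the sample run: freq contains (23, 'M'), which is absent from df.
df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({"age": [10, 23, 20], "gender": ["F", "M", "F"]})

# Factorize df and freq together so equal values get equal IDs in both.
age_ids, _ = pd.factorize(
    np.concatenate([df["age"].to_numpy(), freq["age"].to_numpy()]))
gender_ids, _ = pd.factorize(
    np.concatenate([df["gender"].to_numpy(), freq["gender"].to_numpy()]))

# Encode each (age, gender) pair as one integer: age_id * n_gender + gender_id.
n_gender = gender_ids.max() + 1
pair_ids = age_ids * n_gender + gender_ids
print(pair_ids)  # first len(df) entries belong to df, the rest to freq

# Count the pair IDs seen in df, then read off the counts for freq's rows.
counts = np.bincount(pair_ids[:len(df)], minlength=pair_ids.max() + 1)
freq["counting"] = counts[pair_ids[len(df):]]
print(freq)
```

Because the encoding is a mixed-radix number with gender as the low digit, two rows share a pair ID exactly when both age and gender match, which is what lets a single bincount replace the row-by-row comparison.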
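Specific scenario optimization #1
If the age header is known to contain only integers (non-negative ones, since np.bincount does not accept negative values), we can skip one factorize. So, skip aIDs,aU = pd.factorize(agec) and compute cIDs with: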
cIDs = agec*(gIDs.max()+1) + gIDs
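A runnable sketch of this shortcut on the sample data, assuming the ages are non-negative integers; the explicit check and the variable names are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Data from the sample run: freq contains (23, 'M'), which is absent from df.
df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({"age": [10, 23, 20], "gender": ["F", "M", "F"]})

agec = np.concatenate([df["age"].to_numpy(), freq["age"].to_numpy()])
# Assumption of this shortcut: raw ages can be used directly as IDs.
if not (np.issubdtype(agec.dtype, np.integer) and agec.min() >= 0):
    raise ValueError("ages must be non-negative integers for this shortcut")

gender_ids, _ = pd.factorize(
    np.concatenate([df["gender"].to_numpy(), freq["gender"].to_numpy()]))

# Use the raw age in place of a factorized age ID.
pair_ids = agec * (gender_ids.max() + 1) + gender_ids
counts = np.bincount(pair_ids[:len(df)], minlength=pair_ids.max() + 1)
freq["counting"] = counts[pair_ids[len(df):]]
print(freq)
print(len(counts))  # the bincount array now grows with the largest age
```

The trade-off is memory: the count array has one slot per possible encoded value up to the maximum, so this pays off when the integer range is small relative to the number of rows.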