gen*_*ser 3 python algorithm optimization pandas pandas-groupby
我有如下所示的数据框:
import time
import pandas as pd
import numpy as np
N = 3
l = []
for i in range(N):
n = np.random.choice(5)+2
l += [pd.DataFrame(dict(ID = np.repeat(i, n),
t = list(range(n)),
X = np.random.normal(size = n)))]
df = pd.concat(l)
df
Out[85]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 1.269106
0 1 0 0.512957
1 1 1 -1.274963
2 1 2 0.186314
3 1 3 1.243093
0 2 0 0.321971
1 2 1 0.233895
2 2 2 0.293439
Run Code Online (Sandbox Code Playgroud)
我需要设置其最后的值t对每个ID来NaN。现在,我可以通过以下两种方式之一进行操作:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
Run Code Online (Sandbox Code Playgroud)
或者
def f(d):
d.loc[d.t == d.t.max(), 'X'] = np.nan
return d
df = df.groupby('ID').apply(f).reset_index(drop=True)
Run Code Online (Sandbox Code Playgroud)
两者都产生:
df
Out[87]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 NaN
5 1 0 0.512957
6 1 1 -1.274963
7 1 2 0.186314
8 1 3 NaN
9 2 0 0.321971
10 2 1 0.233895
11 2 2 NaN
Run Code Online (Sandbox Code Playgroud)
当数据变大时,它们太慢了。时间大约是线性的。
def sizetry(N, other_way = False):
np.random.seed(0)
l = []
for i in range(N):
n = np.random.choice(5) + 2
l += [pd.DataFrame(dict(ID=np.repeat(i, n),
t=list(range(n)),
X=np.random.normal(size=n)))]
df = pd.concat(l)
start = time.time()
if other_way:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
else:
df = df.groupby('ID').apply(f).reset_index(drop=True)
end = time.time()
return end-start
tvec = [sizetry(2**i) for i in range(15)]
tvec_other = [sizetry(2**i, other_way = True) for i in range(15)]
import matplotlib.pyplot as plt
plt.plot(np.log2(tvec), label = "merge way")
plt.plot(np.log2(tvec_other), label = 'other way')
plt.legend()
plt.show()
Run Code Online (Sandbox Code Playgroud)
我怀疑问题是groupby. 有没有更快的方法来做到这一点?
首先重置您的索引。
df = df.reset_index(drop=True)
Run Code Online (Sandbox Code Playgroud)
然后duplicated()与反转布尔值一起使用。
import numpy as np
df.loc[~df.duplicated(subset=['ID'],keep='last'),'X'] = np.nan
print(df)
ID t X
0 0 0 0.424902
1 0 1 1.597951
2 0 2 1.453884
3 0 3 NaN
4 1 0 0.534653
5 1 1 -0.318361
6 1 2 0.188290
7 1 3 1.157802
8 1 4 NaN
9 2 0 0.186005
10 2 1 0.036017
11 2 2 1.039822
12 2 3 -1.602205
13 2 4 -0.210601
14 2 5 NaN
Run Code Online (Sandbox Code Playgroud)
如果你想最大T改变值,然后使用idxmax()具有groupby
df.loc[df.groupby('ID')['t'].idxmax(),'x'] = np.nan
Run Code Online (Sandbox Code Playgroud)