I have a data frame and a large function, shown below. I want to apply the norm_group function to the data frame's columns, but it takes too long with apply. Is there any way to reduce the run time of this code? It currently takes 24.4 s per loop.
import pandas as pd
import numpy as np
np.random.seed(1234)
n = 1500000
df = pd.DataFrame()
df['group'] = np.random.randint(1700, size=n)
df['ID'] = np.random.randint(5, size=n)
df['s_count'] = np.random.randint(5, size=n)
df['p_count'] = np.random.randint(5, size=n)
df['d_count'] = np.random.randint(5, size=n)
df['Total'] = np.random.randint(400, size=n)
# transform returns a result aligned with the original index (min-max scaling within each group)
df['Normalized_total'] = df.groupby('group')['Total'].transform(lambda x: (x - x.min()) / (x.max() - x.min()))
df['Normalized_total'] = df['Normalized_total'].round(2)
def norm_group(a, b, c, d, e):
    if a >= 0.7 and b >= 1000 and c > 2:
        return "Both High "
    elif a >= 0.7 and b >= 1000 and c < 2:
        return "High and C Low"
    elif a >= 0.4 and b >= 500 and d > 2:
        return "Medium and D High"
    elif a >= 0.4 and b >= 500 and d < 2:
        return "Medium and D Low"
    elif a >= 0.4 and b >= 500 and e > 2:
        return "Medium and E High"
    elif a >= 0.4 and b >= 500 and e < 2:
        return "Medium and E Low"
    else:
        return "Low"
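# norm_group needs all five arguments; mapping c, d, e to s_count, p_count, d_count is an assumption
%timeit df['Category'] = df.apply(lambda x: norm_group(a=x['Normalized_total'], b=x['group'], c=x['s_count'], d=x['p_count'], e=x['d_count']), axis=1)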
24.4 s ± 551 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My original data frame has multiple text columns, and I want to apply a similar kind of function to them, which takes much longer than this one.
Thanks
You can vectorize with np.select:
df['Category'] = np.select((df['Normalized_total'].ge(0.7) & df['group'].ge(1000),
df['Normalized_total'].ge(0.4) & df['group'].ge(500)),
('High', 'Medium'), default='Low'
)
Performance:
255 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
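The np.select call above only distinguishes 'High', 'Medium' and 'Low'. If the full if/elif chain of norm_group is needed, the same idea might be extended as in the sketch below. The helper name norm_group_vectorized is hypothetical, and the mapping of c, d and e to the s_count, p_count and d_count columns is an assumption, since the question never states which columns they refer to.

```python
import numpy as np
import pandas as pd


def norm_group_vectorized(df, c="s_count", d="p_count", e="d_count"):
    """Vectorized sketch of norm_group.

    The default column names for c, d and e are assumptions; pass the
    columns that actually correspond to those arguments.
    """
    a = df["Normalized_total"]
    b = df["group"]

    # Shared parts of the conditions, computed once for the whole column.
    high = a.ge(0.7) & b.ge(1000)
    medium = a.ge(0.4) & b.ge(500)

    # np.select picks the first matching condition for each row,
    # which mirrors the order of the if/elif chain.
    conditions = [
        high & df[c].gt(2),
        high & df[c].lt(2),
        medium & df[d].gt(2),
        medium & df[d].lt(2),
        medium & df[e].gt(2),
        medium & df[e].lt(2),
    ]
    # Labels copied verbatim from norm_group (including the trailing
    # space in "Both High ").
    choices = [
        "Both High ",
        "High and C Low",
        "Medium and D High",
        "Medium and D Low",
        "Medium and E High",
        "Medium and E Low",
    ]
    return np.select(conditions, choices, default="Low")
```

It would be used as df['Category'] = norm_group_vectorized(df). Rows where c equals 2 in the "high" range fall through to the "medium" checks, just as they do in the original function, because every "high" row also satisfies the "medium" thresholds.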