Alo*_*ber 6 performance runtime r
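I've noticed that, depending on the number of variables/features a dataset has, it can be faster to first use pivot_wider (fewer variables) or pivot_longer (more variables). I'd like to know what mainly drives this. My guess is a trade-off between grouping operations (longer format) and binding (wider format), but I wonder whether anyone has more insight. The example below raises each group's ID to random powers: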
Code (many features):
library(dplyr)
library(tidyr)
# 5000 features, 100 observations
df_long = data.frame(grp = rep(1:100, each = 5e3),
                     feat = paste0("V", rep(1:5e3, times = 100)),
                     val = rnorm(5e5))
df_wide = df_long %>% group_by(feat) %>%
  pivot_wider(names_from = feat,
              values_from = val) %>% ungroup()
# Long format: one grouped mutate over a single value column
system.time({
  df_long %>%
    group_by(grp) %>% mutate(val = grp^val)
})
# Wide format: across() applies the function to each of the 5000 columns
system.time({
  df_wide %>% mutate(across(paste0("V", 1:5000), ~grp^.))
})
Results:
   user  system elapsed
  0.028   0.000   0.028
   user  system elapsed
  0.158   0.000   0.158
Code (fewer features):
# 100 features, 5000 observations
df_long = data.frame(grp = rep(1:5e3, each = 100),
                     feat = paste0("V", rep(1:100, times = 5e3)),
                     val = rnorm(5e5))
df_wide = df_long %>% group_by(feat) %>%
  pivot_wider(names_from = feat,
              values_from = val) %>% ungroup()
# Long format: grouped mutate over 5000 groups
system.time({
  df_long %>%
    group_by(grp) %>% mutate(val = grp^val)
})
# Wide format: across() over only 100 columns
system.time({
  df_wide %>% mutate(across(paste0("V", 1:100), ~grp^.))
})
Results:
   user  system elapsed
  0.051   0.000   0.050
   user  system elapsed
  0.025   0.000   0.025
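To see roughly where the crossover falls, the two benchmarks can be folded into one helper and run over several shapes with the same total number of values. This is only a minimal sketch: the helper name `time_formats` and the intermediate shapes (500 x 1000, 1000 x 500) are illustrative assumptions, not part of the measurements above, and a single `system.time()` call is noisy, so the numbers should be read as rough indications.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original post): times the long-format
# and wide-format versions of the same operation for a given shape.
time_formats <- function(n_feat, n_obs) {
  df_long <- data.frame(grp  = rep(seq_len(n_obs), each = n_feat),
                        feat = paste0("V", rep(seq_len(n_feat), times = n_obs)),
                        val  = rnorm(n_feat * n_obs))
  df_wide <- pivot_wider(df_long, names_from = feat, values_from = val)

  # Long format: one grouped mutate over a single value column
  long_s <- system.time(
    df_long %>% group_by(grp) %>% mutate(val = grp^val)
  )[["elapsed"]]

  # Wide format: across() calls the function once per feature column
  wide_s <- system.time(
    df_wide %>% mutate(across(all_of(paste0("V", seq_len(n_feat))), ~ grp^.x))
  )[["elapsed"]]

  data.frame(n_feat = n_feat, n_obs = n_obs, long_s = long_s, wide_s = wide_s)
}

# Same 5e5 values in every run; only the features/observations split changes
shapes <- data.frame(n_feat = c(100, 500, 1000, 5000),
                     n_obs  = c(5000, 1000, 500, 100))
timings <- do.call(rbind, Map(time_formats, shapes$n_feat, shapes$n_obs))
print(timings)
```

Comparing the `long_s` and `wide_s` columns row by row shows how each format scales as the number of groups shrinks and the number of columns grows.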