In R, why do dplyr operations take different amounts of time depending on the data frame's format (long vs. wide)?

Alo*_*ber 6 performance runtime r

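I've noticed that, depending on how many variables/features a dataset has, it can be faster to first use pivot_wider (fewer variables) or pivot_longer (more variables). What is the main reason for this? My guess is that it is a trade-off between grouped operations (longer format) and binding (wider format), but I'd like to know whether anyone has more insight. The example below raises each group's ID to random powers: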

Code (many features):

library(dplyr)
library(tidyr)

# 5000 features, 100 observations

df_long = data.frame(grp=rep(1:100, each=5e3),
                     feat=paste0("V",rep(1:5e3, times=100)), 
                     val=rnorm(5e5))

df_wide = df_long %>% group_by(feat) %>% 
  pivot_wider(names_from = feat, 
              values_from = val) %>% ungroup()

system.time({
  df_long %>% 
    group_by(grp) %>% mutate(val=grp^val) 
})

system.time({
  df_wide %>% mutate(across(paste0("V",1:5000), ~grp^.))
})

Results:

   user  system elapsed 
  0.028   0.000   0.028 

   user  system elapsed 
  0.158   0.000   0.158

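One way to probe the grouping half of that guess: `^` is vectorized, so the long-format result does not actually depend on `group_by(grp)`. The sketch below uses a smaller made-up data set (`toy_long`, 50 groups x 200 features; the name and sizes are my own assumptions, not the data above). It checks that the grouped and ungrouped versions agree and times each one, so any gap can be attributed to the grouping rather than to the arithmetic.

```r
library(dplyr)

set.seed(42)

# Hypothetical toy data: 50 groups x 200 features (not the sizes used above)
toy_long <- data.frame(grp  = rep(1:50, each = 200),
                       feat = paste0("V", rep(1:200, times = 50)),
                       val  = rnorm(50 * 200))

# Same computation with and without grouping
grouped   <- toy_long %>% group_by(grp) %>% mutate(val = grp^val) %>% ungroup()
ungrouped <- toy_long %>% mutate(val = grp^val)

# `^` works element-wise, so both versions should hold the same values
print(all.equal(grouped$val, ungrouped$val))

# Time each variant separately; the difference reflects grouping overhead
print(system.time(toy_long %>% group_by(grp) %>% mutate(val = grp^val)))
print(system.time(toy_long %>% mutate(val = grp^val)))
```

On data this small the timings will be noisy, so repeating the calls or scaling the sizes up is probably needed before reading much into them.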

Code (fewer features):

# 100 features, 5000 observations

df_long = data.frame(grp=rep(1:5e3, each=100),
                     feat=paste0("V",rep(1:100, times=5e3)), 
                     val=rnorm(5e5))

df_wide = df_long %>% group_by(feat) %>% 
  pivot_wider(names_from = feat, 
              values_from = val) %>% ungroup()

system.time({
  df_long %>% 
    group_by(grp) %>% mutate(val=grp^val) 
})

system.time({
  df_wide %>% mutate(across(paste0("V",1:100), ~grp^.))
})

Results:

   user  system elapsed 
  0.051   0.000   0.050 

   user  system elapsed
  0.025   0.000   0.025
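To see what each layout makes dplyr repeat, here is a minimal sketch that counts how often the expression is evaluated. `pow()` and `counter` are hypothetical helpers introduced only for counting, and the 4 x 3 data set is made up. As far as I understand dplyr, a grouped `mutate()` evaluates its expression once per group and `across()` once per selected column, so the long format would pay per group and the wide format per column.

```r
library(dplyr)
library(tidyr)

set.seed(1)

# Hypothetical tiny data: 4 groups (observations) x 3 features
small_long <- data.frame(grp  = rep(1:4, each = 3),
                         feat = paste0("V", rep(1:3, times = 4)),
                         val  = rnorm(12))
small_wide <- pivot_wider(small_long, names_from = feat, values_from = val)

# Hypothetical counting wrapper around `^` (an assumption for illustration)
counter <- new.env()
pow <- function(base, exponent) {
  counter$n <- counter$n + 1L
  base^exponent
}

# Long format: grouped mutate
counter$n <- 0L
res_long <- small_long %>%
  group_by(grp) %>%
  mutate(val = pow(grp, val)) %>%
  ungroup()
calls_long <- counter$n

# Wide format: one across() over all feature columns
counter$n <- 0L
res_wide <- small_wide %>%
  mutate(across(starts_with("V"), ~ pow(grp, .x)))
calls_wide <- counter$n

cat("evaluations, long format:", calls_long, "\n")
cat("evaluations, wide format:", calls_wide, "\n")
```

If that holds on the real data, the 5000-feature case means 100 evaluations in long format versus 5000 in wide format, and the 100-feature case reverses that (5000 versus 100), which would match the direction of the timings above.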