R: subsetting and ordering large data without a for loop

Cpt*_*emo 3 for-loop r split-apply-combine dplyr data.table

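I have a long table with 97M rows. Each row contains information about an action taken by a person and the timestamp of that action, in the following format: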

actions <- c("walk","sleep", "run","eat")
people <- c("John","Paul","Ringo","George")
timespan <- seq(1000,2000,1)

set.seed(28100)
df.in <- data.frame(who = sample(people, 10, replace=TRUE),
                    what = sample(actions, 10, replace=TRUE),
                    when = sample(timespan, 10, replace=TRUE))

df.in
#       who  what when
# 1    Paul   eat 1834
# 2    Paul sleep 1295
# 3    Paul   eat 1312
# 4   Ringo   eat 1635
# 5    John sleep 1424
# 6  George   run 1092
# 7    Paul  walk 1849
# 8    John   run 1854
# 9  George sleep 1036
# 10  Ringo  walk 1823

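Each action may or may not be taken by a given person, and actions can be taken in any order.

I am interested in summarising the sequence of actions in my dataset. In particular, for each person I want to find which action was taken first, second, third and fourth. If an action is taken multiple times, I am only interested in its first occurrence. So if someone runs, eats, eats, runs and sleeps, the summary I am interested in is run, eat, sleep.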
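The "first occurrence only" rule can be illustrated with base R's unique(), which keeps each value the first time it appears and preserves the order of appearance. This is a minimal sketch; the vector `acts` is made up for illustration and assumes the actions are already in chronological order.

```r
# Hypothetical sequence of actions for one person, already in chronological order
acts <- c("run", "eat", "eat", "run", "sleep")

# unique() keeps the first occurrence of each value, in order of appearance
first_seen <- unique(acts)
print(first_seen)
# [1] "run"   "eat"   "sleep"
```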

df.out <- data.frame(who = factor(character(), levels=people),
                     action1 = factor(character(), levels=actions),
                     action2 = factor(character(), levels=actions),
                     action3 = factor(character(), levels=actions),
                     action4 = factor(character(), levels=actions))

I can get what I want with a for loop:

for (person in people) {
  tmp <- subset(df.in, who==person)
  tmp <- tmp[order(tmp$when),]
  chrono_list <- unique(tmp$what)
  df.out <- rbind(df.out, data.frame(who = person,
                                     action1 = chrono_list[1],
                                     action2 = chrono_list[2],
                                     action3 = chrono_list[3],
                                     action4 = chrono_list[4]))
}

df.out
#        who action1 action2 action3 action4
#   1   John   sleep     run    <NA>    <NA>
#   2   Paul   sleep     eat    walk    <NA>
#   3  Ringo     eat    walk    <NA>    <NA>
#   4 George   sleep     run    <NA>    <NA>

Can this result also be obtained more efficiently, without a loop?

akr*_*run 5

We can use dcast from the devel version of data.table, i.e. v1.9.5. It can be installed from here.

library(data.table)#v1.9.5+
dcast(setDT(df.in)[order(when),action:= paste0('action', 1:.N) ,who],
                           who~action, value.var='what')

If you need only the unique 'what' for each 'who':

# Sort by 'when' first so that !duplicated() keeps each action's earliest occurrence
dcast(setDT(df.in)[order(when)][, .SD[!duplicated(what)], who][,
    action := paste0('action', seq_len(.N)), who], who ~ action, value.var = 'what')
#         who action1 action2 action3
#1: George   sleep     run      NA
#2:   John   sleep     run      NA
#3:   Paul   sleep     eat    walk
#4:  Ringo     eat    walk      NA
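Since only the first occurrence in time should count, it matters whether rows are sorted by 'when' before or after duplicates are dropped. The base-R sketch below uses a made-up log for one person (the data frame `acts_log` and its values are hypothetical) in which the rows are not stored chronologically.

```r
# Made-up log for one person; rows are NOT in chronological order
acts_log <- data.frame(what = c("eat", "sleep", "eat", "walk"),
                       when = c(1834, 1295, 1312, 1500),
                       stringsAsFactors = FALSE)

# De-duplicate first, then sort: the later "eat" (1834) is the one kept
dedup <- acts_log[!duplicated(acts_log$what), ]
dedup_then_sort <- dedup[order(dedup$when), "what"]

# Sort first, then de-duplicate: the earliest "eat" (1312) is the one kept
sorted <- acts_log[order(acts_log$when), ]
sort_then_dedup <- sorted[!duplicated(sorted$what), "what"]

print(dedup_then_sort)  # "sleep" "walk"  "eat"
print(sort_then_dedup)  # "sleep" "eat"   "walk"
```

On the question's sample data both orders happen to give the same result, but on a large table whose rows are not already sorted by time they can differ.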

Or using .I, which would be a bit faster:

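 # Sort by reference first so that !duplicated() keeps each action's earliest occurrence
 setorder(setDT(df.in), who, when)
 ind <- df.in[, .I[!duplicated(what)], who]$V1

 dcast(df.in[ind][, action := paste0('action', seq_len(.N)), who],
            who ~ action, value.var = 'what')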

Or use setorder and unique. setorder reorders the dataset by reference, which can be memory efficient.

 dcast(unique(setorder(setDT(df.in), who, when), by=c('who', 'what'))[,
     action:= paste0('action', 1:.N), who], who~action, value.var='what')
 #     who action1 action2 action3
 #1: George   sleep     run      NA
 #2:   John   sleep     run      NA
 #3:   Paul   sleep     eat    walk
 #4:  Ringo     eat    walk      NA
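To make "by reference" concrete, the sketch below contrasts setorder with ordinary subsetting on a tiny made-up table (the table `dt` and its values are hypothetical). Subsetting with order() returns a new sorted table and leaves the original as it was, whereas setorder rearranges the existing table without any assignment, which is what avoids an extra copy on a 97M-row dataset.

```r
library(data.table)

# Tiny made-up table, deliberately unsorted
dt <- data.table(who  = c("Paul", "John", "Paul"),
                 when = c(1834, 1424, 1295))

# Ordinary subsetting returns a new, sorted table; dt itself is unchanged
sorted_copy <- dt[order(who, when)]
unchanged <- identical(dt$when, c(1834, 1424, 1295))

# setorder() reorders dt in place; note there is no assignment
setorder(dt, who, when)
print(dt)
```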