How to extract the first n rows per group?

Con*_*ngo 27 r data.table

I have a data.table dt. It is sorted first by the column date (my grouping variable), then by the column age:

library(data.table)
setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age"
> dt
         date age     name
1: 2000-01-01   3   Andrew
2: 2000-01-01   4      Ben
3: 2000-01-01   5  Charlie
4: 2000-01-02   6     Adam
5: 2000-01-02   7      Bob
6: 2000-01-02   8 Campbell

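My question is: is it possible to extract the first two rows for each unique date? Or, more generally:

How do I extract the first n rows within each group?

In this example, the result dt.f would be: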

> dt.f = ???????? # function of dt to extract the first 2 rows per unique date
> dt.f
         date age   name
1: 2000-01-01   3 Andrew
2: 2000-01-01   4    Ben
3: 2000-01-02   6   Adam
4: 2000-01-02   7    Bob

P.S. Here is the code to create the data.table above:

install.packages("data.table")
library(data.table)
date <- c("2000-01-01","2000-01-01","2000-01-01",
    "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
dt <- data.table(date, age, name)
setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age"

Ric*_*rta 45

Yes, just use .SD and index it as needed.

  DT[, .SD[1:2], by=date]

           date age   name
  1: 2000-01-01   3 Andrew
  2: 2000-01-01   4    Ben
  3: 2000-01-02   6   Adam
  4: 2000-01-02   7    Bob

Edited per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead for speed:

  DT[DT[, .I[1:2], by = date]$V1]

  # using a slightly larger data set
  > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
  Unit: milliseconds
      expr       min        lq    median        uq      max neval
   SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
    IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200

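  • This is the right answer for readability, but if speed is a concern, `dt[dt[, .I[1:2], by = date]$V1]` is much faster (20 upvotes)
  • +1, and `head(.I, 2)` in case any group has only 1 row. (16 upvotes)
  • @Gravitas The speed-up depends heavily on how many dates there are in your `dt`: the more dates, the bigger the speed difference (with 500 dates, the speed-up on my machine is 100x) (2 upvotes)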
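
The `head(.I, 2)` comment matters when a group has fewer than n rows: indexing with `1:2` asks for a second row that does not exist, and data.table fills the gap with `NA`. A minimal sketch, assuming a made-up table `dt_short` (not from the original question) in which one date has only a single row:

```r
library(data.table)

# Hypothetical data: the second date has only one row
dt_short <- data.table(
  date = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  age  = c(3, 4, 5, 6),
  name = c("Andrew", "Ben", "Charlie", "Adam")
)

n <- 2

# 1:n indexes past the end of the short group, so the row index contains an NA
padded <- dt_short[dt_short[, .I[1:n], by = date]$V1]

# head() stops at the end of each group, so no NA rows appear
safe_I  <- dt_short[dt_short[, head(.I, n), by = date]$V1]
safe_SD <- dt_short[, head(.SD, n), by = date]
```

Here `padded` should come back with four rows, the last one all `NA`, while `safe_I` and `safe_SD` each hold the three real rows.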

Hen*_*rik 5

Using rowid:

dt[rowid(date) < 3]
#          date age   name
# 1: 2000-01-01   3 Andrew
# 2: 2000-01-01   4    Ben
# 3: 2000-01-02   6   Adam
# 4: 2000-01-02   7    Bob
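
To go from "first two" to "first n", the same filter can be written with a variable. This is a small sketch rather than part of the original answer; it rebuilds the question's table, and `n` and `first_n` are names introduced here for illustration:

```r
library(data.table)

# Rebuild the example table from the question
dt <- data.table(
  date = c("2000-01-01", "2000-01-01", "2000-01-01",
           "2000-01-02", "2000-01-02", "2000-01-02"),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell")
)
setkeyv(dt, c("date", "age"))

n <- 2  # number of rows to keep per group

# rowid(date) numbers the rows within each date (1, 2, 3, ...),
# so keeping ids <= n keeps the first n rows of every group
first_n <- dt[rowid(date) <= n]
```

Because this is a plain logical filter, a group with fewer than n rows is simply returned whole, with no `NA` padding.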

For larger data, rowid is faster than the by + .SD and by + .I alternatives:

DT = data.table(
                date = rep(1:1e5, each = 10),
                age = runif(1e6),
                name = sample(letters, 1e6, replace = TRUE))

# by + .SD
system.time({r_SD = DT[, .SD[1:2], by=date]})
#    user  system elapsed 
#   14.54    0.56   15.04 

# by + .I
system.time({r_I = DT[DT[, .I[1:2], by = date]$V1]})
#    user  system elapsed 
#    0.15    0.00    0.15 

system.time({r_rowid = DT[rowid(date) < 3]})
#    user  system elapsed 
#    0.01    0.00    0.02 

all.equal(r_I, r_SD)
# [1] TRUE

all.equal(r_I, r_rowid)
# [1] TRUE