Con*_*ngo 27 r data.table
I have a data.table, dt. It is sorted first by the column date (my grouping variable) and then by the column age:
library(data.table)
setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age"
> dt
date age name
1: 2000-01-01 3 Andrew
2: 2000-01-01 4 Ben
3: 2000-01-01 5 Charlie
4: 2000-01-02 6 Adam
5: 2000-01-02 7 Bob
6: 2000-01-02 8 Campbell
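My question: is it possible to extract the first two rows for each unique date? Or, more generally:
How do I extract the first n rows of each group?
In this example, the result dt.f would be: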
> dt.f = ???????? # function of dt to extract the first 2 rows per unique date
> dt.f
date age name
1: 2000-01-01 3 Andrew
2: 2000-01-01 4 Ben
3: 2000-01-02 6 Adam
4: 2000-01-02 7 Bob
P.S. Here is the code that creates the data.table above:
install.packages("data.table")
library(data.table)
date <- c("2000-01-01","2000-01-01","2000-01-01",
"2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
dt <- data.table(date, age, name)
setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age"
Ric*_*rta 45
Yes, just use .SD and index it as needed (DT below stands for the question's dt).
DT[, .SD[1:2], by=date]
date age name
1: 2000-01-01 3 Andrew
2: 2000-01-01 4 Ben
3: 2000-01-02 6 Adam
4: 2000-01-02 7 Bob
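One caveat with fixed 1:2 indexing: when a group has fewer rows than requested, .SD[1:2] pads the result with NA rows. Below is a minimal sketch of a more forgiving variant, assuming head() on .SD fits your use case. The extra 2000-01-03 row and the variable n are hypothetical additions for illustration and are not part of the original question.

```r
library(data.table)

# Toy data from the question, plus one extra date that has only a single row
# (the extra row is an assumption, added to show the short-group case).
dt <- data.table(
  date = c(rep("2000-01-01", 3), rep("2000-01-02", 3), "2000-01-03"),
  age  = c(3, 4, 5, 6, 7, 8, 9),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell", "Dave")
)
setkeyv(dt, c("date", "age"))

n <- 2L  # hypothetical: number of rows to keep per group

# .SD[1:n] asks for row 2 of the single-row group and gets an all-NA row back
padded <- dt[, .SD[1:n], by = date]

# head() stops at the end of each group, so no NA rows are introduced
first_n <- dt[, head(.SD, n), by = date]
```

Here padded has one more row than first_n, and that extra row holds NA in age and name.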
@eddi's suggestion is spot on.
Use the following instead, for speed:
DT[DT[, .I[1:2], by = date]$V1]
# using a slightly larger data set
> library(microbenchmark)
> microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
Unit: milliseconds
expr min lq median uq max neval
SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719 200
IStyle 1.675185 2.018773 2.168818 2.269292 11.31072 200
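The .I approach can be wrapped up for an arbitrary n and arbitrary grouping columns. The following is a rough sketch only: first_n_per_group and its arguments are hypothetical names, not part of data.table, and it assumes the table has no column already called V1 or by_cols.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell")
)
setkeyv(dt, c("date", "age"))

# Hypothetical helper: collect the row numbers (.I) of the first n rows in
# each group, then subset the table once with those row numbers.
first_n_per_group <- function(x, n, by_cols) {
  idx <- x[, head(.I, n), by = by_cols]$V1  # head() avoids NA indices in short groups
  x[idx]
}

dt.f <- first_n_per_group(dt, n = 2L, by_cols = "date")
```

Using head(.I, n) rather than .I[1:n] keeps groups with fewer than n rows from producing NA row numbers.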
Using rowid:
dt[rowid(date) < 3]
# date age name
# 1: 2000-01-01 3 Andrew
# 2: 2000-01-01 4 Ben
# 3: 2000-01-02 6 Adam
# 4: 2000-01-02 7 Bob
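rowid numbers the rows within each group in their current order, so "first" here means first in whatever order the table is sorted. A small sketch of the same idea with n held in a variable, plus rowidv for the case where the grouping columns are given as a character vector; n, by_rowid and by_rowidv are illustrative names only.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell")
)
setkeyv(dt, c("date", "age"))

n <- 2L  # hypothetical: number of rows to keep per group

# rowid() returns 1, 2, 3, ... within each distinct value of date
by_rowid <- dt[rowid(date) <= n]

# rowidv() does the same, with grouping columns named in a character vector
by_rowidv <- dt[rowidv(dt, cols = "date") <= n]
```

To group by several columns, pass them all to rowid (for example rowid(date, name)) or list them in cols.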
For larger data, rowid is faster than the by + .SD and by + .I alternatives:
DT = data.table(
date = rep(1:1e5, each = 10),
age = runif(1e6),
name = sample(letters, 1e6, replace = TRUE))
system.time({r_SD = DT[, .SD[1:2], by=date]})
# user system elapsed
# 14.54 0.56 15.04
system.time({r_I = DT[DT[, .I[1:2], by = date]$V1]})
# user system elapsed
# 0.15 0.00 0.15
system.time({r_rowid = DT[rowid(date) < 3]})
# user system elapsed
# 0.01 0.00 0.02
all.equal(r_I, r_SD)
# [1] TRUE
all.equal(r_I, r_rowid)
# [1] TRUE