Ami*_*tai 2 r frequency data.table
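What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating the NA positions (for other purposes).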
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
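In this toy example, ignoring NA, x is the most common category for id==1 and y is the most common category for id==2.
If you want to ignore the NAs, first exclude them with !is.na(category), then group by id and category (by = .(id, category)) and create a frequency variable with .N: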
dt[!is.na(category), .N, by = .(id, category)]
This gives:
   id category N
1:  1        x 3
2:  2        y 3
3:  2        x 2
4:  1        y 2
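Since the question mentions that a boolean vector of NA positions is already kept around, the same count can presumably reuse it instead of calling is.na() again. The sketch below is only an illustration: the name is_na_cat is hypothetical, and rep_len() is used to write out the recycling of category explicitly rather than relying on data.table to recycle a length-3 vector into 14 rows.

```r
library(data.table)

# Same 14 rows as the toy example, with the recycling written out explicitly.
dt <- data.table(id = rep(1:2, 7),
                 category = rep_len(c("x", "y", NA), 14))

# Hypothetical name for the stored logical vector of NA positions.
is_na_cat <- is.na(dt$category)

# Reuse the stored vector to drop the NA rows, then count each id/category pair.
counts <- dt[!is_na_cat, .N, by = .(id, category)]
print(counts)
```

The vector has to stay aligned with the rows of dt; if dt is reordered or subset afterwards, recomputing it with is.na(category) inside the call is the safer choice.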
Ordering this by id gives a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
This results in:
   id category N
1:  1        x 3
2:  1        y 2
3:  2        y 3
4:  2        x 2
If you only want the row with the top result for each id:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]
Or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
Both give:
   id category N
1:  1        x 3
2:  2        y 3
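Both variants return a single row per id, so if two categories share the top count only the first one encountered is kept. If ties matter, one possible adjustment is to compare N against max(N) within each id. The sketch below uses made-up data (the table tied is hypothetical and not part of the question) in which id 1 has a tie between x and y.

```r
library(data.table)

# Hypothetical data: for id 1, "x" and "y" both occur twice.
tied <- data.table(id = c(1, 1, 1, 1, 2, 2, 2),
                   category = c("x", "x", "y", "y", "x", "y", "y"))

counts <- tied[!is.na(category), .N, by = .(id, category)]

# which.max() returns only the first maximum, so one tied row is dropped.
first_only <- counts[, .SD[which.max(N)], by = id]

# Comparing against max(N) keeps every category that shares the top count.
all_top <- counts[, .SD[N == max(N)], by = id]

print(first_only)
print(all_top)
```

Here first_only has one row per id, while all_top keeps both x and y for id 1.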