Concise R data.table syntax for the modal (most frequent) value by group

Ami*_*tai 2 r frequency data.table

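What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating the NA positions (for other purposes).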

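library(data.table)
# Recycle category explicitly: 14 rows is not a multiple of 3, which newer data.table versions may reject
dt = data.table(id=rep(1:2,7), category=rep(c("x","y",NA), length.out=14))
print(dt)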

In this toy example, ignoring NA, x is the most common category for id==1 and y for id==2.

Jaa*_*aap 5

If you want to ignore the NAs, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)), and create a frequency variable with .N:

 dt[!is.na(category), .N, by = .(id, category)]

This gives:

   id category N
1:  1        x 3
2:  2        y 3
3:  2        x 2
4:  1        y 2

Ordering this by id gives you a clearer picture:

 dt[!is.na(category), .N, by = .(id, category)][order(id)]

This results in:

   id category N
1:  1        x 3
2:  1        y 2
3:  2        y 3
4:  2        x 2

If you only want the row with the top result for each id:

dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]

Or:

dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]

Both give:

   id category N
1:  1        x 3
2:  2        y 3
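One thing to keep in mind: which.max() and head(.SD, 1) both return a single row per id, so if two categories are tied for the highest count only one of them is reported. Below is a minimal sketch of the whole pipeline that also shows a tie-keeping variant. The names is_na, counts, mode_first, mode_ties and the tied table are made up for illustration, and is_na merely stands in for the boolean NA vector mentioned in the question.

```r
library(data.table)

# Toy data from the question, with the recycling of category written out.
dt <- data.table(id = rep(1:2, 7),
                 category = rep(c("x", "y", NA), length.out = 14))

# Hypothetical stand-in for the stored logical vector of NA positions.
is_na <- is.na(dt$category)

# Frequency of each non-NA category within each id.
counts <- dt[!is_na, .N, by = .(id, category)]

# One row per id: which.max() keeps only the first maximum.
mode_first <- counts[, .SD[which.max(N)], by = id]

# Every row that reaches the group maximum, so ties are kept.
mode_ties <- counts[, .SD[N == max(N)], by = id]

# Hypothetical tied counts, to show where the two variants differ.
tied <- data.table(id = 1L, category = c("a", "b"), N = c(2L, 2L))
tied_first <- tied[, .SD[which.max(N)], by = id]  # one row
tied_all   <- tied[, .SD[N == max(N)], by = id]   # two rows

print(mode_first)
print(mode_ties)
```

On the toy data there are no ties, so both variants should return the same two rows. Which one you want depends on whether a single representative per id is enough or every tied category has to be reported.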