Keeping zero-count combinations when aggregating with data.table

jub*_*uba 16 r data.table

Suppose I have the following data.table:

library(data.table)

dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange", "apple", "apple", "tomato", "tomato"),
                 key = "id")

   id sex  fruit
1:  1   H  apple
2:  1   H tomato
3:  1   H  apple
4:  1   H  apple
5:  1   H orange
6:  2   F  apple
7:  2   F  apple
8:  2   F tomato
9:  2   F tomato

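Each row represents the fact that a person (identified by id and sex) ate a fruit. I want to count how many times each fruit was eaten by each sex. I can do it like this: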

dt[ , .N, by = c("fruit", "sex")]

This gives:

    fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2

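The problem is that by doing this I lose the count of orange for sex == "F", because that count is 0. Is there a way to do this aggregation without losing the zero-count combinations?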

To be perfectly clear, the desired result is the following:

   fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2
6: orange   F 0

Thanks a lot!

Jos*_*ien 10

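The most straightforward approach seems to be to explicitly supply all category combinations in a data.table passed to i=, and to set by=.EACHI so that the count is evaluated once for each of them: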

setkey(dt, sex, fruit)
dt[CJ(sex, fruit, unique = TRUE), .N, by = .EACHI]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
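
The same idea can be sketched without re-keying dt, by naming the join columns through on=. This is a minimal sketch rather than part of the original answer: the names combos and res are made up for illustration, and it assumes a data.table version that supports the on= argument.

```r
library(data.table)

dt <- data.table(
  id    = c(rep(1, 5), rep(2, 4)),
  sex   = c(rep("H", 5), rep("F", 4)),
  fruit = c("apple", "tomato", "apple", "apple", "orange",
            "apple", "apple", "tomato", "tomato")
)

# Every sex/fruit pair, built outside the join so the lookup table is explicit.
# `combos` is a hypothetical helper name; CJ() sorts its output by default.
combos <- CJ(sex = unique(dt$sex), fruit = unique(dt$fruit))

# by = .EACHI evaluates .N once per row of `combos`;
# pairs with no matching rows in dt get N = 0.
res <- dt[combos, .N, by = .EACHI, on = .(sex, fruit)]
print(res)
```

Because dt is joined on the fly, its existing key (id in the question) is left untouched.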


Aru*_*run 8

One way is to convert sex to a factor (id seems redundant here?):

dt[, sex := factor(sex)]
dt[, .(sex=levels(sex), N=c(table(sex))), by=fruit]
#     fruit sex N
# 1:  apple   F 2
# 2:  apple   H 3
# 3: tomato   F 2
# 4: tomato   H 1
# 5: orange   F 0
# 6: orange   H 1

Or you can convert fruit to a factor and group by sex instead:

dt[, fruit := factor(fruit)]
dt[, .(fruit = levels(fruit), N=c(table(fruit))),by=sex]
#    sex  fruit N
# 1:   H  apple 3
# 2:   H orange 1
# 3:   H tomato 1
# 4:   F  apple 2
# 5:   F orange 0
# 6:   F tomato 2

Edit:

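However, I suspect that if your data.table is huge, relying on table may not be a good idea. In that case, using CJ as in your previous question is probably the way to go. That is, aggregate first, and then join.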

out <- setkey(dt, sex, fruit)[, .N, 
             by="sex,fruit"][CJ(c("H","F"), 
             c("apple","tomato","orange")), 
             allow.cartesian=TRUE][is.na(N), N := 0L]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
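
The aggregate-then-join idea can also be written out step by step. The following is only a sketch under some assumptions: counts and all_pairs are illustrative names, the levels are typed out by hand as in the code above, and the join uses on= instead of a key.

```r
library(data.table)

dt <- data.table(
  sex   = c(rep("H", 5), rep("F", 4)),
  fruit = c("apple", "tomato", "apple", "apple", "orange",
            "apple", "apple", "tomato", "tomato")
)

# Step 1: aggregate. Only the pairs that actually occur show up here.
counts <- dt[, .N, by = .(sex, fruit)]

# Step 2: join onto every possible pair. Pairs absent from `counts`
# come back with N = NA.
all_pairs <- CJ(sex = c("H", "F"), fruit = c("apple", "tomato", "orange"))
out <- counts[all_pairs, on = .(sex, fruit)]

# Step 3: replace the missing counts with integer zero.
out[is.na(N), N := 0L]
print(out)
```

Aggregating first means the join only touches the small table of counts, not the full data.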