data.table没有两列正确汇总

Question

data.table没有两列正确汇总

我有这个问题,data.table最近让我发疯.它看起来像一个bug但可能是我在这里遗漏了一些明显的东西.

我有以下数据框:

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

Run Code Online (Sandbox Code Playgroud)

我想既要总结一下portal和month.总结好的旧tapply作品如预期 - 我得到3x2矩阵与2012年12月和2013年1月至2月的结果:

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

Run Code Online (Sandbox Code Playgroud)

用by参数总结data.table不是:

> data[, mean(satisfaction), by = 'month,portal']
   month      portal        V1
1: 2013-01-01  FALSE 10.000000
2: 2013-02-01   TRUE  9.000000
3: 2013-01-01   TRUE 10.000000
4: 2012-12-01  FALSE  8.500000
5: 2012-12-01   TRUE  7.333333
6: 2013-02-01   TRUE  9.666667
7: 2013-02-01  FALSE  9.000000
8: 2012-12-01   TRUE 10.000000

Run Code Online (Sandbox Code Playgroud)

如您所见,它返回一个包含8个值的数据表,而不是预期的6 ; 例如,值portal == TRUE和month == 2012-02-01重复的值.

有趣的是,如果我将这仅限于2013年的数据,一切都按预期工作:

> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
        month portal        V1
1: 2013-01-01   TRUE 10.000000
2: 2013-01-01  FALSE 10.000000
3: 2013-02-01   TRUE  9.545455
4: 2013-02-01  FALSE  9.000000

Run Code Online (Sandbox Code Playgroud)

我很困惑超越相信:).有人可以帮帮我吗？

Answer 1

Vic*_* K. 8

这是一个已知的问题,已在data.table 1.8.7中解决(截至本文撰写时尚未在CRAN中解决).

来自data.table 新闻:

BUG FIXES

    <...>

o   setkey could sort 'double' columns (such as POSIXct) incorrectly when not the
    last column of the key, #2484. In data.table's C code :
        x[a] > x[b]-tol
    should have been :
        x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
    The difference may have been machine/compiler dependent. Many thanks to statquant
    for the short reproducible example. Test added.

Run Code Online (Sandbox Code Playgroud)

更新到1.8.7后install.packages("data.table", repos="http://R-Forge.R-project.org"),一切都按预期工作.

Answer 2

Aru*_*run 5

问题似乎与排序有关.当我加载data并执行setkey:

setkey(data, "month", "portal")

# > data
#          month portal satisfaction
#  1: 2012-12-01   TRUE           10
#  2: 2012-12-01  FALSE            9
#  3: 2012-12-01  FALSE            8
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01   TRUE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01  FALSE           10
# 14: 2013-02-01   TRUE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01  FALSE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

Run Code Online (Sandbox Code Playgroud)

您看到portal列未正确排序.当我再做setkey一次

setkey(data, "month", "portal")

# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. 
  If you didn't go under the hood please let datatable-help know so 
  the root cause can be fixed.

Run Code Online (Sandbox Code Playgroud)

现在,data列似乎按键列正确排序:

# > data
#          month portal satisfaction
#  1: 2012-12-01  FALSE            9
#  2: 2012-12-01  FALSE            8
#  3: 2012-12-01   TRUE           10
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01  FALSE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01   TRUE           10
# 14: 2013-02-01  FALSE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01   TRUE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

Run Code Online (Sandbox Code Playgroud)

那么,对POSIXct + logical类型进行排序似乎是一个问题？

data[, mean(satisfaction), by=list(month, portal)]

#         month portal        V1
# 1: 2012-12-01  FALSE  8.500000
# 2: 2012-12-01   TRUE  8.000000
# 3: 2013-01-01  FALSE 10.000000
# 4: 2013-01-01   TRUE 10.000000
# 5: 2013-02-01  FALSE  9.000000
# 6: 2013-02-01   TRUE  9.545455

Run Code Online (Sandbox Code Playgroud)

因此,我认为有一个错误.

是的,我正要发表评论说,如果@MatthewDowle在一两天内没有考虑到这一点,那么OP应该可以在data.table邮件列表上发布这个问题. (2认同)

归档时间：	12 年，12 月前
查看次数：	295 次
最近记录：	12 年，12 月前