This problem with data.table has been driving me crazy lately. It looks like a bug, but I may be missing something obvious here.
I have the following data:
library(data.table)
library(lubridate)  # for ymd(), used further below

# First some data
data <- data.table(structure(list(
month = structure(c(1356998400, 1356998400, 1356998400,
1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400,
1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800,
1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800,
1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"),
portal = c(TRUE, TRUE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
),
satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L,
9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L,
10L, 10L)),
.Names = c("month", "portal", "satisfaction"),
row.names = c(NA, -25L), class = "data.frame"))
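I want to summarize by both portal and month. Summarizing with good old tapply works as expected: I get a 3x2 matrix with the results for December 2012 and January and February 2013: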
> tapply(data$satisfaction, list(data$month, data$portal), mean)
FALSE TRUE
2012-12-01 8.5 8.000000
2013-01-01 10.0 10.000000
2013-02-01 9.0 9.545455
Summarizing with data.table's by argument does not:
> data[, mean(satisfaction), by = 'month,portal']
month portal V1
1: 2013-01-01 FALSE 10.000000
2: 2013-02-01 TRUE 9.000000
3: 2013-01-01 TRUE 10.000000
4: 2012-12-01 FALSE 8.500000
5: 2012-12-01 TRUE 7.333333
6: 2013-02-01 TRUE 9.666667
7: 2013-02-01 FALSE 9.000000
8: 2012-12-01 TRUE 10.000000
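As you can see, it returns a data table with 8 rows instead of the expected 6; for example, the groups with portal == TRUE and month == 2012-12-01 or month == 2013-02-01 each appear twice.
Interestingly, if I restrict this to the 2013 data only, everything works as expected: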
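A quick way to confirm the repeated groups is to look for duplicated (month, portal) pairs in the result. The sketch below uses only base R and re-types the eight rows printed above into a plain data frame (the name `res` is made up for illustration), so it runs without data.table:

```r
# The eight rows returned by the grouped query above, re-typed by hand.
res <- data.frame(
  month  = c("2013-01-01", "2013-02-01", "2013-01-01", "2012-12-01",
             "2012-12-01", "2013-02-01", "2013-02-01", "2012-12-01"),
  portal = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  V1     = c(10, 9, 10, 8.5, 7.333333, 9.666667, 9, 10),
  stringsAsFactors = FALSE
)

# Build one key per row and list the keys that occur more than once.
key <- paste(res$month, res$portal)
dup_groups <- unique(key[duplicated(key)])
n_distinct_groups <- length(unique(key))

print(dup_groups)
print(n_distinct_groups)
```

A correct grouping would leave `dup_groups` empty and return exactly `n_distinct_groups` rows.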
> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
month portal V1
1: 2013-01-01 TRUE 10.000000
2: 2013-01-01 FALSE 10.000000
3: 2013-02-01 TRUE 9.545455
4: 2013-02-01 FALSE 9.000000
I am confused beyond belief :). Can someone help me out?
This is a known issue that has been fixed in data.table 1.8.7 (not yet on CRAN at the time of writing).
From the data.table NEWS:
BUG FIXES
<...>
    o   setkey could sort 'double' columns (such as POSIXct) incorrectly when
        not the last column of the key, #2484. In data.table's C code :
            x[a] > x[b]-tol
        should have been :
            x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
        The difference may have been machine/compiler dependent. Many thanks to
        statquant for the short reproducible example. Test added.
After updating to 1.8.7 with install.packages("data.table", repos="http://R-Forge.R-project.org"), everything works as expected.
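To see why the two comparisons quoted in the NEWS entry can disagree, here is a small sketch in plain R. The tolerance value is an assumption chosen for illustration (the NEWS entry does not state the one used in the C code); the timestamp is one of the month values from the question.

```r
# Assumed tolerance, for illustration only.
tol <- sqrt(.Machine$double.eps)   # about 1.5e-8

# Two identical timestamps (seconds since the epoch), as in the month column.
a <- 1356998400
b <- 1356998400

# Adjacent doubles near 1.36e9 are about 2.4e-7 apart, so subtracting
# tol from b rounds straight back to b when the result is stored as a double.
tol_is_lost <- (b - tol) == b

# Old form: with tol rounded away, equal values fail the test.
old_form <- a > b - tol

# Fixed form: the difference (exactly 0) is computed first, then compared.
new_form <- a - b > -tol

print(c(tol_is_lost = tol_is_lost, old_form = old_form, new_form = new_form))
```

One plausible reading of "machine/compiler dependent" is that compiled C code may keep `x[b]-tol` in a wider floating-point register on some platforms, where the tolerance is not rounded away, so the old form behaves correctly there and incorrectly elsewhere.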
The problem seems to be related to sorting. When I load data and run setkey:
setkey(data, "month", "portal")
# > data
# month portal satisfaction
# 1: 2012-12-01 TRUE 10
# 2: 2012-12-01 FALSE 9
# 3: 2012-12-01 FALSE 8
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 TRUE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 FALSE 10
# 14: 2013-02-01 TRUE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 FALSE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
You can see that the portal column is not sorted correctly within each month. When I run setkey once more:
setkey(data, "month", "portal")
# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
Already keyed by this key but had invalid row order, key rebuilt.
If you didn't go under the hood please let datatable-help know so
the root cause can be fixed.
Now data appears to be sorted correctly by the key columns:
# > data
# month portal satisfaction
# 1: 2012-12-01 FALSE 9
# 2: 2012-12-01 FALSE 8
# 3: 2012-12-01 TRUE 10
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 FALSE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 TRUE 10
# 14: 2013-02-01 FALSE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 TRUE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
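So there seems to be a problem with sorting the POSIXct + logical combination? With the key rebuilt, grouping returns the expected six rows: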
data[, mean(satisfaction), by=list(month, portal)]
# month portal V1
# 1: 2012-12-01 FALSE 8.500000
# 2: 2012-12-01 TRUE 8.000000
# 3: 2013-01-01 FALSE 10.000000
# 4: 2013-01-01 TRUE 10.000000
# 5: 2013-02-01 FALSE 9.000000
# 6: 2013-02-01 TRUE 9.545455
Therefore, I think there is a bug.
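The mis-sorted key also seems to account for the exact numbers in the question. Assuming that grouping on a keyed table treats each contiguous run of equal key values as one group (an assumption about the internals, not something confirmed here), the six 2012-12-01 rows from the first setkey output split into three runs instead of two groups. A base R sketch of that idea, with the values copied from that output:

```r
# portal and satisfaction for the six 2012-12-01 rows, in the (wrong)
# order produced by the first setkey call.
portal       <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
satisfaction <- c(10L, 9L, 8L, 2L, 10L, 10L)

# Label each contiguous run of equal portal values.
runs   <- rle(portal)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Means per run (what run-based grouping would report) ...
run_means <- tapply(satisfaction, run_id, mean)

# ... versus means per actual portal value (what tapply reports).
group_means <- tapply(satisfaction, portal, mean)

print(run_means)
print(group_means)
```

The run means come out as 10, 8.5 and 7.333333, which are the three 2012-12-01 rows in the eight-row result, while the per-value means are 8.5 and 8, matching the tapply matrix.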