How to sum a data.table grouped by values in R

Str*_*d01 4 r data.table

I have a data.frame built from an XML file, and now I want to count and sum its values, like count and sum in SQL.

This is what the data.frame looks like:

   msgDataSource msgFileSource processDate msgNumRows
1        source1       Quarter  2015-01-30         30
2        source1         Month  2015-01-30         15
3        source1         Month  2015-01-30         20
4        source1          Year  2015-01-30          1
5        source2       Quarter  2015-01-30         30
6        source3       Quarter  2015-01-30         15
7        source1          Year  2015-02-01         80
8        source2          Year  2015-02-01         90
9        source1       Quarter  2015-02-01          5
10       source2       Quarter  2015-03-15          9
11       source3       Quarter  2015-03-15         14

This is what I need:

   processDate msgFileSource msgDataSource sumDataSources   countDataSources
 1:  2015-01-30         Month       source1             35                 2
 2:  2015-01-30       Quarter       source1             30                 1
 3:  2015-01-30       Quarter       source2             30                 1
 4:  2015-01-30       Quarter       source3             15                 1
 5:  2015-01-30          Year       source1              1                 1
 6:  2015-02-01       Quarter       source1              5                 1
 7:  2015-02-01          Year       source1             80                 1
 8:  2015-02-01          Year       source2             90                 1
 9:  2015-03-15       Quarter       source2              9                 1
10:  2015-03-15       Quarter       source3             14                 1

This is what I have been able to get so far:

   processDate msgFileSource msgDataSource sumDataSources
 1:  2015-01-30         Month       source1             35
 2:  2015-01-30       Quarter       source1             30
 3:  2015-01-30       Quarter       source2             30
 4:  2015-01-30       Quarter       source3             15
 5:  2015-01-30          Year       source1              1
 6:  2015-02-01       Quarter       source1              5
 7:  2015-02-01          Year       source1             80
 8:  2015-02-01          Year       source2             90
 9:  2015-03-15       Quarter       source2              9
10:  2015-03-15       Quarter       source3             14

Here is my code:

library(data.table)  # needed for data.table() and setnames()

dfFullData <- data.frame(
    msgDataSource = c("source1", "source1", "source1", "source1", "source2", "source3", "source1", "source2", "source1", "source2", "source3"),
    msgFileSource = c("Quarter", "Month", "Month", "Year", "Quarter", "Quarter", "Year", "Year", "Quarter", "Quarter", "Quarter"),
    processDate = c("2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-02-01", "2015-02-01", "2015-02-01", "2015-03-15", "2015-03-15"),
    msgNumRows = c(30, 15, 20, 1, 30, 15, 80, 90, 5, 9, 14),
    stringsAsFactors=FALSE
)
summaryTable <- data.table(dfFullData)
summaryTable <- summaryTable[
                        order(processDate, msgFileSource, msgDataSource),
                        sum(msgNumRows),
                        by=list(processDate, msgFileSource, msgDataSource) 
]
setnames(summaryTable, "V1", "sumDataSources")
print(summaryTable)

Is there a way to compute the count in the same pass, or should I compute it separately and then do a cbind?

How can I achieve what I need?

Thanks.

Mat*_*eck 7

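Use `list` to specify the summary columns you want in the data.table aggregation. Use the built-in symbol `.N` to get the number of rows in each subset: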

summaryTable <- summaryTable[
                        order(processDate, msgFileSource, msgDataSource),
                        list(sumDataSources=sum(msgNumRows), 
                             countDataSources=.N),
                        by=list(processDate, msgFileSource, msgDataSource) ]

Using `list` this way also means you don't need to call `setnames` afterwards, because you have already named the columns inside the `list`.
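As a minimal, self-contained sketch of the same idea, the snippet below runs in a fresh session. It uses a hypothetical three-row subset of the question's data (the table name `dt` is an assumption, not part of the original code) and computes the sum and the count in a single pass, so no separate `cbind` step is needed.

```r
library(data.table)

# Hypothetical toy data: three of the rows from the question
dt <- data.table(
  msgDataSource = c("source1", "source1", "source2"),
  msgFileSource = c("Month", "Month", "Quarter"),
  processDate   = c("2015-01-30", "2015-01-30", "2015-01-30"),
  msgNumRows    = c(15, 20, 30)
)

# One pass: each named element of list() becomes a column of the result,
# and .N is the number of rows in the current group
res <- dt[, list(sumDataSources   = sum(msgNumRows),
                 countDataSources = .N),
          by = list(processDate, msgFileSource, msgDataSource)]

print(res)
```

The two `Month` rows for `source1` collapse into one group with a sum of 35 and a count of 2, matching the first row of the desired output in the question.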


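This is unrelated to the actual question, but as detailed in the comments below this answer, the extra `order` call in the command above can be avoided by using `keyby` instead of `by`. The final command looks like this: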

summaryTable <- summaryTable[, list(sumDataSources=sum(msgNumRows), 
                                    countDataSources=.N),
                        keyby=list(processDate, msgFileSource, msgDataSource) ]

`keyby` has the added benefit of setting its arguments as the key of the resulting table; the ordering is a by-product of this process.
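The difference between `by` and `keyby` can be seen on a small example. This is only a sketch on hypothetical data (the names `dt`, `grp`, `val`, `byRes` and `keybyRes` are assumptions made for illustration): with `by`, groups appear in order of first appearance and the result has no key, while with `keyby` the result is sorted by the grouping column and keyed on it.

```r
library(data.table)

# Hypothetical toy data with groups deliberately out of order
dt <- data.table(
  grp = c("b", "a", "b", "a"),
  val = c(1, 2, 3, 4)
)

# by: groups are returned in order of first appearance, no key is set
byRes <- dt[, list(total = sum(val), n = .N), by = grp]

# keyby: same aggregation, but the result is sorted by grp and keyed on it
keybyRes <- dt[, list(total = sum(val), n = .N), keyby = grp]

print(byRes)
print(key(byRes))     # NULL, since no key was set
print(keybyRes)
print(key(keybyRes))  # the grouping column
```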

  • Nice. Any reason for the `order()` here? And `length(.)` is just `.N`, a special built-in symbol. (2 upvotes)
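  • Mat*_*eck, right. The operation doesn't actually depend on the order. So you can use `keyby` instead of `by` rather than calling `order()`. `keyby` sorts the data by the grouping columns *after the aggregation*, which is more efficient because the sort runs on the aggregated data. See [these new HTML vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started) for more information. (2 upvotes)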