Sum by distinct column values in R

use*_*199 10 r sum unique data.table

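I have a very large data frame in R and would like to sum two columns for each distinct value in another column. For example, suppose we have a data frame of one day's transactions across various shops, as follows: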

shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3), 
  'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'), 
  'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'), 
  'sale' = c(12, 5, 9, 15, 10, 18), 
  'profit' = c(3, 1, 3, 6, 5, 9))

That is:

shop_id  shop_name    city      sale profit
   1     Shop A       London    12   3
   1     Shop A       London    5    1
   1     Shop A       London    9    3
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    10   5
   3     Shop C       Dublin    18   9

And I want to sum the sale and profit for each shop:

shop_id  shop_name    city      sale profit
   1     Shop A       London    26   7
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    28   14

I am currently doing this with the following code:

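library(plyr)  # provides ddply()

# Overwrite sale and profit with the per-shop totals on every row...
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
# ...then keep only the first row for each shop
shop_day <- subset(shop_day, !duplicated(shop_id))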

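This works absolutely fine, but as I said, my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows that I want to sum), and my code takes a very long time to run before eventually saying it has run out of memory.

Does anyone know the most efficient way to do this?

Thanks in advance!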

Jus*_*tin 15

**Obligatory data.table answer**

> library(data.table)
data.table 1.8.0  For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
     shop_id sale profit
[1,]       1   26      7
[2,]       2   15      6
[3,]       3   28     14
> 

That sounds fine until things get bigger:

shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)

> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
   user  system elapsed 
  4.156   1.324   5.514 
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.728   0.108   0.840 
> 

If you create the data.table with a key, you get an extra speed boost:

shop.dt <- data.table(shop, key='shop_id')

> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.252   0.084   0.336 
> 
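The output above keeps only shop_id, while the result asked for in the question also carries shop_name and city. A minimal sketch of one way to get that shape, assuming shop_name and city are constant within each shop_id (as they are in the sample data): group by all three columns and sum the rest through .SD. The name shop_day is borrowed from the question, and setkey() is used here in place of the key= argument to key an existing table.

```r
library(data.table)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

shop.dt <- as.data.table(shop)

# Key the existing table by shop_id (sorts it by reference)
setkey(shop.dt, shop_id)

# Group by the identifying columns and sum only the numeric ones.
# Assumption: shop_name and city never vary within a shop_id.
shop_day <- shop.dt[, lapply(.SD, sum),
                    by = c("shop_id", "shop_name", "city"),
                    .SDcols = c("sale", "profit")]

print(shop_day)
```

With 37 columns, .SDcols can be given the full vector of column names to total, so the sums do not have to be written out one by one.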


use*_*199 6

I think the best way to do this is with dplyr:

library(dplyr)
shop %>% 
  group_by(shop_id, shop_name, city) %>% 
  summarise_all(sum)
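Note that summarise_all(sum) sums every non-grouping column, so on a wide data frame it would fail on any remaining non-numeric column. A hedged variant, assuming dplyr 1.0.0 or later (where across() is available), names the columns to sum explicitly; shop_day is again just the name used in the question.

```r
library(dplyr)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

# Sum only sale and profit within each shop; .groups = "drop"
# returns an ungrouped tibble (requires dplyr >= 1.0.0).
shop_day <- shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise(across(c(sale, profit), sum), .groups = "drop")

print(shop_day)
```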