How to efficiently aggregate sparse data

smc*_*mci 2 aggregate r sparse-matrix dataframe

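I have a large dataset with 1008412 observations. The columns are customer_id (int), visit_date (Date, format: "2010-04-04"), and visit_spend (float).

This date function, used for the aggregation, maps dates to week numbers in the range 13-65: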

weekofperiod <- function(dt) {
    as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}
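As a quick sanity check of that mapping, the sketch below applies weekofperiod() to the first and last dates of the period and to the year boundary. The names boundary_dates, boundary_weeks and all_weeks are illustrative only; the period 2010-04-01 to 2011-03-31 is assumed from the cutoff date used later in the question.

```r
weekofperiod <- function(dt) {
    as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

# Start of period, both sides of the year boundary, end of period
boundary_dates <- c("2010-04-01", "2010-12-31", "2011-01-01", "2011-03-31")
boundary_weeks <- weekofperiod(boundary_dates)
print(boundary_weeks)   # 13 52 52 65

# Every distinct week number that occurs in the assumed period
all_weeks <- sort(unique(weekofperiod(
    seq(as.Date("2010-04-01"), as.Date("2011-03-31"), by = "day"))))
print(length(all_weeks))   # 53
```

Note that 2010-12-31 and 2011-01-01 land in the same week (52), so the year change does not open a gap or create a duplicate week in the 13-65 range.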

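Each customer_id has a variable number of total visits over the 53-week period. For each customer_id, I want the sum of visit_spend per week (spend_per_week), grouped by weekofperiod(). The code below is functionally correct but very slow. Any suggestions for making it faster? Also, aggregate() produces sparse output in which weeks without visits are missing, so I initialize spend_per_week to 0 and then manually assign the nonzero results from aggregate() row by row, to make sure the result always has 53 rows. Surely this can be done better?

Sample rows of the dataset look like this: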

   customer_id visit_date visit_spend 
72          40 2011-03-15       18.38 
73          40 2011-03-20       23.45  
74          79 2010-04-07      150.87 
75          79 2010-04-17      101.90 
76          79 2010-05-02      111.90 

Here is the code for the aggregate call and the adjustment for empty weeks:

for (cid in all_tt_cids) {
  print_pnq('Getting statistics for cid', cid)

  # Get row indices of the selected subset, for just this cid's records
  I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")

  # (other code to compute other per-cid statistics)

  # spend_per_week (mode;mean;sd)
  # Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
  spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)) )
  nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
  for (i in seq_len(nrow(nonzero_spends_per_week))) {
    spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
  }
  colnames(spend_per_week)[2] <- 'spend_per_week'

  # (code to compute and store per-cid statistics on spend_per_week)

}

Joh*_*lby 6

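Your biggest speedup will come from replacing the for loops. I can't quite tell from your example what you need, because you overwrite the result for each customer inside the loop, but if you want to keep the information for all customers, here is one way to do it.

For testing, first define functions for the original method and for a new method without loops: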

weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}

FastMethod <- function(tt) {  
  tt$week = weekofperiod(tt$visit_date)
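  # Sum spend per (customer_id, week); columns 1 and 4 are customer_id and week
  spend_per_week.tmp = as.data.frame(tapply(tt$visit_spend, tt[,c(1,4)], sum))
  # tapply() returns NA for customer/week pairs with no visits; store those as 0
  spend_per_week.tmp[is.na(spend_per_week.tmp)] = 0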
  spend_per_week = data.frame(matrix(0, nrow=nrow(spend_per_week.tmp), ncol=length(13:65)))
  colnames(spend_per_week) = 13:65
  rownames(spend_per_week) = rownames(spend_per_week.tmp)
  spend_per_week[, colnames(spend_per_week.tmp)] = spend_per_week.tmp
  spend_per_week
}

OrigMethod <- function(tt) {
  all_tt_cids = unique(tt$customer_id)

  for (cid in all_tt_cids) {
    # Get row indices of the selected subset, for just this cid's records
    I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")

    # Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
    spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
    nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
    for (i in seq_len(nrow(nonzero_spends_per_week))) {
      spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
    }
    colnames(spend_per_week)[2] <- 'spend_per_week'
  }
  spend_per_week
}

Now simulate a larger dataset so that the two methods are easier to compare:

n.row  = 10^4
n.cust = 10^3

customer_id = 1:n.cust  # recycled by data.frame() below to fill all n.row rows
dates = seq(as.Date('2010-04-01'), as.Date('2011-03-31'), by=1)
visit_date = sample(dates, n.row, replace=TRUE)
visit_spend = runif(n.row, 0, 200)

tt = data.frame(customer_id, visit_date, visit_spend)

Finally, compare the two methods:

> system.time(FastMethod(tt))
   user  system elapsed 
  0.082   0.001   0.083 
> system.time(OrigMethod(tt))

   user  system elapsed 
  4.505   0.007   4.514 

That is already roughly 50 times faster (0.083 s versus 4.514 s elapsed), and I bet you can do even better with more optimization. Good luck!
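One possible further simplification, offered as a sketch rather than as part of the benchmark above: if the week is stored as a factor whose levels are fixed to 13:65, base R's xtabs() returns a customer-by-week table that already has all 53 columns and holds 0 for weeks without visits, so no separate zero-filling step is needed. The toy data frame below reuses the sample rows from the question; how this scales to the full 1008412-row dataset has not been measured here.

```r
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

# Toy data: the five sample rows shown in the question
tt <- data.frame(
  customer_id = c(40, 40, 79, 79, 79),
  visit_date  = as.Date(c("2011-03-15", "2011-03-20", "2010-04-07",
                          "2010-04-17", "2010-05-02")),
  visit_spend = c(18.38, 23.45, 150.87, 101.90, 111.90)
)

# Fixing the levels to 13:65 forces one column per week, visited or not
tt$week <- factor(weekofperiod(tt$visit_date), levels = 13:65)

# xtabs() sums visit_spend within each (customer_id, week) cell;
# cells with no visits are 0 rather than NA or missing
spend_per_week <- xtabs(visit_spend ~ customer_id + week, data = tt)

dim(spend_per_week)          # 2 customers by 53 weeks
spend_per_week["40", "63"]   # both visits of customer 40 fall in week 63
```

Per-customer statistics can then be computed row-wise, for example with apply(spend_per_week, 1, mean) or apply(spend_per_week, 1, sd).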