Combining datasets by date range and a categorical variable

heo*_*heo 4 performance for-loop r dplyr

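Suppose I have two datasets. One contains a list of promotions with start and end dates, and the other contains monthly sales data for each program.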

promotions = data.frame(
    start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")), 
    end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")), 
    program = c("a", "a", "a", "b", "b"))

sales = data.frame(
    year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")), 
    program = c("a", "b", "a", "a", "b"), 
    monthly.sales = c(200, 200, 200, 400, 200))

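Note that sales$year.month.day is used to represent a year/month. The day is included only so that R can treat the column as a vector of Date objects; it has nothing to do with when the sales actually occurred.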
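The year/month convention described in the note can be sketched as follows. This is only an illustration: the vector names `year.month` and `month.dates` are hypothetical and do not appear in the question.

```r
# Hypothetical year-month labels, not part of the original data
year.month <- c("2013-02", "2014-09")

# Appending a placeholder day ("-01") lets as.Date() parse each label as a Date
month.dates <- as.Date(paste0(year.month, "-01"))

# Formatting back drops the placeholder day again
format(month.dates, "%Y-%m")
```

Storing months this way keeps the column comparable with the promotion start and end dates, which is what the counting step relies on.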

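I need to determine the number of promotions that were running for each program in each month. Here is an example loop that produces the output I want: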

sales$count = rep(0, nrow(sales))
sub = list()
for (i in seq_len(nrow(sales))) {
  # promotions that belong to the same program as this sales row
  sub[[i]] = promotions[which(promotions$program == sales$program[i]),]
  # seq_len() copes with zero matches and still counts a program with a single promotion
  for (j in seq_len(nrow(sub[[i]]))) {
    if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
      sales$count[i] = sales$count[i] + 1
    }
  }
}

Example output:

 sales = data.frame(
    year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")), 
    program = c("a", "b", "a", "a", "b"), 
    monthly.sales = c(200, 200, 200, 400, 200),
    count = c(3, 1, 3, 3, 2)
)

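But because my actual dataset is very large, this loop crashes when I run it in R.

Is there a more efficient way to achieve the same result? Perhaps something with dplyr?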

Chi*_*oli 5

You can do this with SQL.

library(sqldf)
# column names here have the periods removed (ymd, monthlysales, startdate, enddate)
sqldf("select s.ymd, s.program, s.monthlysales, count(*) as n_promotions
       from promotions p left outer join sales s on p.program = s.program
       where s.ymd between p.startdate and p.enddate
       group by s.ymd, s.program")

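This first joins the two datasets, keeping the rows where ymd in sales falls between a promotion's start and end dates and the program is the same in both. It then groups by ymd and program and counts the matches. I have removed the periods from the variable names.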
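The join, filter, and count steps described above can also be sketched in base R, using the question's original column names. This is a minimal illustration rather than part of the answer; the helper column `row.id` and the vector `active` are assumptions introduced here.

```r
# Example data copied from the question
promotions <- data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date   = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program    = c("a", "a", "a", "b", "b"))

sales <- data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program        = c("a", "b", "a", "a", "b"),
  monthly.sales  = c(200, 200, 200, 400, 200))

# Hypothetical helper: remember which sales row each joined row came from
sales$row.id <- seq_len(nrow(sales))

# Join: pair every sales row with every promotion of the same program
joined <- merge(sales, promotions, by = "program")

# Filter: TRUE where the sales date lies inside the promotion window
# (both ends inclusive, like SQL BETWEEN)
active <- joined$year.month.day >= joined$start.date &
          joined$year.month.day <= joined$end.date

# Count: number of active promotions per original sales row
sales$count <- tabulate(joined$row.id[active], nbins = nrow(sales))
sales$row.id <- NULL

sales$count
# 3 1 3 3 2
```

Unlike the SQL query, this keeps sales rows that match no promotion and gives them a count of 0. The intermediate `joined` table holds one row per sales/promotion pair within a program, so it may be too large for very big inputs.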


Aru*_*run 5

Using the non-equi joins newly implemented in the current development version of data.table:

require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)

ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE, 
        on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]

ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
#    program year.month.date monthly.sales N
# 1:       a      2013-02-01           200 3
# 2:       b      2014-09-01           200 1
# 3:       a      2013-08-01           200 3
# 4:       a      2013-04-01           400 3
# 5:       b      2012-11-01           200 2

See the installation instructions for the development version here.