Merge data.frames by year and fill in missing values

R.M*_*.M. 0 r dataframe

I have two data.frames that I want to merge. The first is:

datess <- seq(as.Date('2005-01-01'), as.Date('2009-12-31'), 'days')
sample<- data.frame(matrix(ncol = 3, nrow = length(datess)))
colnames(sample) <- c('Date', 'y', 'Z')
sample$Date <- datess

The second:

a <- data.frame(matrix(ncol = 3, nrow = 5))
colnames(a) <- c('a', 'y', 'Z')
a$Z <- c(1, 3, 4, 5, 2)
a$a <- c(2005, 2006, 2007, 2008, 2009)
a$y <- c('abc', 'def', 'ijk', 'xyz', 'thanks')

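I want to merge them so that the year of each Date is matched against the year in a, and the matching values of y and Z are filled in for every day of that year: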

Date          y      Z
2005-01-01   abc     1
2005-01-02   abc     1 
2005-01-03   abc     1
{cont}
2009-12-31   thanks  2

Uwe*_*Uwe 5

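So far, three different approaches have been posted: using match(), using dplyr, and using merge().

Frank suggested a fourth approach in chat, called an update join: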

library(data.table)
setDT(sample)[, yr := year(Date)][setDT(a), on = .(yr = a), `:=`(y = i.y, Z = i.Z)]

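This turns out to be the fastest and the most concise of the four.

Benchmark results:

To find out which approach is the most efficient in terms of speed, I set up a benchmark using the microbenchmark package.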
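To make the mechanics of the update join easier to follow, here is a minimal sketch on a three-row toy table. The names days and lookup are hypothetical stand-ins for sample and a, and the sketch assumes the data.table package is installed. Inside the join, the i. prefix refers to columns of the table passed in the i position (lookup here), and := assigns into days by reference for the matching rows.

```r
library(data.table)

# Hypothetical toy data: three days spanning a year boundary
days <- data.table(Date = as.Date(c("2005-12-30", "2005-12-31", "2006-01-01")))
# Hypothetical lookup table, one row per year (same column names as a)
lookup <- data.table(a = 2005:2006, y = c("abc", "def"), Z = c(1, 3))

# Step 1: derive the join key from the date
days[, yr := year(Date)]

# Step 2: update join; for every row of days whose yr equals lookup$a,
# copy y and Z over from lookup (the i. prefix marks lookup's columns)
days[lookup, on = .(yr = a), `:=`(y = i.y, Z = i.Z)]

# Step 3 (optional): drop the helper column again
days[, yr := NULL]

print(days)
```

Because the assignment happens by reference, days itself is changed and nothing needs to be assigned back with <-.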

Unit: microseconds
        expr      min       lq     mean    median       uq      max neval
 create_data  248.827  291.116  316.240  302.0655  323.588  665.298   100
       match 4488.685 4545.701 4752.226 4649.5355 4810.763 6881.418   100
       dplyr 6086.609 6275.588 6513.997 6385.2760 6625.229 8535.979   100
       merge 2871.883 2942.490 3183.712 3004.6025 3168.096 5616.898   100
 update_join 1484.272 1545.063 1710.651 1659.8480 1733.476 3434.102   100

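Because sample is modified, it is re-created before each benchmark run. This is done by a function which is itself included in the benchmark (create_data). The time for create_data has to be subtracted from the other timings.

So, even for a small data set of about 1800 rows, update join is the fastest, nearly twice as fast as the runner-up merge, followed by match. dplyr comes last, more than 4 times slower than update join (with the time for create_data subtracted).

Benchmark code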
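The subtraction can be reproduced from the median column of the table above. The following is only a small sketch in base R; the vector names med, net and ratio are illustrative and not part of the original benchmark.

```r
# Median timings in microseconds, copied from the benchmark table
med <- c(create_data = 302.0655,
         match       = 4649.5355,
         dplyr       = 6385.2760,
         merge       = 3004.6025,
         update_join = 1659.8480)

# Net time per method: subtract the cost of re-creating sample
net <- med[-1] - med[["create_data"]]

# Net time of each method relative to update_join, fastest first
ratio <- sort(net / net[["update_join"]])
print(round(ratio, 2))
```

On these medians, merge comes out just under 2 times the net time of update_join, match a little over 3 times, and dplyr roughly 4.5 times.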

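# load data.table first: setDT() and year() below come from it
library(data.table)
library(magrittr)

datess <- seq(as.Date('2005-01-01'), as.Date('2009-12-31'), 'days')
a <- data.frame(Z = c(1, 3, 4, 5, 2),
                a = 2005:2009,
                y = c('abc', 'def', 'ijk', 'xyz', 'thanks'),
                stringsAsFactors = FALSE)
setDT(a)  # convert the lookup table to a data.table by reference
# sample is modified during a run, so every run starts from a fresh copy
make_sample <- function() data.frame(Date = datess, y = NA_character_, Z = NA_real_)
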
microbenchmark::microbenchmark(
  create_data = make_sample(),
  match = {
    sample <- make_sample()
    matched<-match(format(sample$Date,"%Y"),a$a)
    sample$y<-a$y[matched]
    sample$Z<-a$Z[matched]
  },
  dplyr = {
    sample <- make_sample()
    sample <- sample %>% 
      dplyr::mutate(a = format(Date, "%Y") %>% as.numeric) %>% 
      dplyr::inner_join(a %>% dplyr::select(a), by = "a") 
  },
  merge = {
    sample <- make_sample()
    sample2 <- data.frame(Date = datess)
    sample2$a <- lubridate::year(sample2$Date)
    sample <- base::merge(sample2, a, by="a")
  },
  update_join = {
    sample <- make_sample()
    setDT(sample)[, yr := year(Date)][a, on = .(yr = a), `:=`(y = i.y, Z = i.Z)]
  }
)
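Timings are only comparable if the methods produce the same result, so it can be worth checking that separately. The following is a sketch of such a check in base R for the match and merge approaches only. The names by_match and by_merge are hypothetical, and the check is not part of the original benchmark.

```r
# Same inputs as in the benchmark, built with base R only
datess <- seq(as.Date("2005-01-01"), as.Date("2009-12-31"), by = "day")
a <- data.frame(a = 2005:2009,
                y = c("abc", "def", "ijk", "xyz", "thanks"),
                Z = c(1, 3, 4, 5, 2),
                stringsAsFactors = FALSE)

# Year of every date, used as the lookup key
yr <- as.integer(format(datess, "%Y"))

# match approach: look up the row of a for every date
idx <- match(yr, a$a)
by_match <- data.frame(Date = datess, y = a$y[idx], Z = a$Z[idx],
                       stringsAsFactors = FALSE)

# merge approach: join on the year column, then restore date order
by_merge <- merge(data.frame(Date = datess, a = yr), a, by = "a")
by_merge <- by_merge[order(by_merge$Date), c("Date", "y", "Z")]
rownames(by_merge) <- NULL

# Compare the two results column by column
print(all.equal(by_match, by_merge))
```

The same pattern could be extended to the other approaches by converting their results to a plain data.frame with the columns in the same order.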