Consider a data frame of the form
       idnum      start        end
1993.1    17 1993-01-01 1993-12-31
1993.2    17 1993-01-01 1993-12-31
1993.3    17 1993-01-01 1993-12-31
where start and end are of type Date:
 $ idnum : int  17 17 17 17 27 27
 $ start : Date, format: "1993-01-01" "1993-01-01" "1993-01-01" "1993-01-01" ...
 $ end   : Date, format: "1993-12-31" "1993-12-31" "1993-12-31" "1993-12-31" ...
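I would like to create a new data frame that holds monthly observations instead: for every row, one observation for each month between start and end (boundaries included).
Desired output: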
idnum       month
   17  1993-01-01
   17  1993-02-01
   17  1993-03-01
...
   17  1993-11-01
   17  1993-12-01
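I'm not sure what format month should have. At some point I will want to group by idnum and month to run regressions on the rest of the data set.
So far, for a single row, seq(from=test[1,'start'], to=test[1, 'end'], by='1 month') gives me the right sequence, but as soon as I try to apply that to the whole data frame, it fails: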
> foo <- apply(test, 1, function(x) seq(x['start'], to=x['end'], by='1 month'))
Error in to - from : non-numeric argument to binary operator
Aru*_*run 25
Using data.table:
require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
setDT converts df to a data.table by reference. Then, grouping by row (by = 1:nrow(df)), we create idnum and month as required.
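To try the row-wise grouping described above end to end, here is a minimal self-contained sketch. The sample data and the helper column name row are assumptions made for illustration; only the data.table calls themselves come from the answer.

```r
library(data.table)

# Hypothetical sample data mirroring the question: three rows for idnum 17
df <- data.frame(
  idnum = c(17L, 17L, 17L),
  start = as.Date(rep("1993-01-01", 3)),
  end   = as.Date(rep("1993-12-31", 3))
)

dt <- as.data.table(df)

# Group by row index so that seq() sees exactly one start and one end per group;
# the length-1 idnum is recycled to the length of the monthly sequence
res <- dt[, .(idnum = idnum, month = seq(start, end, by = "month")),
          by = .(row = seq_len(nrow(dt)))]

res[, row := NULL]  # drop the helper grouping column
```

Each input row expands to 12 monthly dates here, and month stays a Date column, so it can be used directly as a grouping key later.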
jub*_*uba 16
Using dplyr:
test %>%
    group_by(idnum) %>%
    summarize(start=min(start),end=max(end)) %>%
    # summarize() returns one ungrouped row per idnum, so process row by row
    rowwise() %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))
Note that here I don't generate the sequence between start and end for each row; instead it is the sequence between min(start) and max(end) for each idnum. If you want the former:
test %>%
    rowwise() %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))
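The difference between the per-idnum and per-row variants is easiest to see on overlapping ranges. The following is a small sketch on hypothetical data. It assumes dplyr 1.1.0 or later and uses reframe() in place of do(), so it is not the answer's exact code.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for reframe()

# Hypothetical data: two overlapping ranges for the same idnum
test <- data.frame(
  idnum = c(17L, 17L),
  start = as.Date(c("1993-01-01", "1993-02-01")),
  end   = as.Date(c("1993-03-31", "1993-04-30"))
)

# One sequence per idnum, from the earliest start to the latest end
per_id <- test %>%
  group_by(idnum) %>%
  reframe(month = seq(min(start), max(end), by = "month"))

# One sequence per input row; overlapping months appear more than once
per_row <- test %>%
  rowwise() %>%
  reframe(idnum, month = seq(start, end, by = "month"))
```

With this input, per_id holds each month once, while per_row repeats the months that both ranges cover. Which one is wanted depends on whether each row is a separate observation or just part of one idnum's overall time span.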
One option for creating a sequence per row using dplyr and tidyr could be:
df %>%
 rowwise() %>%
 transmute(idnum,
           date = list(seq(start, end, by = "month"))) %>%
 unnest(date)

  idnum date      
   <int> <date>    
 1    17 1993-01-01
 2    17 1993-02-01
 3    17 1993-03-01
 4    17 1993-04-01
 5    17 1993-05-01
 6    17 1993-06-01
 7    17 1993-07-01
 8    17 1993-08-01
 9    17 1993-09-01
10    17 1993-10-01
# … with 26 more rows
Or creating the sequence using a grouping ID:

df %>%
 group_by(idnum) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)
Or when the goal is to create only one unique sequence per ID:

df %>%
 group_by(idnum) %>%
 summarise(start = min(start),
           end = max(end)) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)

   date      
   <date>    
 1 1993-01-01
 2 1993-02-01
 3 1993-03-01
 4 1993-04-01
 5 1993-05-01
 6 1993-06-01
 7 1993-07-01
 8 1993-08-01
 9 1993-09-01
10 1993-10-01
11 1993-11-01
12 1993-12-01
Or using reframe(), available since dplyr 1.1.0:
df %>%
 rowwise() %>%
 reframe(idnum,
           date = seq(start, end, by = "month"))
With newer versions of purrr (0.3.0) and dplyr (0.8.0), this can be done with map2:
library(dplyr)
library(purrr)
library(tidyr)  # provides unnest()
 test %>%
     # sequence of monthly dates for each corresponding start, end elements
     transmute(idnum, month = map2(start, end, seq, by = "1 month")) %>%
     # unnest the list column
     unnest(month) %>%
     # remove any duplicate rows
     distinct()
Based on @Ananda Mahto's comment:
 library(reshape2)  # melt() is assumed to come from reshape2 here
 res1 <- melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"],
 test[x, "end"], by = "1 month")), test$idnum))
Also,
  res2 <- setNames(do.call(`rbind`,
          with(test, 
          Map(`expand.grid`,idnum,
          Map(`seq`, start, end, by='1 month')))), c("idnum", "month"))
  head(res2)
 #  idnum      month
 #1    17 1993-01-01
 #2    17 1993-02-01
 #3    17 1993-03-01
 #4    17 1993-04-01
 #5    17 1993-05-01
 #6    17 1993-06-01
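As background to the Map() approach above: the apply() call in the question fails because apply() first converts the data frame with as.matrix(). With a mix of integer and Date columns, that gives a character matrix, so seq() receives strings rather than Date objects. Below is a minimal base R sketch on hypothetical data that iterates over the columns in parallel instead. The helper names pieces and res are illustrative.

```r
# Hypothetical sample data with two ids
test <- data.frame(
  idnum = c(17L, 17L, 27L),
  start = as.Date(c("1993-01-01", "1993-01-01", "1993-03-01")),
  end   = as.Date(c("1993-12-31", "1993-12-31", "1993-05-31"))
)

# apply() operates on as.matrix(test), which is a character matrix here
is.character(as.matrix(test))

# Map() walks the three columns in parallel and keeps the Date class,
# building one small data frame per input row
pieces <- Map(function(id, s, e) {
  data.frame(idnum = id, month = seq(s, e, by = "month"))
}, test$idnum, test$start, test$end)

# Stack the per-row pieces into a single data frame
res <- do.call(rbind, pieces)
```

For large inputs, do.call(rbind, ...) over many small data frames can be slow. The data.table and dplyr variants in the other answers are likely a better fit in that case.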
Another tidyverse approach is to use tidyr::expand:
library(dplyr, warn = FALSE)
library(tidyr)

df |> 
  mutate(
    row = row_number()
  ) |> 
  group_by(row) |> 
  expand(idnum, date = seq(start, end, "month")) |> 
  ungroup() |> 
  select(-row)
#> # A tibble: 36 × 2
#>    idnum date      
#>    <int> <date>    
#>  1    17 1993-01-01
#>  2    17 1993-02-01
#>  3    17 1993-03-01
#>  4    17 1993-04-01
#>  5    17 1993-05-01
#>  6    17 1993-06-01
#>  7    17 1993-07-01
#>  8    17 1993-08-01
#>  9    17 1993-09-01
#> 10    17 1993-10-01
#> # … with 26 more rows