Consider a data frame of the form
idnum start end
1993.1 17 1993-01-01 1993-12-31
1993.2 17 1993-01-01 1993-12-31
1993.3 17 1993-01-01 1993-12-31
where start and end are of type Date:
$ idnum : int 17 17 17 17 27 27
$ start : Date, format: "1993-01-01" "1993-01-01" "1993-01-01" "1993-01-01" ...
$ end : Date, format: "1993-12-31" "1993-12-31" "1993-12-31" "1993-12-31" ...
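I would like to create a new data frame with monthly observations instead: for each row, one observation for every month between start and end (boundaries included):
Desired output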
idnum month
17 1993-01-01
17 1993-02-01
17 1993-03-01
...
17 1993-11-01
17 1993-12-01
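I'm not sure what format month should take; at some point I will want to group by idnum and month in order to run regressions on the rest of the dataset.
So far, for a single row, seq(from=test[1,'start'], to=test[1, 'end'], by='1 month') gives the correct sequence, but as soon as I try to apply it to the whole data frame, it fails: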
> foo <- apply(test, 1, function(x) seq(x['start'], to=x['end'], by='1 month'))
Error in to - from : non-numeric argument to binary operator
Aru*_*run 25
Using data.table:
require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
setDT converts df to a data.table. Then, for each row (by = 1:nrow(df)), we create idnum and month as required.
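To make the row-wise grouping concrete, here is a minimal self-contained sketch. The two-row data frame is an assumption made up for illustration (the question only shows rows for idnum 17), and the final column selection is just one way to drop the grouping column that data.table adds.

```r
library(data.table)

# Hypothetical sample data (assumed; not the asker's full data set)
df <- data.frame(
  idnum = c(17L, 27L),
  start = as.Date(c("1993-01-01", "1993-03-01")),
  end   = as.Date(c("1993-12-31", "1993-05-31"))
)

# One group per row, so seq() always receives a single start and a single end
res <- setDT(df)[, .(idnum = idnum, month = seq(start, end, by = "month")),
                 by = 1:nrow(df)]

# Keep only the two columns from the desired output (drops the grouping column)
res <- res[, .(idnum, month)]
```

With this assumed input, idnum 17 contributes twelve months and idnum 27 contributes three.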
jub*_*uba 16
Using dplyr:
test %>%
group_by(idnum) %>%
summarize(start=min(start),end=max(end)) %>%
do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))
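Note that here I do not generate the sequence between start and end for each row; instead, I generate the sequence between min(start) and max(end) for each idnum. If you want the former: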
test %>%
rowwise() %>%
do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))
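The difference between the two variants only shows up when one idnum has several rows with different date ranges. The base R sketch below uses a made-up data frame (an assumption, not the asker's data) in which idnum 17 has two separate spells with a gap between them.

```r
# Hypothetical data: one id with two separate spells (assumed for illustration)
test <- data.frame(
  idnum = c(17L, 17L),
  start = as.Date(c("1993-01-01", "1993-06-01")),
  end   = as.Date(c("1993-03-31", "1993-08-31"))
)

# Per row: one sequence for each (start, end) pair, then concatenated
per_row <- do.call(c, Map(seq, test$start, test$end, by = "1 month"))

# Per id: a single sequence from the earliest start to the latest end
per_id <- seq(min(test$start), max(test$end), by = "1 month")

length(per_row)  # 6 months: January to March and June to August
length(per_id)   # 8 months: January to August, including the gap
```

So the grouped version fills in April and May, while the row-wise version does not; which one is appropriate depends on whether gaps between spells should appear in the result.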
One option using dplyr and tidyr to create a sequence for each row could be:
df %>%
  rowwise() %>%
  transmute(idnum,
            date = list(seq(start, end, by = "month"))) %>%
  unnest(date)

   idnum date      
   <int> <date>    
 1    17 1993-01-01
 2    17 1993-02-01
 3    17 1993-03-01
 4    17 1993-04-01
 5    17 1993-05-01
 6    17 1993-06-01
 7    17 1993-07-01
 8    17 1993-08-01
 9    17 1993-09-01
10    17 1993-10-01
# … with 26 more rows
Or creating the sequence using the grouping ID:

df %>%
  group_by(idnum) %>%
  transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
  unnest(date)
Or, when the goal is to create only one unique sequence per ID:

df %>%
  group_by(idnum) %>%
  summarise(start = min(start),
            end = max(end)) %>%
  transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
  unnest(date)

   date      
   <date>    
 1 1993-01-01
 2 1993-02-01
 3 1993-03-01
 4 1993-04-01
 5 1993-05-01
 6 1993-06-01
 7 1993-07-01
 8 1993-08-01
 9 1993-09-01
10 1993-10-01
11 1993-11-01
12 1993-12-01
Or using reframe(), available since dplyr 1.1.0:
df %>%
  rowwise() %>%
  reframe(idnum,
          date = seq(start, end, by = "month"))
With recent versions of purrr (0.3.0) and dplyr (0.8.0), this can be done with map2:
library(dplyr)
library(purrr)
library(tidyr)  # provides unnest()
test %>%
  # sequence of monthly dates for each corresponding start, end elements
  transmute(idnum, month = map2(start, end, seq, by = "1 month")) %>%
  # unnest the list column
  unnest(month) %>%
  # remove any duplicate rows
  distinct()
Based on @Ananda Mahto's comment:
library(reshape2)  # assumed source of melt(), which here stacks a named list
res1 <- melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"],
    test[x, "end"], by = "1 month")), test$idnum))
Also,
res2 <- setNames(do.call(`rbind`,
with(test,
Map(`expand.grid`,idnum,
Map(`seq`, start, end, by='1 month')))), c("idnum", "month"))
head(res2)
# idnum month
#1 17 1993-01-01
#2 17 1993-02-01
#3 17 1993-03-01
#4 17 1993-04-01
#5 17 1993-05-01
#6 17 1993-06-01
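As background on why the apply() call in the question fails while Map() works: apply() first converts the data frame to a matrix, and a mix of integer and Date columns becomes a character matrix, so seq() receives strings instead of Dates. The sketch below demonstrates this on a made-up two-row data frame (an assumption for illustration) and builds the result without expand.grid.

```r
# Hypothetical two-row version of the question's data (assumed for illustration)
test <- data.frame(
  idnum = c(17L, 27L),
  start = as.Date(c("1993-01-01", "1993-03-01")),
  end   = as.Date(c("1993-12-31", "1993-05-31"))
)

# apply() coerces the data frame to a character matrix, so each row
# holds strings rather than Date objects
row_class <- apply(test, 1, function(x) class(x[["start"]]))

# Map() walks over the columns in parallel and keeps the Date class
months <- Map(seq, test$start, test$end, by = "1 month")

# Repeat each idnum once per generated month and bind everything together
res <- data.frame(
  idnum = rep(test$idnum, lengths(months)),
  month = do.call(c, months)
)
```

Here row_class contains "character" for every row, which is what leads to the "non-numeric argument to binary operator" error when seq() is called inside apply().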
Another tidyverse approach is to use tidyr::expand:
library(dplyr, warn = FALSE)
library(tidyr)

df |> 
  mutate(
    row = row_number()
  ) |> 
  group_by(row) |> 
  expand(idnum, date = seq(start, end, "month")) |> 
  ungroup() |> 
  select(-row)
#> # A tibble: 36 × 2
#>    idnum date      
#>    <int> <date>    
#>  1    17 1993-01-01
#>  2    17 1993-02-01
#>  3    17 1993-03-01
#>  4    17 1993-04-01
#>  5    17 1993-05-01
#>  6    17 1993-06-01
#>  7    17 1993-07-01
#>  8    17 1993-08-01
#>  9    17 1993-09-01
#> 10    17 1993-10-01
#> # … with 26 more rows
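On the open point from the question about which format month should take: keeping it as a Date set to the first day of the month, as all the answers above do, is usually enough for grouping. The base R sketch below is only an illustration; the panel data frame and the outcome column y are made up, and the "YYYY-MM" label is an optional extra rather than something the answers require.

```r
# Hypothetical monthly panel with a made-up outcome column y (assumed)
panel <- data.frame(
  idnum = rep(c(17L, 27L), each = 3),
  month = rep(seq(as.Date("1993-01-01"), by = "month", length.out = 3), times = 2),
  y     = c(1, 2, 3, 4, 5, 6)
)

# Option 1: group directly on the Date column
by_date <- aggregate(y ~ idnum + month, data = panel, FUN = mean)

# Option 2: derive a "YYYY-MM" label, e.g. for use as a factor in a model
panel$ym <- format(panel$month, "%Y-%m")
by_label <- aggregate(y ~ ym, data = panel, FUN = mean)
```

The Date column keeps chronological ordering and date arithmetic available, while the character label is convenient when months should be treated as categories.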