Qui*_*ten 12 datetime r dataframe dplyr data.table
I have the following sample data frame called df (dput below):
group date indicator
1 A 2022-11-01 01:00:00 FALSE
2 A 2022-11-01 03:00:00 FALSE
3 A 2022-11-01 04:00:00 TRUE
4 A 2022-11-01 05:00:00 FALSE
5 A 2022-11-01 06:00:00 TRUE
6 A 2022-11-01 07:00:00 FALSE
7 A 2022-11-01 10:00:00 FALSE
8 A 2022-11-01 12:00:00 FALSE
9 B 2022-11-01 01:00:00 FALSE
10 B 2022-11-01 02:00:00 FALSE
11 B 2022-11-01 03:00:00 FALSE
12 B 2022-11-01 06:00:00 TRUE
13 B 2022-11-01 07:00:00 FALSE
14 B 2022-11-01 08:00:00 FALSE
15 B 2022-11-01 11:00:00 TRUE
16 B 2022-11-01 13:00:00 FALSE
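Within each group, I would like to calculate the difference in hours between each date and the nearest row where indicator == TRUE. Rows where indicator is TRUE should themselves return 0. Here is the desired output, called df_desired: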
group date indicator diff_hours
1 A 2022-11-01 01:00:00 FALSE 3
2 A 2022-11-01 03:00:00 FALSE 1
3 A 2022-11-01 04:00:00 TRUE 0
4 A 2022-11-01 05:00:00 FALSE 1
5 A 2022-11-01 06:00:00 TRUE 0
6 A 2022-11-01 07:00:00 FALSE 1
7 A 2022-11-01 10:00:00 FALSE 4
8 A 2022-11-01 12:00:00 FALSE 6
9 B 2022-11-01 01:00:00 FALSE 5
10 B 2022-11-01 02:00:00 FALSE 4
11 B 2022-11-01 03:00:00 FALSE 3
12 B 2022-11-01 06:00:00 TRUE 0
13 B 2022-11-01 07:00:00 FALSE 1
14 B 2022-11-01 08:00:00 FALSE 2
15 B 2022-11-01 11:00:00 TRUE 0
16 B 2022-11-01 13:00:00 FALSE 2
So I was wondering if anyone knows how to calculate the difference in hours between each date and the nearest conditioned row per group?
Here is the dput of df and df_desired:
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B"), date = structure(c(1667260800,
1667268000, 1667271600, 1667275200, 1667278800, 1667282400, 1667293200,
1667300400, 1667260800, 1667264400, 1667268000, 1667278800, 1667282400,
1667286000, 1667296800, 1667304000), class = c("POSIXct", "POSIXt"
), tzone = ""), indicator = c(FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE)), class = "data.frame", row.names = c(NA, -16L))
df_desired <- structure(list(group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B"), date = structure(c(1667260800,
1667268000, 1667271600, 1667275200, 1667278800, 1667282400, 1667293200,
1667300400, 1667260800, 1667264400, 1667268000, 1667278800, 1667282400,
1667286000, 1667296800, 1667304000), class = c("POSIXct", "POSIXt"
), tzone = ""), indicator = c(FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE), diff_hours = c(3, 1, 0, 1, 0, 1, 4, 6, 5, 4, 3,
0, 1, 2, 0, 2)), class = "data.frame", row.names = c(NA, -16L
))
Maë*_*aël 10
With map_dbl:
library(dplyr)
library(purrr)
# Subtracting POSIXct values lets R pick the difftime units automatically,
# so the numbers are hours here only because of the gaps in this data.
df %>% 
  group_by(group) %>% 
  mutate(diff_hours = map_dbl(date, ~ min(abs(.x - date[indicator]))))
Output:
# A tibble: 16 × 4
# Groups: group [2]
 group date indicator diff_hours
 <chr> <dttm> <lgl> <dbl>
 1 A 2022-11-01 01:00:00 FALSE 3
 2 A 2022-11-01 03:00:00 FALSE 1
 3 A 2022-11-01 04:00:00 TRUE 0
 4 A 2022-11-01 05:00:00 FALSE 1
 5 A 2022-11-01 06:00:00 TRUE 0
 6 A 2022-11-01 07:00:00 FALSE 1
 7 A 2022-11-01 10:00:00 FALSE 4
 8 A 2022-11-01 12:00:00 FALSE 6
 9 B 2022-11-01 01:00:00 FALSE 5
10 B 2022-11-01 02:00:00 FALSE 4
11 B 2022-11-01 03:00:00 FALSE 3
12 B 2022-11-01 06:00:00 TRUE 0
13 B 2022-11-01 07:00:00 FALSE 1
14 B 2022-11-01 08:00:00 FALSE 2
15 B 2022-11-01 11:00:00 TRUE 0
16 B 2022-11-01 13:00:00 FALSE 2
If you want to keep the original (signed, not absolute) difference:
df %>% 
  group_by(group) %>% 
  mutate(diff_hours = map_dbl(date, ~ (.x - date[indicator])[which.min(abs(.x - date[indicator]))]))
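The map_dbl approach above relies on R choosing the difftime units automatically, which may give minutes or days rather than hours when the gaps are shorter or longer. The following is a minimal sketch of one way to pin the unit, assuming it is acceptable to work in seconds since the epoch and divide by 3600. The `toy` data frame and its values are hypothetical and only serve to illustrate a half-hour gap.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data with a 30-minute gap (not part of the original question)
toy <- data.frame(
  group = c("A", "A", "A"),
  date = as.POSIXct(
    c("2022-11-01 01:00:00", "2022-11-01 01:30:00", "2022-11-01 04:00:00"),
    tz = "UTC"
  ),
  indicator = c(FALSE, TRUE, FALSE)
)

toy_out <- toy %>%
  group_by(group) %>%
  mutate(
    # Work on plain seconds so the result is always expressed in hours
    diff_hours = map_dbl(
      as.numeric(date),
      ~ min(abs(.x - as.numeric(date[indicator]))) / 3600
    )
  ) %>%
  ungroup()

toy_out
```

With this variant the first row is reported as 0.5 hours instead of 30 (minutes), so the column keeps a single unit whatever the spacing of the timestamps.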
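You can try data.table as shown below (there should be more efficient options than mine).
Using findInterval or roll = "nearest":
library(data.table)
# findInterval() selects the latest TRUE date at or before each row (or the
# first TRUE date if there is none); for this data that is also a nearest one.
setDT(df)[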
,
diff_hours := abs(
difftime(date,
date[indicator][pmax(1, findInterval(date, date[indicator]))],
units = "hours"
)
),
group
][]
Or
setDT(df)[
,
diff_hours := abs(
difftime(date,
.SD[indicator][.SD,
date,
by = group,
on = "date",
roll = "nearest",
mult = "first"
][, date],
units = "hours"
)
)
][]
which gives
group date indicator diff_hours
1: A 2022-11-01 01:00:00 FALSE 3 hours
2: A 2022-11-01 03:00:00 FALSE 1 hours
3: A 2022-11-01 04:00:00 TRUE 0 hours
4: A 2022-11-01 05:00:00 FALSE 1 hours
5: A 2022-11-01 06:00:00 TRUE 0 hours
6: A 2022-11-01 07:00:00 FALSE 1 hours
7: A 2022-11-01 10:00:00 FALSE 4 hours
8: A 2022-11-01 12:00:00 FALSE 6 hours
9: B 2022-11-01 01:00:00 FALSE 5 hours
10: B 2022-11-01 02:00:00 FALSE 4 hours
11: B 2022-11-01 03:00:00 FALSE 3 hours
12: B 2022-11-01 06:00:00 TRUE 0 hours
13: B 2022-11-01 07:00:00 FALSE 1 hours
14: B 2022-11-01 08:00:00 FALSE 2 hours
15: B 2022-11-01 11:00:00 TRUE 0 hours
16: B 2022-11-01 13:00:00 FALSE 2 hours
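The findInterval variant above only looks at the TRUE date at or before each row, which matches the nearest TRUE date for the sample data but not necessarily in general. The following is a rough sketch of how both neighbours could be compared. `nearest_hours` is a hypothetical helper name and `toy` is made-up data, neither comes from the original answer.

```r
library(data.table)

# Hypothetical helper: hours from each time in `x` to the nearest time in `ref`
nearest_hours <- function(x, ref) {
  x <- as.numeric(x)                 # seconds since the epoch
  ref <- sort(as.numeric(ref))       # findInterval() needs sorted breakpoints
  if (length(ref) == 0L) return(rep(NA_real_, length(x)))
  idx <- findInterval(x, ref)        # count of ref values <= x (0 if none)
  before <- ref[pmax(idx, 1L)]                # neighbour at or before x (else first ref)
  after <- ref[pmin(idx + 1L, length(ref))]   # neighbour after x (else last ref)
  pmin(abs(x - before), abs(x - after)) / 3600
}

# Made-up data where the nearest TRUE row lies after the FALSE row
toy <- data.table(
  group = "A",
  date = as.POSIXct(
    c("2022-11-01 06:00:00", "2022-11-01 10:00:00", "2022-11-01 11:00:00"),
    tz = "UTC"
  ),
  indicator = c(TRUE, FALSE, TRUE)
)

toy[, diff_hours := nearest_hours(date, date[indicator]), by = group]
toy[]
```

Here the 10:00 row gets 1 hour (distance to 11:00), whereas looking only at the preceding TRUE date would give 4 hours. Groups without any TRUE row return NA in this sketch.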
Using outer (not efficient because of the apply call):
setDT(df)[
,
diff_hours := apply(abs(outer(date, date[indicator], `-`)), 1, min) / 3600,
group
][]
You will see
group date indicator diff_hours
1: A 2022-11-01 01:00:00 FALSE 3
2: A 2022-11-01 03:00:00 FALSE 1
3: A 2022-11-01 04:00:00 TRUE 0
4: A 2022-11-01 05:00:00 FALSE 1
5: A 2022-11-01 06:00:00 TRUE 0
6: A 2022-11-01 07:00:00 FALSE 1
7: A 2022-11-01 10:00:00 FALSE 4
8: A 2022-11-01 12:00:00 FALSE 6
9: B 2022-11-01 01:00:00 FALSE 5
10: B 2022-11-01 02:00:00 FALSE 4
11: B 2022-11-01 03:00:00 FALSE 3
12: B 2022-11-01 06:00:00 TRUE 0
13: B 2022-11-01 07:00:00 FALSE 1
14: B 2022-11-01 08:00:00 FALSE 2
15: B 2022-11-01 11:00:00 TRUE 0
16: B 2022-11-01 13:00:00 FALSE 2