Calculate the difference in hours between each date and the nearest condition row per group in R

Qui*_*ten 12 datetime r dataframe dplyr data.table

I have the following sample data frame called df (dput below):

   group                date indicator
1      A 2022-11-01 01:00:00     FALSE
2      A 2022-11-01 03:00:00     FALSE
3      A 2022-11-01 04:00:00      TRUE
4      A 2022-11-01 05:00:00     FALSE
5      A 2022-11-01 06:00:00      TRUE
6      A 2022-11-01 07:00:00     FALSE
7      A 2022-11-01 10:00:00     FALSE
8      A 2022-11-01 12:00:00     FALSE
9      B 2022-11-01 01:00:00     FALSE
10     B 2022-11-01 02:00:00     FALSE
11     B 2022-11-01 03:00:00     FALSE
12     B 2022-11-01 06:00:00      TRUE
13     B 2022-11-01 07:00:00     FALSE
14     B 2022-11-01 08:00:00     FALSE
15     B 2022-11-01 11:00:00      TRUE
16     B 2022-11-01 13:00:00     FALSE

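Within each group, I would like to calculate the difference in hours between each row's date and the date of the nearest row where indicator == TRUE. Rows where indicator is TRUE should themselves return 0. Here is the desired output, called df_desired: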

   group                date indicator diff_hours
1      A 2022-11-01 01:00:00     FALSE          3
2      A 2022-11-01 03:00:00     FALSE          1
3      A 2022-11-01 04:00:00      TRUE          0
4      A 2022-11-01 05:00:00     FALSE          1
5      A 2022-11-01 06:00:00      TRUE          0
6      A 2022-11-01 07:00:00     FALSE          1
7      A 2022-11-01 10:00:00     FALSE          4
8      A 2022-11-01 12:00:00     FALSE          6
9      B 2022-11-01 01:00:00     FALSE          5
10     B 2022-11-01 02:00:00     FALSE          4
11     B 2022-11-01 03:00:00     FALSE          3
12     B 2022-11-01 06:00:00      TRUE          0
13     B 2022-11-01 07:00:00     FALSE          1
14     B 2022-11-01 08:00:00     FALSE          2
15     B 2022-11-01 11:00:00      TRUE          0
16     B 2022-11-01 13:00:00     FALSE          2

So I was wondering if anyone knows how to calculate the difference in hours between each date and the nearest condition row per group?


Here is the dput of df and df_desired:

df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B"), date = structure(c(1667260800, 
1667268000, 1667271600, 1667275200, 1667278800, 1667282400, 1667293200, 
1667300400, 1667260800, 1667264400, 1667268000, 1667278800, 1667282400, 
1667286000, 1667296800, 1667304000), class = c("POSIXct", "POSIXt"
), tzone = ""), indicator = c(FALSE, FALSE, TRUE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, 
TRUE, FALSE)), class = "data.frame", row.names = c(NA, -16L))

df_desired <- structure(list(group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B"), date = structure(c(1667260800, 
1667268000, 1667271600, 1667275200, 1667278800, 1667282400, 1667293200, 
1667300400, 1667260800, 1667264400, 1667268000, 1667278800, 1667282400, 
1667286000, 1667296800, 1667304000), class = c("POSIXct", "POSIXt"
), tzone = ""), indicator = c(FALSE, FALSE, TRUE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, 
TRUE, FALSE), diff_hours = c(3, 1, 0, 1, 0, 1, 4, 6, 5, 4, 3, 
0, 1, 2, 0, 2)), class = "data.frame", row.names = c(NA, -16L
))

Maë*_*aël 10

Using map_dbl:

library(dplyr)
library(purrr)

df %>% 
  group_by(group) %>% 
  # explicit units keep the result in hours, whatever the size of the gap
  mutate(diff_hours = map_dbl(date, ~ min(abs(as.numeric(difftime(.x, date[indicator], units = "hours"))))))

Output

# A tibble: 16 × 4
# Groups:   group [2]
   group date                indicator diff_hours
   <chr> <dttm>              <lgl>          <dbl>
 1 A     2022-11-01 01:00:00 FALSE              3
 2 A     2022-11-01 03:00:00 FALSE              1
 3 A     2022-11-01 04:00:00 TRUE               0
 4 A     2022-11-01 05:00:00 FALSE              1
 5 A     2022-11-01 06:00:00 TRUE               0
 6 A     2022-11-01 07:00:00 FALSE              1
 7 A     2022-11-01 10:00:00 FALSE              4
 8 A     2022-11-01 12:00:00 FALSE              6
 9 B     2022-11-01 01:00:00 FALSE              5
10 B     2022-11-01 02:00:00 FALSE              4
11 B     2022-11-01 03:00:00 FALSE              3
12 B     2022-11-01 06:00:00 TRUE               0
13 B     2022-11-01 07:00:00 FALSE              1
14 B     2022-11-01 08:00:00 FALSE              2
15 B     2022-11-01 11:00:00 TRUE               0
16 B     2022-11-01 13:00:00 FALSE              2
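To make the per-row logic concrete, here is a small base R sketch of what happens for a single row: take the gaps to every TRUE date of the group and keep the smallest. The values are copied from group A of the question (the 10:00 row and the two TRUE rows); the variable names and the UTC time zone are assumptions for illustration only.

```r
# TRUE rows of group A (04:00 and 06:00); tz = "UTC" is an assumption
true_dates <- as.POSIXct(c("2022-11-01 04:00:00", "2022-11-01 06:00:00"), tz = "UTC")

# One row of group A to evaluate (10:00)
x <- as.POSIXct("2022-11-01 10:00:00", tz = "UTC")

# Absolute gap, in hours, to each TRUE row
gaps <- abs(as.numeric(difftime(x, true_dates, units = "hours")))

# The smallest gap is the value reported for this row
nearest_gap <- min(gaps)
```

Here gaps holds 6 and 4, so the row gets 4, which matches row 7 of the desired output.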

If you want to keep the original signed difference (rather than the absolute value):

df %>% 
  group_by(group) %>% 
  mutate(diff_hours = map_dbl(date, ~ {
    d <- as.numeric(difftime(.x, date[indicator], units = "hours"))  # signed gaps in hours
    d[which.min(abs(d))]  # keep the sign of the nearest one
  }))
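A note on units when subtracting POSIXct values: `-` returns a difftime whose unit is chosen automatically from the smallest gap, so the same expression can come back in hours for one input and in days for another. The sketch below shows this with made-up timestamps (t0, near and far are hypothetical names, not from the question) and how difftime() with units = "hours" pins the unit.

```r
# Hypothetical timestamps, only for illustration
t0   <- as.POSIXct("2022-11-01 01:00:00", tz = "UTC")
near <- t0 + 3 * 3600    # 3 hours later
far  <- t0 + 3 * 86400   # 3 days later

# Automatic unit selection differs between the two subtractions
unit_near <- units(near - t0)   # "hours"
unit_far  <- units(far - t0)    # "days"

# Pinning the unit makes the numeric values comparable across rows
hours_near <- as.numeric(difftime(near, t0, units = "hours"))
hours_far  <- as.numeric(difftime(far,  t0, units = "hours"))
```

With the unit pinned, hours_near is 3 and hours_far is 72, whereas as.numeric(far - t0) alone would give 3 (days).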


Tho*_*ing 6

You can try data.table as shown below (there are probably more efficient options than mine)

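  • Using findInterval or roll = "nearest"
library(data.table)

setDT(df)[
  ,
  diff_hours := abs(
    difftime(date,
      # findInterval() gives the last indicator date at or before each row;
      # pmax(1, ...) falls back to the first one for rows before any TRUE row
      date[indicator][pmax(1, findInterval(date, date[indicator]))],
      units = "hours"
    )
  ),
  group
][]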

or

setDT(df)[
  ,
  diff_hours := abs(
    difftime(date,
      # rolling join within each group; x.date is the matched indicator date
      .SD[(indicator)][.SD,
        x.date,
        on = c("group", "date"),
        roll = "nearest"
      ],
      units = "hours"
    )
  )
][]

This gives

    group                date indicator diff_hours
 1:     A 2022-11-01 01:00:00     FALSE    3 hours
 2:     A 2022-11-01 03:00:00     FALSE    1 hours
 3:     A 2022-11-01 04:00:00      TRUE    0 hours
 4:     A 2022-11-01 05:00:00     FALSE    1 hours
 5:     A 2022-11-01 06:00:00      TRUE    0 hours
 6:     A 2022-11-01 07:00:00     FALSE    1 hours
 7:     A 2022-11-01 10:00:00     FALSE    4 hours
 8:     A 2022-11-01 12:00:00     FALSE    6 hours
 9:     B 2022-11-01 01:00:00     FALSE    5 hours
10:     B 2022-11-01 02:00:00     FALSE    4 hours
11:     B 2022-11-01 03:00:00     FALSE    3 hours
12:     B 2022-11-01 06:00:00      TRUE    0 hours
13:     B 2022-11-01 07:00:00     FALSE    1 hours
14:     B 2022-11-01 08:00:00     FALSE    2 hours
15:     B 2022-11-01 11:00:00      TRUE    0 hours
16:     B 2022-11-01 13:00:00     FALSE    2 hours
  • Using outer (not very efficient because of apply)
setDT(df)[
  ,
  diff_hours := apply(abs(outer(date, date[indicator], `-`)), 1, min) / 3600,
  group
][]

You will see

    group                date indicator diff_hours
 1:     A 2022-11-01 01:00:00     FALSE          3
 2:     A 2022-11-01 03:00:00     FALSE          1
 3:     A 2022-11-01 04:00:00      TRUE          0
 4:     A 2022-11-01 05:00:00     FALSE          1
 5:     A 2022-11-01 06:00:00      TRUE          0
 6:     A 2022-11-01 07:00:00     FALSE          1
 7:     A 2022-11-01 10:00:00     FALSE          4
 8:     A 2022-11-01 12:00:00     FALSE          6
 9:     B 2022-11-01 01:00:00     FALSE          5
10:     B 2022-11-01 02:00:00     FALSE          4
11:     B 2022-11-01 03:00:00     FALSE          3
12:     B 2022-11-01 06:00:00      TRUE          0
13:     B 2022-11-01 07:00:00     FALSE          1
14:     B 2022-11-01 08:00:00     FALSE          2
15:     B 2022-11-01 11:00:00      TRUE          0
16:     B 2022-11-01 13:00:00     FALSE          2
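One caveat on the findInterval variant above: findInterval() only looks backwards, so it picks the last indicator date at or before each row. That coincides with the nearest one in the sample data, but not when the following TRUE row is closer. Below is a minimal sketch of a two-sided version, assuming the indicator dates are non-empty and sorted within the group; nearest_gap_hours is a hypothetical helper name, not part of data.table or of the answers above, and the timestamps are made up.

```r
# Hypothetical helper: gap in hours from each element of `x` to the nearest
# element of `ref`. Assumes `ref` is non-empty and sorted in increasing order.
nearest_gap_hours <- function(x, ref) {
  i <- findInterval(as.numeric(x), as.numeric(ref))  # last ref <= x (0 if none)
  before <- ref[pmax(1L, i)]                # candidate at or before x
  after  <- ref[pmin(length(ref), i + 1L)]  # candidate after x
  gap_before <- abs(as.numeric(difftime(x, before, units = "hours")))
  gap_after  <- abs(as.numeric(difftime(x, after,  units = "hours")))
  pmin(gap_before, gap_after)               # keep the closer of the two
}

# Made-up example in which the following TRUE date is the closer one for 10:00
ref <- as.POSIXct(c("2022-11-01 04:00:00", "2022-11-01 11:00:00"), tz = "UTC")
x   <- as.POSIXct(c("2022-11-01 01:00:00", "2022-11-01 10:00:00",
                    "2022-11-01 13:00:00"), tz = "UTC")
gap <- nearest_gap_hours(x, ref)
```

In this example gap is 3, 1, 2: the 10:00 value is matched to 11:00 rather than to 04:00. Inside data.table, such a helper could presumably be applied per group with diff_hours := nearest_gap_hours(date, date[indicator]) and by = group.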