Computing timestamps for data with a known frequency and missing data

blo*_*rth 7 timestamp r dplyr data.table

My data looks like the following, where rows of type "S" carry a timestamp, and I need to assign timestamps to the "D" rows.

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     NA                        229
 3 M     NA                         NA
 4 D     NA                        230
 5 D     NA                        231
 6 D     NA                        232
 7 D     NA                        233
 8 D     NA                        234
 9 D     NA                        235
10 D     NA                        236
11 D     NA                        237
12 D     NA                        238
13 D     NA                        239
14 S     2024-01-24 16:11:11.000    NA
15 D     NA                        241
16 D     NA                        242
17 D     NA                        243
18 D     NA                        126
19 D     NA                        127
20 S     2024-01-24 16:13:29.000    NA
21 D     NA                        128

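"count" is a 1-byte counter that runs from 0 to 255 and then repeats. A missing count indicates a missing data row. Data lines are sent at 16 Hz, so each count increment represents 1/16 of a second. I'm trying to assign the correct timestamp to each D row by taking the timestamp of the nearest S row and offsetting it by the count difference between the current D row and the D row immediately following that S row. Normally there is one S line per second, but I picked this subset to show some of the problems in the data, chiefly the 2:18 gap at row 17.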
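To make the counter arithmetic concrete, here is a minimal base-R sketch. The helper name `unwrap_count` is hypothetical and not taken from the repository. It assumes that every decrease in count marks exactly one rollover, so a gap longer than 256 counts (16 s), like the 2:18 gap in this sample, is undercounted; only differences relative to a nearby S row should be trusted.

```r
# Hypothetical helper: turn a wrapping 0-255 counter into a monotone sequence.
# Assumption: each decrease between consecutive values is exactly one rollover.
unwrap_count <- function(count, modulus = 256L) {
  wraps <- cumsum(c(FALSE, diff(count) < 0))
  count + modulus * wraps
}

count <- c(254L, 255L, 0L, 1L, 3L)  # count 2 is absent: one dropped row
unwrapped <- unwrap_count(count)
unwrapped
# 254 255 256 257 259

# Seconds elapsed since the first row at 16 Hz
offsets <- (unwrapped - unwrapped[1]) / 16
offsets
# 0.0000 0.0625 0.1250 0.1875 0.3125
```

The dropped row shows up as a step of 2/16 s between the last two values rather than 1/16 s.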

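I found an approach that works, but it is very slow (4 ms/row, and I have roughly 1 million rows per day to process, in files spanning multiple days). The real data lives in files whose lines come in several formats (ick), and the times and counts in this example were parsed out of them. This sounds like an Advent of Code problem, but sadly, this system is real.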

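If you want to see my slow solution or look at more complete data, it is in this file in the repository: https://github.com/blongworth/mlabtools/blob/main/R/time_alignment.R. The data above is simplified, so the method in the repository will not run on the reprex data without modification. There are some tests, but not yet a set of tests showing what the result for this reprex should be.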

Any ideas on how to do this efficiently? I may eventually have to move to data.table, but as long as I start from more efficient logic, I think I can get there.

Here is the dput output for the test df above:

structure(list(type = c("$", "D", "M", "D", "D", "D", "D", "D", 
"D", "D", "D", "D", "D", "S", "D", "D", "D", "D", "D", "S", "D"
), timestamp = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 1706130671, NA, NA, NA, NA, NA, 1706130809, NA
), tzone = "America/New_York", class = c("POSIXct", "POSIXt")), 
    count = c(NA, 229L, NA, 230L, 231L, 232L, 233L, 234L, 235L, 
    236L, 237L, 238L, 239L, NA, 241L, 242L, 243L, 126L, 127L, 
    NA, 128L)), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame"))

Here is the sample data with the expected output:

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     2024-01-24 16:11:10.250   229
 3 M     NA                         NA
 4 D     2024-01-24 16:11:10.312   230
 5 D     2024-01-24 16:11:10.375   231
 6 D     2024-01-24 16:11:10.437   232
 7 D     2024-01-24 16:11:10.500   233
 8 D     2024-01-24 16:11:10.562   234
 9 D     2024-01-24 16:11:10.625   235
10 D     2024-01-24 16:11:10.687   236
11 D     2024-01-24 16:11:10.750   237
12 D     2024-01-24 16:11:10.812   238
13 D     2024-01-24 16:11:10.875   239
14 S     2024-01-24 16:11:11.000    NA
15 D     2024-01-24 16:11:11.000   241
16 D     2024-01-24 16:11:11.062   242
17 D     2024-01-24 16:11:11.125   243
18 D     2024-01-24 16:13:28.875   126
19 D     2024-01-24 16:13:28.937   127
20 S     2024-01-24 16:13:29.000    NA
21 D     2024-01-24 16:13:29.000   128

r2e*_*ans 2

Here's a shot at it with some timestamp gymnastics.

library(dplyr)
# library(tidyr) # fill
df |>
  mutate(count2 = count, nexttime = timestamp, prevtime = timestamp) |>
  tidyr::fill(count2, .direction = "updown") |>
  mutate(
    count2 = count2 + 256*cumsum(c(FALSE, diff(count2) < 0)),
    nextind = if_else(is.na(timestamp), count2[NA], count2),
    prevind = nextind
  ) |>
  tidyr::fill(prevtime, prevind, .direction = "down") |>
  tidyr::fill(nexttime, nextind, .direction = "up") |>
  mutate(
    newtimestamp = case_when(
      !is.na(timestamp) ~ timestamp,
      is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind) ~
        nexttime + (count2 - nextind)/16,
      TRUE ~
        prevtime + (count2 - prevind)/16
    )
  ) |>
  select(names(df), newtimestamp)
# # A tibble: 21 × 4
#    type  timestamp               count newtimestamp           
#    <chr> <dttm>                  <int> <dttm>                 
#  1 $     NA                         NA 2024-01-24 16:11:10.250
#  2 D     NA                        229 2024-01-24 16:11:10.250
#  3 M     NA                         NA 2024-01-24 16:11:10.312
#  4 D     NA                        230 2024-01-24 16:11:10.312
#  5 D     NA                        231 2024-01-24 16:11:10.375
#  6 D     NA                        232 2024-01-24 16:11:10.437
#  7 D     NA                        233 2024-01-24 16:11:10.500
#  8 D     NA                        234 2024-01-24 16:11:10.562
#  9 D     NA                        235 2024-01-24 16:11:10.625
# 10 D     NA                        236 2024-01-24 16:11:10.687
# 11 D     NA                        237 2024-01-24 16:11:10.750
# 12 D     NA                        238 2024-01-24 16:11:10.812
# 13 D     NA                        239 2024-01-24 16:11:10.875
# 14 S     2024-01-24 16:11:11.000    NA 2024-01-24 16:11:11.000
# 15 D     NA                        241 2024-01-24 16:11:11.000
# 16 D     NA                        242 2024-01-24 16:11:11.062
# 17 D     NA                        243 2024-01-24 16:11:11.125
# 18 D     NA                        126 2024-01-24 16:13:28.875
# 19 D     NA                        127 2024-01-24 16:13:28.937
# 20 S     2024-01-24 16:13:29.000    NA 2024-01-24 16:13:29.000
# 21 D     NA                        128 2024-01-24 16:13:29.000

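Notes:

  • count2 is just count with its NAs fully interpolated.
  • nexttime/prevtime carry timestamp backward and forward until another non-NA timestamp appears; I choose which one to use in case_when.
  • nextind/prevind are subtracted from count2 so that I can compute the offset in 1/16-second steps.
  • case_when is where most of the logic actually happens: it decides whether timestamp should be kept as-is, or set to (count2-nextind)/16 (or prevind) sixteenths of a second away from nexttime (or prevtime).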
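As a worked check of the notes above, here is the arithmetic for row 18 of the sample written out by hand in base R. The values are read off the sample data; the variable names mirror the pipeline's columns, but this is only an illustrative sketch, not part of the solution itself.

```r
# Row 18: raw count 126, which follows 243, so one rollover has been added
count2 <- 126 + 256   # 382

# Row 20 is the next S row; its filled count comes from row 21 (128)
nextind  <- 128 + 256 # 384
nexttime <- as.POSIXct("2024-01-24 16:13:29", tz = "America/New_York")

# Row 14 is the previous S row; its filled count comes from row 15 (241)
prevind  <- 241
prevtime <- as.POSIXct("2024-01-24 16:11:11", tz = "America/New_York")

# The next S row is nearer in counts (2 vs. 141), so it becomes the anchor
use_next <- abs(count2 - nextind) < abs(count2 - prevind)
newtimestamp <- if (use_next) {
  nexttime + (count2 - nextind) / 16
} else {
  prevtime + (count2 - prevind) / 16
}
format(newtimestamp, "%Y-%m-%d %H:%M:%OS3")
# "2024-01-24 16:13:28.875"
```

Anchoring on the nearer S row is what keeps the 2:18 gap from leaking into rows 18 and 19: anchored on row 14 instead, row 18 would land at 16:11:19.812.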

The data.table solution looks very similar. With R 4.3 or later (where the `_` placeholder can head an extraction call), we can use the |> _[] form:

library(data.table)
out <- as.data.table(df) |>
  _[, count2 := nafill(nafill(count, type = "nocb"), type = "locf") ] |>
  _[, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0)) ] |>
  _[, nextind := fifelse(is.na(timestamp), count2[NA], count2) ] |>
  _[, prevind := nextind ] |>
  _[, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")] |>
  _[, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")] |>
  _[, newtimestamp := fcase(
    !is.na(timestamp), timestamp,
    is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
    rep(TRUE, .N), prevtime + (count2 - prevind)/16) ] |>
  _[, .SD, .SDcols = c(names(df), "newtimestamp")]

On an earlier version of R, we can use data.table's ][-piping.

DT <- as.data.table(df) # setDT(df) is canonical, avoiding that here for side-effect
DT[, count2 := nafill(nafill(count, type = "nocb"), type = "locf")
   ][, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0))
   ][, nextind := fifelse(is.na(timestamp), count2[NA], count2)
   ][, prevind := nextind
   ][, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")
   ][, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")
   ][, newtimestamp := fcase(
     !is.na(timestamp), timestamp,
     is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
     rep(TRUE, .N), prevtime + (count2 - prevind)/16)
   ][, .SD, .SDcols = c(names(df), "newtimestamp")]

I prefer tidyr::fill's .direction="updown": it reduces the call stack and is easier to read in a pipeline like this.