blo*_*rth 7 timestamp r dplyr data.table
My data looks like the following. The "S" rows contain timestamps, and I need to assign timestamps to the "D" rows.
type timestamp count
<chr> <dttm> <int>
1 $ NA NA
2 D NA 229
3 M NA NA
4 D NA 230
5 D NA 231
6 D NA 232
7 D NA 233
8 D NA 234
9 D NA 235
10 D NA 236
11 D NA 237
12 D NA 238
13 D NA 239
14 S 2024-01-24 16:11:11.000 NA
15 D NA 241
16 D NA 242
17 D NA 243
18 D NA 126
19 D NA 127
20 S 2024-01-24 16:13:29.000 NA
21 D NA 128
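"count" is a 1-byte counter that runs from 0 to 255 and then wraps. A missing count means a missing data row. Data lines are sent at 16 Hz, so each count increment represents 1/16 of a second. I am trying to use the count of each D row to assign its timestamp: take the nearest S row's timestamp, then offset it by the count difference between the current D row and the D row immediately following that S row. Normally an S line arrives once per second, but I chose this subset to show some problems with the data, mainly the gap of 2:18 after row 17.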
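To make the arithmetic concrete, here is a minimal sketch of the offset calculation for rows near a single S row. The helper `signed_count_diff()` is hypothetical (it is not from the post or the linked repo), and it assumes the true distance to the reference row is within -128 to 127 counts (about 8 seconds either way), so it would not handle the long gap by itself.

```r
# Hypothetical helper: signed difference between two readings of the
# 0-255 counter, assuming the true difference lies in [-128, 127].
signed_count_diff <- function(count, ref_count) {
  ((count - ref_count + 128) %% 256) - 128
}

# The S row at 16:11:11 is immediately followed by the D row with count 241.
anchor_time  <- as.POSIXct("2024-01-24 16:11:11", tz = "America/New_York")
anchor_count <- 241

d_counts <- c(239, 241, 243)                                  # rows 13, 15, 17
offsets  <- signed_count_diff(d_counts, anchor_count) / 16    # seconds at 16 Hz
d_times  <- anchor_time + offsets

offsets
# [1] -0.125  0.000  0.125

# The wrap from 254 to 2 counts as 4 steps forward, not 252 steps back.
signed_count_diff(2, 254)
# [1] 4
```

These offsets correspond to 16:11:10.875, 16:11:11.000 and 16:11:11.125, which agree with rows 13, 15 and 17 of the expected output.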
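I found a method that works, but it is very slow (4 ms per row, with about 1 million rows per day and files that span multiple days). The real data lives in files whose rows come in several formats (ick), and the time and count in this example were parsed out of them. This sounds like an Advent of Code problem, but sadly, this system is real.
If you want to see my slow solution or more complete data, it is in this file in the repository: https://github.com/blongworth/mlabtools/blob/main/R/time_alignment.R
The data above is simplified, so the methods in the repository will not run on the reprex data without modification. There are some tests, but not yet a set of tests that shows what the result for this reprex should be.
Any ideas on how to do this efficiently? I may eventually have to go to data.table, but as long as I start from more efficient logic, I think I can get there.
Here is the dput output of the test df above: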
structure(list(type = c("$", "D", "M", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "S", "D", "D", "D", "D", "D", "S", "D"
), timestamp = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 1706130671, NA, NA, NA, NA, NA, 1706130809, NA
), tzone = "America/New_York", class = c("POSIXct", "POSIXt")),
count = c(NA, 229L, NA, 230L, 231L, 232L, 233L, 234L, 235L,
236L, 237L, 238L, 239L, NA, 241L, 242L, 243L, 126L, 127L,
NA, 128L)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
Here is the sample data with the expected output:
type timestamp count
<chr> <dttm> <int>
1 $ NA NA
2 D 2024-01-24 16:11:10.250 229
3 M NA NA
4 D 2024-01-24 16:11:10.312 230
5 D 2024-01-24 16:11:10.375 231
6 D 2024-01-24 16:11:10.437 232
7 D 2024-01-24 16:11:10.500 233
8 D 2024-01-24 16:11:10.562 234
9 D 2024-01-24 16:11:10.625 235
10 D 2024-01-24 16:11:10.687 236
11 D 2024-01-24 16:11:10.750 237
12 D 2024-01-24 16:11:10.812 238
13 D 2024-01-24 16:11:10.875 239
14 S 2024-01-24 16:11:11.000 NA
15 D 2024-01-24 16:11:11.000 241
16 D 2024-01-24 16:11:11.062 242
17 D 2024-01-24 16:11:11.125 243
18 D 2024-01-24 16:13:28.875 126
19 D 2024-01-24 16:13:28.937 127
20 S 2024-01-24 16:13:29.000 NA
21 D 2024-01-24 16:13:29.000 128
Here's a shot at it with some timestamp gymnastics.
library(dplyr)
# library(tidyr) # fill
df |>
  mutate(count2 = count, nexttime = timestamp, prevtime = timestamp) |>
  # fill count over the NA rows (up first, then down)
  tidyr::fill(count2, .direction = "updown") |>
  mutate(
    # unwrap the 1-byte counter: add 256 after every decrease
    count2 = count2 + 256*cumsum(c(FALSE, diff(count2) < 0)),
    # counter value on rows that carry a timestamp, NA elsewhere
    nextind = if_else(is.na(timestamp), count2[NA], count2),
    prevind = nextind
  ) |>
  # carry the previous / next timestamped row's time and counter
  tidyr::fill(prevtime, prevind, .direction = "down") |>
  tidyr::fill(nexttime, nextind, .direction = "up") |>
  mutate(
    newtimestamp = case_when(
      !is.na(timestamp) ~ timestamp,
      is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind) ~
        nexttime + (count2 - nextind)/16,
      TRUE ~
        prevtime + (count2 - prevind)/16
    )
  ) |>
  select(names(df), newtimestamp)
# # A tibble: 21 × 4
#    type  timestamp               count newtimestamp
#    <chr> <dttm>                  <int> <dttm>
#  1 $     NA                         NA 2024-01-24 16:11:10.250
#  2 D     NA                        229 2024-01-24 16:11:10.250
#  3 M     NA                         NA 2024-01-24 16:11:10.312
#  4 D     NA                        230 2024-01-24 16:11:10.312
#  5 D     NA                        231 2024-01-24 16:11:10.375
#  6 D     NA                        232 2024-01-24 16:11:10.437
#  7 D     NA                        233 2024-01-24 16:11:10.500
#  8 D     NA                        234 2024-01-24 16:11:10.562
#  9 D     NA                        235 2024-01-24 16:11:10.625
# 10 D     NA                        236 2024-01-24 16:11:10.687
# 11 D     NA                        237 2024-01-24 16:11:10.750
# 12 D     NA                        238 2024-01-24 16:11:10.812
# 13 D     NA                        239 2024-01-24 16:11:10.875
# 14 S     2024-01-24 16:11:11.000    NA 2024-01-24 16:11:11.000
# 15 D     NA                        241 2024-01-24 16:11:11.000
# 16 D     NA                        242 2024-01-24 16:11:11.062
# 17 D     NA                        243 2024-01-24 16:11:11.125
# 18 D     NA                        126 2024-01-24 16:13:28.875
# 19 D     NA                        127 2024-01-24 16:13:28.937
# 20 S     2024-01-24 16:13:29.000    NA 2024-01-24 16:13:29.000
# 21 D     NA                        128 2024-01-24 16:13:29.000

Notes:
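- count2 exists only so that the counts are fully interpolated over the NAs.
- nexttime/prevtime carry timestamp backward and forward until another non-NA timestamp appears; I choose which one to use in the case_when.
- nextind/prevind are there so that I can count the 1/16-second steps from count2.
- case_when is where most of the logic actually happens: it decides whether the original timestamp should be kept, or whether the row is (count2 - nextind)/16 (or the prevind equivalent) seconds away from nexttime (prevtime).

The data.table solution looks very similar. With R 4.3 or later, where the _ placeholder can be the head of an extraction call, we can use the |> _[] format: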
library(data.table)
out <- as.data.table(df) |>
  _[, count2 := nafill(nafill(count, type = "nocb"), type = "locf") ] |>
  _[, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0)) ] |>
  _[, nextind := fifelse(is.na(timestamp), count2[NA], count2) ] |>
  _[, prevind := nextind ] |>
  _[, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")] |>
  _[, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")] |>
  _[, newtimestamp := fcase(
    !is.na(timestamp), timestamp,
    is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
    rep(TRUE, .N), prevtime + (count2 - prevind)/16) ] |>
  _[, .SD, .SDcols = c(names(df), "newtimestamp")]

On an R version older than 4.3, we can use data.table's ][-piping instead.
DT <- as.data.table(df) # setDT(df) is canonical, avoiding that here for side-effect
DT[, count2 := nafill(nafill(count, type = "nocb"), type = "locf")
  ][, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0))
  ][, nextind := fifelse(is.na(timestamp), count2[NA], count2)
  ][, prevind := nextind
  ][, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")
  ][, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")
  ][, newtimestamp := fcase(
    !is.na(timestamp), timestamp,
    is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
    rep(TRUE, .N), prevtime + (count2 - prevind)/16)
  ][, .SD, .SDcols = c(names(df), "newtimestamp")]

I prefer tidyr::fill's .direction = "updown"; it reduces the call stack and is easier to read in a pipeline like this.
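The counter-unwrapping expression that both pipelines share can be looked at in isolation. The sketch below wraps it in a function named `unwrap_counter()`, which is a hypothetical name used only for illustration. It assumes the input has no NAs (the pipelines fill them first) and that every decrease in the counter means exactly one wrap, so a gap longer than one full cycle of 256 counts would be undercounted.

```r
# Hypothetical wrapper around the unwrapping expression used in the answer:
# every time the counter decreases, assume it wrapped once and add `modulus`.
unwrap_counter <- function(count, modulus = 256) {
  count + modulus * cumsum(c(FALSE, diff(count) < 0))
}

# A clean wrap from 255 back to 0 becomes a monotone sequence.
clean <- unwrap_counter(c(253, 254, 255, 0, 1, 2))
clean
# [1] 253 254 255 256 257 258

# The D-row counts around the long gap in the question (rows 15 to 21).
gap <- unwrap_counter(c(241, 242, 243, 126, 127, 128))
gap
# [1] 241 242 243 382 383 384
```

After unwrapping, differences between nearby rows are plain subtractions, which is what lets the pipelines compute (count2 - nextind)/16 without any modular arithmetic.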