Run-length encoding with group by

the*_*ide 1 r dplyr data.table

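I'm still a novice with the functions in data.table. My goal is to use rle() or rleid() while grouping by multiple variables. rle() is not a typical summary statistic.

In my test dataset below, I want to count runs of consecutive records in which a given bike (bike_id) stays at the same address, and then group the result by date and bike_id.

Some test data: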

> dat
                   time bike_id          address
 1: 2017-11-22 15:45:34       1        Waters Rd
 2: 2017-11-22 15:50:16       1        Waters Rd
 3: 2017-11-22 16:00:03       1   Washington Ave
 4: 2017-11-22 16:10:03       1   Washington Ave
 5: 2017-11-22 16:20:02       1   Washington Ave
 6: 2017-11-22 16:30:02       2       Shady Lane
 7: 2017-11-22 16:40:03       2     Comstock Ave
 8: 2017-11-22 16:50:02       2     Comstock Ave
 9: 2017-11-22 17:00:02       2     Comstock Ave
10: 2017-11-22 17:10:02       2     Comstock Ave
11: 2017-11-22 17:20:03       3   Scranton Drive
12: 2017-11-22 17:30:03       3   Scranton Drive
13: 2017-11-22 17:40:03       3   Scranton Drive
14: 2017-11-22 17:50:03       3       Shady Lane
15: 2017-11-22 18:00:04       3   Scranton Drive
16: 2017-11-23 18:10:03       1       Shady Lane
17: 2017-11-23 18:20:03       1       Shady Lane
18: 2017-11-23 18:30:02       1       Shady Lane
19: 2017-11-23 18:40:03       1       Shady Lane
20: 2017-11-23 18:50:03       1       Shady Lane
21: 2017-11-23 19:00:03       2      Lovers Lane
22: 2017-11-23 19:10:02       2 Mulholland Drive
23: 2017-11-23 19:20:03       2 Mulholland Drive
24: 2017-11-23 19:30:02       2 Mulholland Drive
25: 2017-11-23 19:40:03       2 Mulholland Drive
                   time bike_id          address

I know that rle(dat$address) produces the third column of the desired output below, but I'm not sure how to apply rle() by group in data.table.

> output
         date bike_id rle
1  2017-11-22       1   2
2  2017-11-22       1   3
3  2017-11-22       2   1
4  2017-11-22       2   4
5  2017-11-22       3   3
6  2017-11-22       3   1
7  2017-11-22       3   1
8  2017-11-23       1   5
9  2017-11-23       2   1
10 2017-11-23       2   4

Any advice would be helpful!

Here is the sample data:

> dput(dat)
structure(list(time = structure(c(1511383534.43394, 1511383816.49785, 
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895, 
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818, 
1511389203.52712, 1511389803.652, 1511390403.26619, 1511391003.79218, 
1511391604.30061, 1511478603.55103, 1511479203.60366, 1511479802.97132, 
1511480403.45374, 1511481003.12783, 1511481603.34055, 1511482202.62777, 
1511482803.66405, 1511483402.83378, 1511484003.46605), tzone = "", class = c("POSIXct", 
"POSIXt")), bike_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), address = c("Waters Rd", 
"Waters Rd", "Washington Ave", "Washington Ave", "Washington Ave", 
"Shady Lane", "Comstock Ave", "Comstock Ave", "Comstock Ave", 
"Comstock Ave", "Scranton Drive", "Scranton Drive", "Scranton Drive", 
"Shady Lane", "Scranton Drive", "Shady Lane", "Shady Lane", "Shady Lane", 
"Shady Lane", "Shady Lane", "Lovers Lane", "Mulholland Drive", 
"Mulholland Drive", "Mulholland Drive", "Mulholland Drive")), .Names = c("time", 
"bike_id", "address"), class = c("data.table", "data.frame"), row.names = c(NA, 
-25L), .internal.selfref = <pointer: 0x10300d178>)

Edit:

A special case where the code in the answer below produces an incorrect result:

> dput(dat)
structure(list(bike_id = c(1, 1, 1, 1, 1, 1), lon = c(-76.968, 
-76.968, -76.968, -72.141, -72.141, -72.141), lat = c(38.924, 
38.924, 38.924, -39.219, -39.219, -39.219), time = structure(c(1511383534.49273, 
1511383816.52327, 1511384403.97359, 1511385003.20305, 1511385602.50507, 
1511299803.02598), tzone = "", class = c("POSIXct", "POSIXt"))), .Names = c("bike_id", 
"lon", "lat", "time"), row.names = c(NA, -6L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x10300d178>)

> dat
   bike_id     lon     lat                time
1:       1 -76.968  38.924 2017-11-22 15:45:34
2:       1 -76.968  38.924 2017-11-22 15:50:16
3:       1 -76.968  38.924 2017-11-22 16:00:03
4:       1 -72.141 -39.219 2017-11-22 16:10:03
5:       1 -72.141 -39.219 2017-11-22 16:20:02
6:       1 -72.141 -39.219 2017-11-21 16:30:03

> dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(lat, lon))][, grp := NULL][]

Produces:

   bike_id       date n
1:       1 2017-11-22 3
2:       1 2017-11-22 3

Expected:

   bike_id       date n
1:       1 2017-11-22 3
2:       1 2017-11-22 2
3:       1 2017-11-21 1

akr*_*run 6

We can use rleid from data.table

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]
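To make the relationship between rle() and rleid() concrete, here is a minimal sketch on a hypothetical toy vector (not the question's data). It assumes only base R and data.table: rle() summarises each run, while rleid() tags every element with a run id, which is what makes it usable as a grouping variable.

```r
library(data.table)

# Hypothetical toy vector, used only for illustration
x <- c("a", "a", "b", "b", "b", "a")

# rle() returns one entry per run: its length and its value
r <- rle(x)

# rleid() returns one id per element; the id increases whenever the value changes
ids <- rleid(x)

# Counting rows per run id recovers the run lengths reported by rle()
run_len <- data.table(x)[, .N, by = .(grp = rleid(x))]$N

print(r$lengths)
print(ids)
print(run_len)
```

Because the ids are computed over the whole column, a run is only split where the value changes; any other variable that should break a run has to be added to the grouping as well.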

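If there is more than one 'date' per group (as in the second dataset), the code above selects only the first 'date' ([1]). If we want every 'date' in the group, use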

# keyby groups and also sorts the result by the grouping columns
dat[, .(date = unique(as.Date(time)), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp       date n
#1:       1   1 2017-11-22 3
#2:       1   2 2017-11-22 3
#3:       1   2 2017-11-21 3

However, this gives multiple rows per group. If we need only one row per group, create a list column (which preserves the Date class)

# wrapping in list() keeps all dates of a group in a single row
dat[, .(date = list(unique(as.Date(time))), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp                  date n
#1:       1   1            2017-11-22 3
#2:       1   2 2017-11-22,2017-11-21 3

Or paste the unique elements together

Update

Based on the update in the OP's post with the expected output (from the second dataset), we also need to use 'date' as a grouping variable
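A minimal sketch of the paste() variant, since the answer does not spell it out. The table dat2 is a hypothetical four-row stand-in for the second dataset, and the times are pinned to UTC (an assumption) so that as.Date() gives the same dates on any machine.

```r
library(data.table)

# Hypothetical stand-in for the second dataset; tz = "UTC" is an assumption
dat2 <- data.table(
  bike_id = 1,
  lon = c(-76.968, -76.968, -72.141, -72.141),
  lat = c(38.924, 38.924, -39.219, -39.219),
  time = as.POSIXct(
    c("2017-11-22 15:45:34", "2017-11-22 15:50:16",
      "2017-11-22 16:10:03", "2017-11-21 16:30:03"),
    tz = "UTC"
  )
)

# Collapse the distinct dates of each run into a single character string
res <- dat2[, .(date = paste(unique(as.Date(time)), collapse = ", "), n = .N),
            by = .(bike_id, grp = rleid(lon, lat))]
print(res)
```

Unlike the list column, this gives a plain character column, so the Date class is lost.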

# keyby sorts by bike_id, date and grp, which is why 2017-11-21 comes first
dat[, .(n = .N), keyby = .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
#   bike_id       date n
#1:       1 2017-11-21 1
#2:       1 2017-11-22 3
#3:       1 2017-11-22 2
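As a possible alternative that stays closer to the wording of the question, rle() itself can be called inside j, because data.table recycles the grouping values across however many run lengths each group returns. This is only a sketch, not part of the original answer: dat3 is a hypothetical shortened version of the first dataset, the times are pinned to UTC, and the rows are assumed to be in time order within each group.

```r
library(data.table)

# Hypothetical shortened version of the first dataset; tz = "UTC" is an assumption
dat3 <- data.table(
  time = as.POSIXct(
    c("2017-11-22 15:45:34", "2017-11-22 15:50:16", "2017-11-22 16:00:03",
      "2017-11-22 16:30:02", "2017-11-22 16:40:03",
      "2017-11-23 18:10:03", "2017-11-23 18:20:03"),
    tz = "UTC"
  ),
  bike_id = c(1, 1, 1, 2, 2, 1, 1),
  address = c("Waters Rd", "Waters Rd", "Washington Ave",
              "Shady Lane", "Comstock Ave",
              "Shady Lane", "Shady Lane")
)

# Run lengths of address, computed separately within each (date, bike_id) group
out <- dat3[, .(n = rle(address)$lengths),
            by = .(date = as.Date(time), bike_id)]
print(out)
```

Here runs are computed inside each (date, bike_id) group, so a run can never span two bikes or two dates, which is the behaviour the expected output in the question asks for.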