Removing rows from data: overlapping time intervals?

jra*_*ara 6 python powershell perl r

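EDIT: I am now also looking for solutions to this problem in other programming languages.

Following another question I asked, I have a data set like this (dput output below for R users), which represents user computer sessions: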

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2     user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15    USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47

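There may be multiple concurrent (overlapping in time) sessions for the same username on the same machine. How can I remove those rows so that only one session is left for each overlap? The original data set has approx. 500,000 rows.

The expected output is (rows 2 and 15 are removed):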

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
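
Since solutions in other languages are welcome, here is a minimal Python sketch of the removal described above, using only the standard library. It assumes that "keep one session" means keeping the first row encountered in input order and dropping any later row that overlaps an already-kept row for the same username and machine, which is what the expected output suggests (row 1 is kept over row 2, row 14 over row 15). The function name `drop_overlapping`, the tuple layout, and the timestamp format are illustrative assumptions, not part of the original data pipeline.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format, as printed in the question


def drop_overlapping(rows):
    """Keep a row only if it does not overlap an already-kept row
    for the same (username, machine) pair. Rows are visited in input order."""
    kept = []
    kept_by_group = {}  # (username, machine) -> list of kept (start, end) pairs
    for username, machine, start, end in rows:
        s = datetime.strptime(start, FMT)
        e = datetime.strptime(end, FMT)
        spans = kept_by_group.setdefault((username, machine), [])
        # Two closed intervals overlap when each starts before the other ends.
        if any(s <= k_end and k_start <= e for k_start, k_end in spans):
            continue
        spans.append((s, e))
        kept.append((username, machine, start, end))
    return kept


# A few rows from the question (rows 1, 2, 3, 14 and 15).
rows = [
    ("user1", "D5599.domain.com", "2011-01-03 09:44:18", "2011-01-03 09:47:27"),
    ("user1", "D5599.domain.com", "2011-01-03 09:46:29", "2011-01-03 10:09:16"),
    ("user1", "D5599.domain.com", "2011-01-03 14:07:36", "2011-01-03 14:56:17"),
    ("USER2", "D5682.domain.com", "2011-02-14 15:03:00", "2011-02-14 15:07:43"),
    ("USER2", "D5682.domain.com", "2011-02-14 14:59:23", "2011-02-14 15:14:47"),
]

for row in drop_overlapping(rows):
    print(*row)
```

Each new row is compared against every kept row in its group, so this is fine when groups are small but may be slow if a single username/machine pair has many thousands of sessions. Sorting by start time and comparing only against the last kept row would be faster, but it can change which of the overlapping rows survives.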

Here is the data set:

structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")

G. *_*eck 4

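Try the intervals package:

library(intervals)

# d is the data frame from the question (the dput() output above)

# Merge the overlapping sessions within one username/machine group
f <- function(dd) with(dd, {
    # reduce() collapses overlapping intervals into their union
    r <- reduce(Intervals(cbind(start, end)))
    data.frame(username = username[1],
         machine = machine[1],
         # restore the POSIXct class that cbind() dropped
         start = structure(r[, 1], class = class(start)),
         end = structure(r[, 2], class = class(end)))
})

# Apply f to every username/machine combination and stack the results
do.call("rbind", by(d, d[1:2], f))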

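With the sample data, this reduces the 15 rows to the following 13 rows (by merging rows 1 and 2 and rows 14 and 15 of the original data frame). The timestamps are shifted relative to the question because the POSIXct values carry no fixed time zone (tzone = "") and are printed in the local time zone of the machine running the code: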

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2     user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3     user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4     user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5     user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6     user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7     user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8     user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9     USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10    USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11    USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12    USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13    USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47
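
For readers who want the same idea outside R, the following is a rough Python analogue of the merge performed above, not a port of the intervals package. It sorts the sessions by username, machine, and start time, then unions any session that starts before the current merged session ends. Note that, like the R code, it merges overlapping sessions into one longer session instead of deleting rows. The function name `merge_sessions`, the tuple layout, and the choice to treat touching intervals as overlapping are assumptions made for this sketch.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format, as printed in the question


def merge_sessions(rows):
    """Union overlapping (start, end) sessions per (username, machine)."""
    # Sorting the tuples orders by username, machine, start, then end.
    parsed = sorted(
        (u, m, datetime.strptime(s, FMT), datetime.strptime(e, FMT))
        for u, m, s, e in rows
    )
    merged = []
    for (u, m), group in groupby(parsed, key=lambda r: r[:2]):
        cur_start = cur_end = None
        for _, _, s, e in group:
            if cur_start is not None and s <= cur_end:
                # Overlaps (or touches) the current session: extend it.
                cur_end = max(cur_end, e)
            else:
                if cur_start is not None:
                    merged.append((u, m, cur_start, cur_end))
                cur_start, cur_end = s, e
        merged.append((u, m, cur_start, cur_end))
    return merged


# Rows 1, 2, 14 and 15 from the question.
sample = [
    ("user1", "D5599.domain.com", "2011-01-03 09:44:18", "2011-01-03 09:47:27"),
    ("user1", "D5599.domain.com", "2011-01-03 09:46:29", "2011-01-03 10:09:16"),
    ("USER2", "D5682.domain.com", "2011-02-14 15:03:00", "2011-02-14 15:07:43"),
    ("USER2", "D5682.domain.com", "2011-02-14 14:59:23", "2011-02-14 15:14:47"),
]

for u, m, s, e in merge_sessions(sample):
    print(u, m, s.strftime(FMT), e.strftime(FMT))
```

Because the work is a sort followed by a single pass, it should scale to a data set of about 500,000 rows without trouble. The output is ordered by the sorted keys, so uppercase usernames such as USER2 come before lowercase ones such as user1.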