Need help speeding up a dplyr aggregation

bde*_*caf 6 r dplyr

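tl;dr: I have an aggregation problem that I have not seen covered in the documentation. I managed to get it working, but it is too slow for the intended application. The data I usually work with has about 500 rows (my gut tells me that is not much for dplyr), and according to system.time it runs for about 4 seconds. My dilemma is that I want to run it repeatedly inside an optimization, so the run time is currently a concern.

Do you see anything where I could shave off some time?

If needed, I can also send some of the data I use.

Algorithm

I have a dataset: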

sample_dataset <- data_frame( starts = c(1000, 1008, 1017, 2000, 2020, 3000),
                          ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
                          v = list(rep(1,10), rep(2,8),rep(3,4), 
                                   rep(4,16), rep(5,11), rep(6,11)) )

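So each row encodes a signal together with a start and an end index. I want to aggregate all rows that are less than closeness (e.g. 10) apart into a single row. In case it matters, starts is ordered.

The output should be: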

structure(list(inds = 1:3, starts = c(1000, 2000, 3000), ends = c(1020,
2030, 3010), v = list(c(1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2, 2, 2,
2, 2, 2, 0, 3, 3, 3, 3), c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), c(6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .Names = c("inds", "starts", "ends",
"v"))

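So the first three rows of the original dataset are aggregated, rows 4 and 5 are aggregated, and row 6 is unchanged. Where rows overlap, the values should be added; gaps are filled with zeros. The updated start is the first start, and the updated end should be the last end (I suppose I should change that to the maximum). That said, given how the data are generated, the ends should be sorted as well. A block that is completely enclosed by another block should not occur.

I achieved this with the following code: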

library(dplyr)

join_lines <- function(dfi) {
  if (nrow(dfi)==1) return(select(dfi,starts,ends, v))
  else 
    with(dfi,{ 
             start <- starts[[1]]
             end <- ends[[length(ends)]]
             vals <- numeric(end-start+1)
             add_val <- function(ddf)
               with(ddf,{ 
                      vals[(starts-start+1) : (ends-start+1)] <<- 
                        vals[(starts-start+1) : (ends-start+1)] + v })
             dfi %>% rowwise() %>% do(tmp=add_val(.))
             data_frame(starts=start, ends=end, v=list(vals))})
}

simplify_semisparse <- function(aframe, closeness = 10){
  aframe %>% 
    mutate( join_pre = lag(ends, default=0)+closeness >= (starts),
           inds = cumsum(!join_pre)
           ) %>%
  group_by(inds) %>% do(join_lines(.)) %>% ungroup()
}    

res <- simplify_semisparse(sample_dataset)

dput(res) # see above
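The same merge can also be sketched in base R without rowwise() and do(), which may reduce per-row overhead. This is an illustration only and has not been benchmarked against the production data. The names merge_group, simplify_base and sample_df are hypothetical, and the sample data is rebuilt here as a plain data.frame with a list column. As in the code above, rows are joined when the gap to the previous end is at most closeness.

```r
# Sample data as a base data.frame with a list column (same values as sample_dataset).
sample_df <- data.frame(starts = c(1000, 1008, 1017, 2000, 2020, 3000),
                        ends   = c(1009, 1015, 1020, 2015, 2030, 3010))
sample_df$v <- list(rep(1, 10), rep(2, 8), rep(3, 4),
                    rep(4, 16), rep(5, 11), rep(6, 11))

# Hypothetical helper: merge the rows of one group into a single dense vector.
merge_group <- function(starts, ends, v) {
  start <- starts[1]
  end   <- max(ends)                 # use the maximum rather than the last end
  vals  <- numeric(end - start + 1)  # zeros, so gaps stay 0
  for (i in seq_along(starts)) {
    idx <- (starts[i]:ends[i]) - start + 1
    vals[idx] <- vals[idx] + v[[i]]  # overlapping positions are summed
  }
  list(starts = start, ends = end, v = vals)
}

# Hypothetical wrapper: assign group ids, then merge each group.
simplify_base <- function(df, closeness = 10) {
  gap    <- c(Inf, df$starts[-1] - df$ends[-nrow(df)])
  inds   <- cumsum(gap > closeness)  # a new group starts after a large gap
  groups <- split(seq_len(nrow(df)), inds)
  merged <- lapply(groups, function(rows)
    merge_group(df$starts[rows], df$ends[rows], df$v[rows]))
  out <- data.frame(inds   = as.integer(names(groups)),
                    starts = unname(sapply(merged, `[[`, "starts")),
                    ends   = unname(sapply(merged, `[[`, "ends")))
  out$v <- unname(lapply(merged, `[[`, "v"))
  out
}

res_base <- simplify_base(sample_df)
```

On the sample data, res_base$v[[1]] has 21 elements covering positions 1000 to 1020, with the two overlapping positions summed to 3 and the single gap position left at 0.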

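Background

The data I am working with come from mass spectrometry. What makes them special is that each vector has about 500,000 entries, fewer than 10% of which are non-zero; a typical spectrum has about 500 such dense blocks. I need fast interpolation within such a spectrum, and my idea is to use approx on the "dense" regions.

Comparing the suggestions

I had a chance to compare your suggestions.

@matt-jewett's solution produced results that did not match my expected result, so I ruled it out.

@jeremycg's solution is closest to my original approach, but it does not produce exactly the same result either.

What matters most to me is run time, which I compared using production data. My original solution took 2.165 seconds. @tjeremy's suggestion took 0.532 seconds, and @uwe-block's took 0.012 seconds.

Wow, I need to learn data.table.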
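As a rough illustration of interpolating only inside the dense blocks, the sketch below calls stats::approx per block and returns 0 everywhere else. The helper name interp_sparse and the toy values are invented for this example, and it assumes that starts is sorted, that blocks do not overlap, and that every block has at least two points (approx needs two).

```r
# Toy merged blocks (values invented for illustration).
starts <- c(1000, 2000)
ends   <- c(1004, 2002)
v      <- list(c(0, 2, 4, 2, 0), c(1, 3, 1))

# Hypothetical helper: linear interpolation at positions `xout`,
# returning 0 for positions that fall outside every block.
interp_sparse <- function(xout, starts, ends, v) {
  out <- numeric(length(xout))
  # Index of the last block that starts at or before each query point (0 if none).
  b <- findInterval(xout, starts)
  inside <- b > 0
  inside[inside] <- xout[inside] <= ends[b[inside]]
  for (k in unique(b[inside])) {
    sel <- inside & b == k
    out[sel] <- approx(x = seq(starts[k], ends[k]), y = v[[k]],
                       xout = xout[sel])$y
  }
  out
}

interp <- interp_sparse(c(1001.5, 1500, 2000.5), starts, ends, v)
interp
# [1] 3 0 2
```

Because findInterval locates the candidate block for all query points in one vectorized call, the loop runs once per block that is actually hit rather than once per query point.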

jer*_*ycg 5

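Here is how I would do it. Using a list column for v is not best practice (in my opinion), so I used tidyr to unnest it into a longer data frame. I also left out your zeros; you could add them back with something like a left join or by indexing: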

library(tidyr)
sample_dataset %>%
 mutate(grouper = cumsum(c(0, na.omit(starts - lag(starts)))>20), id = row_number()) %>% #add a 'grouping' based on your closeness (20 here) and an id for later
 unnest(v) %>% #unnest v into lines - each v now has a line
 group_by(id) %>% #group by line
 mutate(count = row_number()+starts) %>% #get a 'location' per line
 group_by(grouper, count) %>% #group by the 'location' and group
 summarise(starts = starts[1], ends = ends[n()], v = sum(v)) #sum the v

This gives:

Source: local data frame [58 x 5]
Groups: grouper [?]

   grouper count starts  ends     v
     <int> <dbl>  <dbl> <dbl> <dbl>
1        0  1001   1000  1009     1
2        0  1002   1000  1009     1
3        0  1003   1000  1009     1
4        0  1004   1000  1009     1
5        0  1005   1000  1009     1
6        0  1006   1000  1009     1
7        0  1007   1000  1009     1
8        0  1008   1000  1009     1
9        0  1009   1000  1015     3
10       0  1010   1000  1015     3
# ... with 48 more rows

Then, if you really want to, you can fill the missing values with 0 (out is the output from above):

filled = out %>% group_by(grouper) %>% do(data.frame(count = seq(from = .$starts[1], to = tail(.$ends,1))))

filled = filled %>% left_join(out, by = c('grouper', 'count'))
filled$v[is.na(filled$v)] = 0

Source: local data frame [63 x 5]
Groups: grouper [?]

   grouper count starts  ends     v
     <int> <dbl>  <dbl> <dbl> <dbl>
1        0  1000     NA    NA     0
2        0  1001   1000  1009     1
3        0  1002   1000  1009     1
4        0  1003   1000  1009     1
5        0  1004   1000  1009     1
6        0  1005   1000  1009     1
7        0  1006   1000  1009     1
8        0  1007   1000  1009     1
9        0  1008   1000  1009     1
10       0  1009   1000  1015     3
# ... with 53 more rows

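  • See also `tidyr::complete` to avoid the join. Something like `%>% complete(count = full_seq(count, 1), fill = list(v = 0))` should work. (A join is still done internally.) (2 upvotes)
  • `data.frame(g = c('a','a','b','b'), v = c(1,3,4,6)) %>% group_by(g) %>% complete(v = full_seq(v, 1))` seems to work fine (i.e. the range is completed within each group). (2 upvotes)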

Uwe*_*Uwe 4

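Although the OP asked to speed up dplyr code, I would like to suggest a data.table solution for performance reasons. In addition, none of the other answers posted so far fully meets the OP's requirements, namely

  • keep the structure of sample_dataset with starts, ends, and the list of values v
  • aggregate all rows that are less than closeness (e.g. 10) apart into a single row

The code below tries to meet all requirements: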

library(data.table)   # CRAN version 1.10.4 used
# define threshold: closeness as defined by OP, max_gap used in code 
closeness <- 10L
max_gap <- closeness - 1L
# coerce to data.table, and key, i.e., sort by starts and ends
DT <- data.table(sample_dataset, key = c("starts", "ends"))
# compute gaps between ends and starts of next row
# identify rows which belong together: inds is advanced if gap is greater than threshold
DT[, gap := starts - shift(ends, fill = -Inf)][, inds := cumsum(gap > max_gap)][]
# close gaps but only within groups
DT0 <- DT[between(gap, 2L, max_gap), .(starts = starts - (gap - 1L), ends = starts - 1L, 
                                       v = Vectorize(rep.int)(0L, gap - 1L), gap, inds)]
# bind rowwise (union in SQL), setkey on result to maintain sort order, 
# remove column gap as no longer needed
DT2 <- setkey(rbind(DT, DT0), starts, ends)[, gap := NULL][]
# aggregate groupwise, pick min/max, combine lists
result <- DT2[, .(starts = min(starts), ends = max(ends), v = list(Reduce(c, v))), by = inds]
# alternative code: pick first/last
result <- DT2[, .(starts = first(starts), ends = last(ends), v = list(Reduce(c, v))), by = inds]
result

which yields

   inds starts ends            v
1:    1   1000 1020 1,1,1,1,1,1,
2:    2   2000 2030 4,4,4,4,4,4,
3:    3   3000 3010 6,6,6,6,6,6,

result$v
[[1]]
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 0 3 3 3 3

[[2]]
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5

[[3]]
 [1] 6 6 6 6 6 6 6 6 6 6 6

It can be verified that the number of elements in the v vectors is the same, apart from the extra zeros added for gaps within groups:

# test that all v values are included
# original
sum(lengths(sample_dataset$v))
#[1] 60
# result with additional zeros removed
sum(sapply(result$v, function(x) sum(x > 0)))
#[1] 60
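A stricter check one might add is that each merged vector has exactly ends - starts + 1 elements. The sketch below applies it to the expected output from the question and to the first vector as printed above; the helper name span_matches is hypothetical and the vectors are retyped from the outputs shown in this thread.

```r
# Hypothetical helper: does each v have one element per index in [starts, ends]?
span_matches <- function(starts, ends, v) {
  lengths(v) == ends - starts + 1
}

# Expected output from the question.
expected_v <- list(c(rep(1, 8), rep(3, 2), rep(2, 6), 0, rep(3, 4)),
                   c(rep(4, 16), rep(0, 4), rep(5, 11)),
                   rep(6, 11))
ok_expected <- span_matches(c(1000, 2000, 3000), c(1020, 2030, 3010), expected_v)

# First vector as printed in the data.table result above.
printed_v1 <- c(rep(1, 10), rep(2, 8), 0, rep(3, 4))
ok_printed <- span_matches(1000, 1020, list(printed_v1))
```

Here ok_expected is TRUE for all three groups, while ok_printed is FALSE: the printed vector has 23 elements for a span of 21, because the two overlapping positions of rows 1 and 2 are concatenated rather than summed.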

I did not provide benchmarks because the sample dataset is too small.

Data

sample_dataset <- dplyr::data_frame( starts = c(1000, 1008, 1017, 2000, 2020, 3000),
                                     ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
                                     v = list(rep(1,10), rep(2,8),rep(3,4), 
                                              rep(4,16), rep(5,11), rep(6,11)) )