Using the `rle` function with the `dplyr` `group_by` command to map a grouping variable

Bal*_*pan 2 r dplyr

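I have a data frame with three columns, containing information similar to the sample data given below. I now want to extract search patterns based on the information in column `a`.

With the help of a couple of developers (@thelatemail and @David T), I was able to identify the patterns with the `rle` function; see the earlier question "Using rle function to identify patterns". I would now like to go a step further and add grouping information to the extracted patterns. I tried the `dplyr` `do` function (see the code below), but it does not work.

Sample data and the desired output are also provided for reference.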

##mycode that produces error - needs to be fixed
test <- data%>%
  group_by(b, c)%>%
  do(.,  data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)

## desired output

    c      b         from  to    fromCount toCount
                    <chr> <chr>     <int>   <int>
 1 A01 experiment    a     b             1       3
 2 A02 experiment    a     c             1       1
 3 A02 experiment    c     a             1       1
 4 A02 experiment    a     b             1       1
 5 A03 control       d     e             3       1
 6 A04 control       f     e             2       2

Compared with the previous post, the information is condensed, because column `a` is now processed by group.

akr*_*run 5

We can use `rleid` from `data.table`:

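library(data.table)  # provides rleid()
library(dplyr)
data %>% 
  # one group per run of identical `a` values within each (b, c)
  group_by(b, c, grp = rleid(a)) %>%
  summarise(from = first(a), fromCount = n()) %>% 
  # summarise() peels off `grp`, so lead() looks at the next run within the same (b, c)
  mutate(to = lead(from), toCount = lead(fromCount)) %>%
  ungroup %>%
  select(-grp) %>% 
  # the last run of each group has no successor
  filter(!is.na(to)) %>%
  arrange(c)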
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
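
For readers who have not met `rleid` before, the sketch below reproduces its run numbering in base R for the eight `experiment` rows of column `a`. The `cumsum` expression and the names `x`, `starts` and `run_id` are illustrative stand-ins, not how `data.table` implements the function.

```r
# Column `a` for the eight `experiment` rows of the sample data
x <- c("a", "b", "b", "b", "a", "c", "a", "b")

# TRUE wherever a value differs from its predecessor
# (the first value always starts a run)
starts <- c(TRUE, x[-1] != x[-length(x)])

# Running count of run starts: consecutive equal values share one id
run_id <- cumsum(starts)
run_id
```

Because `grp` is combined with `b` and `c` inside `group_by`, it does not matter that the run ids are numbered across the whole column instead of restarting within each group.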

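Alternatively, use `rle`: after grouping by 'b' and 'c', `summarise` with `rle` to create a `list` column, then extract the 'values' and 'lengths' from that column within the same `summarise`. Create 'to' and 'toCount' as the `lead` of the 'from' and 'fromCount' columns, `filter` out the `NA` elements, and `arrange` by the 'c' column.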

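data %>% 
    group_by(b, c) %>%
    # `from`/`fromCount` return several rows per group; dplyr 1.1.0
    # deprecates this use of summarise() in favour of reframe()
    summarise(rl = list(rle(a)), 
              from = rl[[1]]$values, 
              fromCount = rl[[1]]$lengths) %>% 
    # still grouped by (b, c), so lead() stays within each group
    mutate(to = lead(from), 
           toCount = lead(fromCount)) %>%
    ungroup %>% 
    select(-rl) %>% 
    filter(!is.na(to)) %>% 
    arrange(c)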
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
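
To make the `rle` step concrete, here is a minimal base-R sketch for a single group (A01 from the sample data). It shows what `rle` returns and how consecutive runs pair up. Dropping the last and first elements is used here as a stand-in for `lead` followed by the `NA` filter; the names `r`, `n` and `pairs` are illustrative only.

```r
# Column `a` for group A01 of the sample data
r <- rle(c("a", "b", "b", "b"))
r$values   # the value of each run
r$lengths  # how many times each value repeats

# Pair every run with its successor: drop the last element for
# "from" and the first element for "to"
n <- length(r$values)
pairs <- data.frame(
  from      = r$values[-n],
  to        = r$values[-1],
  fromCount = r$lengths[-n],
  toCount   = r$lengths[-1]
)
pairs
```

For A01 this gives a single row (`a` to `b`, counts 1 and 3), which matches the first row of the output above.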

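We could also loop over the `rle` `list` column ('rl') with `map`, extract the components, and take the `lead` of the values and lengths inside a `tibble`. Then use `unnest_wider` to create the columns, `unnest` the `list` structure, `filter` out the `NA` elements, and `arrange`.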

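library(tidyr)
library(purrr)
data %>% 
     group_by(b, c) %>%
     # one rle object per (b, c) group, stored in a list column
     summarise(rl = list(rle(a))) %>%
     ungroup %>%
     # for each rle object, build a tibble of runs and their successors
     mutate(out = map(rl, 
          ~ tibble(from = .x$values,
                   fromCount = .x$lengths,
                   to = lead(from), 
                   toCount = lead(fromCount)))) %>%
     # spread the tibble into list columns, then expand to one row per run
     unnest_wider(c(out)) %>% 
     unnest(from:toCount) %>%
     filter(!is.na(to)) %>% 
     arrange(c) %>% 
     select(-rl)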
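
As a cross-check on the pipelines above, the same table can be sketched in base R alone. This is a swapped-in technique (`split` / `lapply` / `rbind`), not part of the answer's `dplyr` approach, and the helper name `transitions` is hypothetical.

```r
# Sample data from the question
a <- c("a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c, b, a, stringsAsFactors = FALSE)

# Hypothetical helper: one row per transition between consecutive runs of `a`
transitions <- function(d) {
  r <- rle(d$a)
  n <- length(r$values)
  data.frame(
    c         = rep(d$c[1], n - 1),
    b         = rep(d$b[1], n - 1),
    from      = r$values[-n],   # every run except the last
    to        = r$values[-1],   # every run except the first
    fromCount = r$lengths[-n],
    toCount   = r$lengths[-1],
    stringsAsFactors = FALSE
  )
}

# Split by (b, c), dropping combinations that do not occur, then stack the pieces
pieces <- lapply(split(data, list(data$b, data$c), drop = TRUE), transitions)
out <- do.call(rbind, pieces)
rownames(out) <- NULL
out
```

A group that consists of a single run contributes zero rows, which mirrors the `filter(!is.na(to))` step in the pipelines.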