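I have a data frame with three columns, similar to the one given below. I want to extract transition patterns from the values in column a.
With help from a few developers (@thelatemail and @David T), I was able to identify the patterns using the rle function; please see here - Use rle function to identify patterns. Now I would like to go one step further and add the grouping information to the extracted patterns. I tried the dplyr do function (see the code below), but it does not work.
Example data and the desired output are provided for reference.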
##mycode that produces error - needs to be fixed
test <- data %>%
  group_by(b, c) %>%
  do(., data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)
## desired output
  c     b          from  to    fromCount toCount
                   <chr> <chr>     <int>   <int>
1 A01   experiment a     b             1       3
2 A02   experiment a     c             1       1
3 A02   experiment c     a             1       1
4 A02   experiment a     b             1       1
5 A03   control    d     e             3       1
6 A04   control    f     e             2       2
Compared with the earlier post here, the information is condensed, because column a is now processed within groups.
We could use rleid from data.table:
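library(data.table)
library(dplyr)
data %>%
  # grp numbers each run of identical consecutive values in a
  group_by(b, c, grp = rleid(a)) %>%
  # one row per run; summarise() drops the last grouping level (grp)
  summarise(from = first(a), fromCount = n()) %>%
  # still grouped by b and c, so lead() picks the next run in the same group
  mutate(to = lead(from), toCount = lead(fromCount)) %>%
  ungroup() %>%
  select(-grp) %>%
  # the last run of each group has no successor
  filter(!is.na(to)) %>%
  arrange(c)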
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
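To make the grouping step less opaque, here is a minimal base R sketch of what a run id looks like for the example column a. The helper name run_id is hypothetical and only approximates what rleid does for a single vector; it is not data.table's implementation. Note that group_by(b, c, grp = rleid(a)) computes the ids over the whole column before grouping, which is still enough to separate the runs inside each b/c combination.

```r
# Hypothetical helper: number the runs of identical consecutive values,
# roughly what data.table::rleid() returns for a single vector.
run_id <- function(x) {
  r <- rle(as.character(x))
  # repeat each run's index as many times as the run is long
  rep(seq_along(r$lengths), times = r$lengths)
}

a <- c("a", "b", "b", "b", "a", "c", "a", "b",
       "d", "d", "d", "e", "f", "f", "e", "e")

ids <- run_id(a)
print(ids)
# 1 2 2 2 3 4 5 6 7 7 7 8 9 9 10 10
```

Each distinct id becomes one row after summarise, with n() giving the run length.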
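Or use rle: after grouping by 'b' and 'c', summarise with rle to create a list column, then extract the 'values' and 'lengths' from that column within the same summarise call, create 'to' and 'toCount' as the lead of the 'from' and 'fromCount' columns, filter out the NA elements and arrange the rows by column 'c'.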
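data %>%
  group_by(b, c) %>%
  # rl holds one rle object per group; from/fromCount expand it to one row per run
  # (dplyr >= 1.1.0 warns about multi-row summarise() results and suggests reframe())
  summarise(rl = list(rle(a)),
            from = rl[[1]]$values,
            fromCount = rl[[1]]$lengths) %>%
  # the result is still grouped by b and c, so lead() stays within a group
  mutate(to = lead(from),
         toCount = lead(fromCount)) %>%
  ungroup() %>%
  select(-rl) %>%
  filter(!is.na(to)) %>%
  arrange(c)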
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
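As a cross-check that does not depend on dplyr, the same rle idea can be sketched in base R. This is only an illustration under stated assumptions: the helper name transitions is hypothetical, and the split/lapply/rbind grouping is one possible way to apply it per b/c combination, not the approach used in the pipelines here.

```r
# Example data from the question (stringsAsFactors = FALSE so rle() gets characters)
a <- c("a", "b", "b", "b", "a", "c", "a", "b",
       "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4),
       rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c, b, a, stringsAsFactors = FALSE)

# Hypothetical helper: pair each run with the run that follows it
transitions <- function(x) {
  r <- rle(as.character(x))
  n <- length(r$values)
  keep_from <- seq_len(max(n - 1, 0))   # every run except the last
  keep_to   <- keep_from + 1            # every run except the first
  data.frame(from = r$values[keep_from], to = r$values[keep_to],
             fromCount = r$lengths[keep_from], toCount = r$lengths[keep_to],
             stringsAsFactors = FALSE)
}

# Apply the helper to each non-empty b/c combination and stack the pieces
pieces <- lapply(split(data, list(data$b, data$c), drop = TRUE), function(d) {
  tr <- transitions(d$a)
  data.frame(c = rep(d$c[1], nrow(tr)), b = rep(d$b[1], nrow(tr)), tr,
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
result <- result[order(result$c), ]
rownames(result) <- NULL
print(result)
```

A group that consists of a single run contributes zero rows, which matches the effect of filter(!is.na(to)).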
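We could also loop over the rle list column ('rl') with map, extract the components and take the lead of the values and lengths in a tibble, use unnest_wider to create the columns, unnest the list structure, filter out the NA elements and arrange.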
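library(tidyr)
library(purrr)
data %>%
  group_by(b, c) %>%
  # one rle object per group, stored in a list column
  summarise(rl = list(rle(a))) %>%
  ungroup() %>%
  # build the from/to pairs for each group inside its own tibble
  mutate(out = map(rl,
                   ~ tibble(from = .x$values,
                            fromCount = .x$lengths,
                            to = lead(from),
                            toCount = lead(fromCount)))) %>%
  # spread the tibble's columns out as list columns, then expand them to rows
  unnest_wider(out) %>%
  unnest(cols = from:toCount) %>%
  filter(!is.na(to)) %>%
  arrange(c) %>%
  select(-rl)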