pdu*_*ois 4 r dplyr broom tidyverse
I have two data frames:
src_tbl <- structure(list(Sample_name = c("S1", "S2", "S1", "S2", "S1",
"S2"), crt = c(0.079, 0.082, 0.079, 0.082, 0.079, 0.082), sr = c(0.592,
0.549, 0.592, 0.549, 0.592, 0.549), condition = c("x1", "x1",
"x2", "x2", "x3", "x3"), score = c("0.077", "0.075", "0.483",
"0.268", "0.555", "0.120")), row.names = c(NA, -6L), .Names = c("Sample_name",
"crt", "sr", "condition", "score"), class = c("tbl_df",
"tbl", "data.frame"))
src_tbl
#> Sample_name crt sr condition score
#> 1 S1 0.079 0.592 x1 0.077
#> 2 S2 0.082 0.549 x1 0.075
#> 3 S1 0.079 0.592 x2 0.483
#> 4 S2 0.082 0.549 x2 0.268
#> 5 S1 0.079 0.592 x3 0.555
#> 6 S2 0.082 0.549 x3 0.120
ref_tbl <- structure(list(Sample_name = c("P1", "P2", "P3", "P1", "P2",
"P3", "P1", "P2", "P3"), crt = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
sr = c(2, 2, 2, 2, 2, 2, 2, 2, 2), condition = c("r1", "r1",
"r1", "r2", "r2", "r2", "r3", "r3", "r3"), score = c("0.200",
"0.201", "0.199", "0.200", "0.202", "0.200", "0.200", "0.204",
"0.197")), row.names = c(NA, -9L), .Names = c("Sample_name",
"crt", "sr", "condition", "score"), class = c("tbl_df",
"tbl", "data.frame"))
ref_tbl
#> Sample_name crt sr condition score
#> 1 P1 1 2 r1 0.200
#> 2 P2 1 2 r1 0.201
#> 3 P3 1 2 r1 0.199
#> 4 P1 1 2 r2 0.200
#> 5 P2 1 2 r2 0.202
#> 6 P3 1 2 r2 0.200
#> 7 P1 1 2 r3 0.200
#> 8 P2 1 2 r3 0.204
#> 9 P3 1 2 r3 0.197
What I want to do is run an operation (ks.test()) on the score column of the two data frames, grouped by Sample_name. For example, the p-value of the KS test for S1 and P1 is:
# in src_tbl
s1 <- c(0.077,0.483,0.555)
#in ref_tbl
p1 <- c(0.200,0.200,0.200)
testout <- ks.test(s1,p1)
#> Warning in ks.test(s1, p1): cannot compute exact p-value with ties
broom::tidy(testout)
#> statistic p.value method alternative
#> 1 0.6666667 0.5175508 Two-sample Kolmogorov-Smirnov test two-sided
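If only the p-value is needed, it can be read straight off the object returned by ks.test() without going through broom. The following is a minimal sketch; the names res, d_stat and p_asym are made up for illustration. It passes exact = FALSE to ask for the asymptotic approximation explicitly, which should correspond to the value shown above, since R versions can differ in how they treat ties when exact is left unset.

```r
s1 <- c(0.077, 0.483, 0.555)  # S1 scores from src_tbl
p1 <- c(0.200, 0.200, 0.200)  # P1 scores from ref_tbl

# exact = FALSE requests the asymptotic p-value explicitly
res <- ks.test(s1, p1, exact = FALSE)

d_stat <- unname(res$statistic)  # the KS statistic D
p_asym <- res$p.value            # the two-sided p-value
```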
What I want is to run this test for every src sample against every ref sample, so that I end up with a table like this:
src ref p.value
S1 P1 0.5175508
S1 P2 0.6
S1 P3 0.6
S2 P1 0.5175508
S2 P2 0.6
S2 P3 0.6
How can I do this? Since the number of samples in ref_tbl can be large (P1, P2 ... P10k), a fast solution is preferred.
Here is a solution using the tidyverse. I first nest the scores within each source dataset:
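library(dplyr)  # provides %>%, mutate(), select(), left_join(), bind_cols()
ref_tbl <- ref_tbl %>%
  mutate(ref = as.factor(Sample_name),
         score_ref = as.numeric(score)) %>%  # score is stored as character
  select(ref, score_ref) %>%
  tidyr::nest(data = score_ref)  # tidyr >= 1.0 expects a named list column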
ref_tbl
# A tibble: 3 x 2
ref data
<fctr> <list>
1 P1 <tibble [3 x 1]>
2 P2 <tibble [3 x 1]>
3 P3 <tibble [3 x 1]>
src_tbl <- src_tbl %>%
  mutate(src = as.factor(Sample_name),
         score_src = as.numeric(score)) %>%  # score is stored as character
  select(src, score_src) %>%
  tidyr::nest(data = score_src)  # tidyr >= 1.0 expects a named list column
src_tbl
# A tibble: 2 x 2
src data
<fctr> <list>
1 S1 <tibble [3 x 1]>
2 S2 <tibble [3 x 1]>
Then I create a grid containing all combinations of sample names:
all_comb <- tibble::as_tibble(expand.grid(src = src_tbl$src, ref = ref_tbl$ref))  # as_data_frame() is deprecated
all_comb
# A tibble: 6 x 2
src ref
<fctr> <fctr>
1 S1 P1
2 S2 P1
3 S1 P2
4 S2 P2
5 S1 P3
6 S2 P3
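The row order of the grid above follows from how expand.grid() enumerates combinations: the first argument varies fastest. A small sketch with the sample names typed in by hand (the name grid is illustrative) makes this visible. Here stringsAsFactors = FALSE is an assumption made for convenience, so the columns are plain character vectors rather than the factors used in the answer.

```r
# The first argument (src) cycles fastest, the second (ref) slowest
grid <- expand.grid(src = c("S1", "S2"),
                    ref = c("P1", "P2", "P3"),
                    stringsAsFactors = FALSE)
grid
```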
Now we can join the nested data and bind the columns, so that each combination gets a list column holding both sets of scores.
all_comb <- all_comb %>%
  left_join(ref_tbl, by = "ref") %>%
  left_join(src_tbl, by = "src") %>%
  mutate(data = purrr::map2(data.x, data.y, bind_cols)) %>%
  select(-data.x, -data.y)
all_comb
# A tibble: 6 x 3
src ref data
<fctr> <fctr> <list>
1 S1 P1 <tibble [3 x 2]>
2 S2 P1 <tibble [3 x 2]>
3 S1 P2 <tibble [3 x 2]>
4 S2 P2 <tibble [3 x 2]>
5 S1 P3 <tibble [3 x 2]>
6 S2 P3 <tibble [3 x 2]>
Finally, I map ks.test over each dataset and use broom to extract the requested p.value.
final <- all_comb %>%
  mutate(ks = purrr::map(data, ~ks.test(.$score_ref, .$score_src)),
         tidied = purrr::map(ks, broom::tidy)) %>%
  tidyr::unnest(tidied) %>%
  select(src, ref, p.value)
Warning message: cannot compute exact p-value with ties
Warning message: cannot compute exact p-value with ties
final
# A tibble: 6 x 3
src ref p.value
<fctr> <fctr> <dbl>
1 S1 P1 0.5175508
2 S2 P1 0.5175508
3 S1 P2 0.6000000
4 S2 P2 0.6000000
5 S1 P3 0.6000000
6 S2 P3 0.6000000
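Because the question asks for speed with many reference samples, a plain base R variant may be worth comparing. The sketch below is not the method from the answer above and has not been benchmarked. It assumes the scores have already been converted to numeric, rebuilds the toy data by hand, and uses illustrative names (src_scores, ref_scores, pmat, long). It avoids the joins and nested tibbles by splitting each score column into a named list of vectors and filling a src-by-ref matrix of p-values.

```r
# Toy data mirroring the question, with score already numeric
src <- data.frame(
  Sample_name = rep(c("S1", "S2"), times = 3),
  score = c(0.077, 0.075, 0.483, 0.268, 0.555, 0.120)
)
ref <- data.frame(
  Sample_name = rep(c("P1", "P2", "P3"), times = 3),
  score = c(0.200, 0.201, 0.199, 0.200, 0.202, 0.200, 0.200, 0.204, 0.197)
)

# One numeric vector of scores per sample name
src_scores <- split(src$score, src$Sample_name)
ref_scores <- split(ref$score, ref$Sample_name)

# Rows are src samples, columns are ref samples;
# warnings about ties are suppressed here for brevity
pmat <- sapply(ref_scores, function(r) {
  sapply(src_scores, function(s) suppressWarnings(ks.test(s, r)$p.value))
})

# Reshape the matrix into the long src / ref / p.value layout
long <- data.frame(
  src = rownames(pmat)[row(pmat)],
  ref = colnames(pmat)[col(pmat)],
  p.value = as.vector(pmat)
)
long
```

For a very large number of reference samples the cost is still one ks.test() call per pair, so any gain would come only from skipping the intermediate data frames.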