How to calculate the overlap between members of dplyr::group_by groups

Question

I have the following data:

library(tidyverse)
df <- tibble::tribble(
  ~gene, ~celltype,
  "a",   "cel1_1",  
  "b",   "cel1_1",  
  "c",   "cel1_1",  
  "a",   "cell_2",  
  "b",   "cell_2",  
  "c",   "cell_3",  
  "d",   "cell_3"
)

df %>% group_by(celltype)
#> Source: local data frame [7 x 2]
#> Groups: celltype [3]
#> 
#> # A tibble: 7 x 2
#>    gene celltype
#>   <chr>    <chr>
#> 1     a   cel1_1
#> 2     b   cel1_1
#> 3     c   cel1_1
#> 4     a   cell_2
#> 5     b   cell_2
#> 6     c   cell_3
#> 7     d   cell_3

The genes can be grouped by cell type as follows:

 cell1   a,b,c
 cell2   a,b
 cell3   c,d

What I want to do is count the genes shared by every pair of cell types, which should give this table:

          cell1    cell2     cell3
 cell1    3          2          1 
 cell2    2          2          0
 cell3    1          0          2

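How can I do this?

Update

Finally, I want to express each overlap as a fraction, dividing by the larger of the two group sizes in the pair: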

          cell1          cell2          cell3
 cell1    1.00 (3/3)     0.67 (2/3)     0.33 (1/3)
 cell2    0.67 (2/3)     1.00           0
 cell3    0.33 (1/3)     0              1.00

I tried this, but it does not give what I want:

> tmp <- crossprod(table(df))
> tmp/max(tmp)
        celltype
celltype    cel1_1    cell_2    cell_3
  cel1_1 1.0000000 0.6666667 0.3333333
  cell_2 0.6666667 0.6666667 0.0000000
  cell_3 0.3333333 0.0000000 0.6666667

In the desired result, the values on the diagonal should always be 1.00.

Answer 1

akr*_*run 5

We can use table with crossprod:

crossprod(table(df))
#         celltype
#celltype cel1_1 cell_2 cell_3
#  cel1_1      3      2      1
#  cell_2      2      2      0
#  cell_3      1      0      2

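Why this works: table(df) is a gene-by-celltype incidence matrix of 0s and 1s, and crossprod(m) computes t(m) %*% m, so entry (i, j) counts the genes present in both cell type i and cell type j. The following is a minimal base R sketch that checks this against explicit set intersections. It assumes each gene appears at most once per cell type, and the names genes_by_cell and overlap are illustrative, not from the original answer.

```r
# Same data as in the question, as a plain data.frame
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1",
               "cell_2", "cell_2", "cell_3", "cell_3")
)

# One character vector of genes per cell type
genes_by_cell <- split(df$gene, df$celltype)

# Pairwise intersection sizes, computed directly
overlap <- sapply(genes_by_cell, function(x)
  sapply(genes_by_cell, function(y) length(intersect(x, y))))

# The incidence-matrix product from the answer
m <- unclass(crossprod(table(df)))

# Compare the two results, ignoring dimnames
same <- isTRUE(all.equal(unname(overlap), unname(m)))
print(same)
```

If a gene could appear more than once within a cell type, the table would hold counts above 1 and the two results would differ, so it may be worth deduplicating first (for example with unique(df)).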

Another option is the tidyverse:

library(tidyverse)
count(df, gene, celltype) %>% 
       spread(celltype, n, fill = 0) %>%
       select(-gene) %>% 
       as.matrix %>% 
       crossprod
#        cel1_1 cell_2 cell_3
#cel1_1      3      2      1
#cell_2      2      2      0
#cell_3      1      0      2

Or with data.table:

library(data.table)
crossprod(as.matrix(dcast(setDT(df), gene~celltype, length)[,-1]))

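The update in the question (dividing each overlap by the larger of the two group sizes) is not covered by the code above. The attempt tmp/max(tmp) divides every cell by one global maximum, whereas the desired table divides cell (i, j) by the larger of the two diagonal entries. One possible sketch, offered as an assumption about what is wanted rather than as part of the original answer, builds that matrix of denominators with outer() and pmax. The names sizes, denom and res are illustrative.

```r
# Same data as in the question, as a plain data.frame
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1",
               "cell_2", "cell_2", "cell_3", "cell_3")
)

# Overlap counts, as in the answer
m <- crossprod(table(df))

# The diagonal holds the number of genes in each cell type
sizes <- diag(m)

# For every pair (i, j), the larger of the two group sizes
denom <- outer(sizes, sizes, pmax)

# Element-wise division: the diagonal becomes 1 by construction
res <- m / denom
print(round(res, 2))
```

Replacing pmax with pmin would instead divide by the smaller group, which measures how much of the smaller set is contained in the larger one.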
