Perl (or R, or SQL): count how often a string occurs across columns

Ste*_*ner 9 mysql string perl r

I have a text file that looks like this:

gene1   gene2   gene3
a       d       c
b       e       d
c       f       g
d       g       
        h
        i

(Each column is a human gene, and each contains a variable number of proteins (strings, shown here as letters) that can bind to that gene.)

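What I want to do is count how many columns each string appears in, and output that number together with all the matching column headers, like this: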

a   1   gene1
b   1   gene1
c   2   gene1 gene3
d   3   gene1 gene2 gene3
e   1   gene2
f   1   gene2
g   2   gene2 gene3
h   1   gene2
i   1   gene2

I've been trying to figure out how to do this in Perl and in R, but so far without success. Thanks for your help.

Cha*_*ase 8

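This solution feels like a bit of a hack, but it gives the desired output. It relies on both the `plyr` and `reshape` packages, though I'm sure you can find base R alternatives. The trick is that `melt` lets us flatten the data into long format, which makes for easy(ish) manipulation from that point on.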

library(reshape)
library(plyr)

#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
                  gene2 = letters[4:9],
                  gene3 = c("c", "d", "g", NA, NA, NA)
                  )

#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)

#Tabulate counts
counts <- as.data.frame(table(dat.m$value))

#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))

#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")

This gives:

> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
  Var1 Freq                V1
1    a    1             gene1
2    b    1             gene1
3    c    2       gene1 gene3
4    d    3 gene1 gene2 gene3
....

  • You can simplify this into a single `ddply` call using `ddply(dat.m, .(value), summarize, Freq = length(variable), V1 = paste(variable, collapse = " "))`. (2 upvotes)

yst*_*sth 6

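In Perl, assuming the proteins within each column contain no duplicates that need to be removed. (If they do, a hash of hashes should be used instead.)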

use strict;
use warnings;

my $header = <>;
my %column_genes;
while ($header =~ /(\S+)/g) {
    $column_genes{$-[1]} = "$1";
}

my %proteins;
while (my $line = <>) {
    while ($line =~ /(\S+)/g) {
        if (exists $column_genes{$-[1]}) {
            push @{ $proteins{$1} }, $column_genes{$-[1]};
        }
        else {
            warn "line $. column $-[1] unexpected protein $1 ignored\n";
        }
    }
}

for my $protein (sort keys %proteins) {
    print join("\t",
        $protein,
        scalar @{ $proteins{$protein} },
        join(' ', sort @{ $proteins{$protein} } )
    ), "\n";
}

Reads from stdin, writes to stdout.
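The hash-of-hashes variant mentioned above could look roughly like the sketch below. It is not part of the original answer: `genes_by_protein` and `row` are hypothetical helper names, and the sample rows are made up, with `a` repeated in `gene1` and `d` repeated in `gene2` on purpose. Like the script above, it assumes the columns are padded with spaces so that every protein starts at the same offset as its header.

```perl
use strict;
use warnings;

# Hypothetical helper: build one fixed-width row (8 characters per column)
# so that every cell starts at the same offset as its header.
sub row { return sprintf "%-8s%-8s%-8s", @_ }

# Hypothetical helper: takes the table lines (header first) and returns
# a hash of hashes of the form protein => { gene => 1 }.
sub genes_by_protein {
    my ($header, @lines) = @_;

    # Map the starting offset of each header word to its gene name.
    my %column_genes;
    while ($header =~ /(\S+)/g) {
        $column_genes{ $-[1] } = $1;
    }

    my %proteins;
    for my $line (@lines) {
        while ($line =~ /(\S+)/g) {
            my $gene = $column_genes{ $-[1] };
            next unless defined $gene;   # token not aligned with any header
            $proteins{$1}{$gene} = 1;    # a repeat just overwrites the same key
        }
    }
    return \%proteins;
}

# Made-up sample data with deliberate repeats inside gene1 and gene2.
my $proteins = genes_by_protein(
    row(qw(gene1 gene2 gene3)),
    row('a', 'd', 'c'),
    row('a', 'e', 'd'),
    row('',  'd', ''),
);

for my $protein (sort keys %$proteins) {
    my @genes = sort keys %{ $proteins->{$protein} };
    print join("\t", $protein, scalar @genes, "@genes"), "\n";
}
```

With these rows, `d` is reported for two genes (`gene2` and `gene3`) rather than three, because the inner hash keeps each gene name only once per protein.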

  • `@-` is a special array that reports the offsets at which regex captures begin (`$-[1]` is where `$1` starts, `$-[2]` is where `$2` starts, and so on). (2 upvotes)
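To see what `@-` reports, here is a small sketch that lists each whitespace-delimited token of a line together with its starting offset. `token_offsets` is a hypothetical name used only for this illustration, and the header string is assumed to be space-padded as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, token] pairs for every
# whitespace-delimited token in a line.
sub token_offsets {
    my ($line) = @_;
    my @pairs;
    while ($line =~ /(\S+)/g) {
        push @pairs, [ $-[1], $1 ];   # $-[1] is the offset where $1 begins
    }
    return @pairs;
}

# Build the header with explicit padding so the offsets are unambiguous.
my $header = sprintf "%-8s%-8s%s", qw(gene1 gene2 gene3);

for my $pair (token_offsets($header)) {
    printf "%2d %s\n", @$pair;
}
# prints:
#  0 gene1
#  8 gene2
# 16 gene3
```

This is why the answer's script can match a protein to its gene by offset alone. It also means a tab-separated file would not line up, since a tab counts as a single character.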

Ram*_*ath 5

A one-liner (or rather, a three-liner):

ddply(na.omit(melt(dat, measure.vars = 1:3)), .(value), summarize,
     len = length(variable), 
     var = paste(variable, collapse = " "))