jia*_*mao 7 python perl awk r data.table
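My data file has two numeric columns. I need to calculate the mean of the second column over intervals (e.g. 100) of the first column.
I can program this task in R, but my R code is really slow on a relatively large data file (millions of rows, with first-column values ranging from 1 to 33132539).
My R code is shown below. How can I tune it to run faster? Solutions in perl, python, awk or shell are also welcome.
Thanks in advance.
(1) My data file (tab-delimited, millions of rows)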
5380 30.07383\n
5390 30.87\n
5393 0.07383\n
5404 6\n
5428 30.07383\n
5437 1\n
5440 9\n
5443 30.07383\n
5459 6\n
5463 30.07383\n
5480 7\n
5521 30.07383\n
5538 0\n
5584 20\n
5673 30.07383\n
5720 30.07383\n
5841 3\n
5880 30.07383\n
5913 4\n
5958 30.07383\n
(2) What I want to get, here with interval = 100
intervals_of_first_columns, average_of_2nd column_by_the_interval
100, 0\n
200, 0\n
300, 20.34074\n
400, 14.90325\n
.....
(3) R code
chr1 <- 33132539 # set the limit for the interval
window <- 100 # set the size of interval
spe <- read.table("my_data_file", header=F) # read my data in
names(spe) <- c("pos", "rho") # name my data
interval.chr1 <- data.frame(pos=seq(0, chr1, window)) # setup intervals
meanrho.chr1 <- NULL # object for the mean I want to get
# real calculation, really slow on my own data.
for(i in 1:nrow(interval.chr1)){
count.sub<-subset(spe, pos>=interval.chr1$pos[i] & pos<=interval.chr1$pos[i+1])
meanrho.chr1[i]<-mean(count.sub$rho)
}
You really don't need to set up an output data.frame, although you can if you want to. Here is how I would code it, and I guarantee it will be fast.
> dat$incrmt <- dat$V1 %/% 100
> dat
V1 V2 incrmt
1 5380 30.07383 53
2 5390 30.87000 53
3 5393 0.07383 53
4 5404 6.00000 54
5 5428 30.07383 54
6 5437 1.00000 54
7 5440 9.00000 54
8 5443 30.07383 54
9 5459 6.00000 54
10 5463 30.07383 54
11 5480 7.00000 54
12 5521 30.07383 55
13 5538 0.00000 55
14 5584 20.00000 55
15 5673 30.07383 56
16 5720 30.07383 57
17 5841 3.00000 58
18 5880 30.07383 58
19 5913 4.00000 59
20 5958 30.07383 59
> with(dat, tapply(V2, incrmt, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
You could get by with even less setup (skipping the incrmt variable) with this code:
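Since the question also asks for Python, the same idea (integer-divide the first column by the window to get a bin key, then average the second column per key) might be sketched with the standard library alone. The function name `mean_by_interval` and the `sample` list are made up for illustration; the numbers are the 20 rows shown in the question.

```python
from collections import defaultdict

def mean_by_interval(rows, window=100):
    """Average the value of each (pos, value) pair per pos // window bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, value in rows:
        key = pos // window  # same bin index as V1 %/% 100 in the R code
        sums[key] += value
        counts[key] += 1
    # Only bins that actually contain rows are returned, like tapply.
    return {key: sums[key] / counts[key] for key in sorted(sums)}

# The 20 sample rows from the question (hypothetical variable name).
sample = [
    (5380, 30.07383), (5390, 30.87), (5393, 0.07383), (5404, 6.0),
    (5428, 30.07383), (5437, 1.0), (5440, 9.0), (5443, 30.07383),
    (5459, 6.0), (5463, 30.07383), (5480, 7.0), (5521, 30.07383),
    (5538, 0.0), (5584, 20.0), (5673, 30.07383), (5720, 30.07383),
    (5841, 3.0), (5880, 30.07383), (5913, 4.0), (5958, 30.07383),
]

means = mean_by_interval(sample)
for key, avg in means.items():
    print(key, avg)
```

Each row is visited once, so the work grows with the number of rows rather than with rows times intervals, which is what makes the loop over `subset` in the question slow.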
> with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
53 54 55 56 57 58 59
20.33922 14.90269 16.69128 30.07383 30.07383 16.53692 17.03692
If you want the result to be available for something else:
by100MeanV2 <- with(dat, tapply(V2, V1 %/% 100, mean, na.rm=TRUE))
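The desired output in the question also lists intervals that contain no rows (shown there with a mean of 0), while `tapply` only returns groups that occur in the data. A rough Python sketch of one way to produce that layout while streaming over the lines is below. The function name `interval_means`, its parameters, the choice to label each interval by its start position, and the fill value for empty intervals are all assumptions, not something the original post specifies.

```python
def interval_means(lines, window=100, limit=None, empty=0.0):
    """Return (interval_start, mean) pairs for every interval from 0 upward.

    lines  : iterable of text lines with two whitespace-separated fields
    limit  : largest position to cover (assumed); defaults to the data's maximum bin
    empty  : value reported for intervals with no rows (assumed to be 0)
    """
    sums, counts = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pos, value = int(fields[0]), float(fields[1])
        key = pos // window  # bin index, as with V1 %/% 100
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1

    last = max(counts, default=-1) if limit is None else limit // window
    result = []
    for key in range(last + 1):
        if key in counts:
            result.append((key * window, sums[key] / counts[key]))
        else:
            result.append((key * window, empty))
    return result

# Small made-up input to show the shape of the result.
demo = ["150\t2.0\n", "160\t4.0\n", "420\t10.0\n"]
result = interval_means(demo)
for start, avg in result:
    print(f"{start}, {avg}")
```

Because the function accepts any iterable of lines, an open file handle (for example one opened on `my_data_file`) could be passed directly, so the whole file never has to be held in memory. Rows beyond `limit`, when it is given, are read but not reported.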