Pie*_*rre 6 string erlang bioinformatics sequence mean
I am trying to compute the mean length of FASTA sequences using Erlang. A FASTA file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).

line([],{Sequences,Total}) -> {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
    case io:get_line(S,'') of
        eof -> {Sequences,Total};
        {error,_} ->{Sequences,Total};
        Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
    end .

test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().
Compile and run:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small FASTA file, but it takes hours to parse a larger one (> 100 MB). Why? I'm new to Erlang; could you please improve this code?
If you need really fast IO, you have to do a bit more trickery than usual.
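-module(g).
-export([s/0]).

s()->
    %% Read stdin (fd 0) through a port that delivers one binary per line,
    %% with the newline already removed.
    P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
    r(P, 0, 0),
    halt().

%% C counts header lines (sequences); L sums the bytes of all other lines.
%% Lines longer than 256 bytes arrive as {noeol, _} fragments first,
%% which this loop does not match.
r(P, C, L) ->
    receive
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->
            r(P, C+1, L);
        {P, {data, {eol, Line}}} ->
            r(P, C, L + size(Line));
        {'EXIT', P, normal} ->
            io:format("~p~n",[L/C])
    end.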
This is the fastest IO I know of, but note the -noshell -noinput flags. Compile as you would with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:
erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
and run:
time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464
real 0m3.241s
user 0m3.060s
sys 0m0.124s
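To sanity-check a printed mean on a small sample, the same counting rule as in r/3 can be applied to an in-memory binary. This is only a sketch: the module name fasta_mean and the function mean/1 are hypothetical and not part of the answer, and it swaps the port for binary:split/3, so it says nothing about IO speed.

```erlang
-module(fasta_mean).
-export([mean/1]).

%% Hypothetical helper: mean sequence length of a FASTA binary.
%% Header lines (starting with ">") count sequences; every other
%% line adds its byte size to the total, as r/3 does.
mean(Bin) when is_binary(Bin) ->
    Lines = binary:split(Bin, <<"\n">>, [global]),
    {Count, Total} = lists:foldl(fun count/2, {0, 0}, Lines),
    Total / Count.

count(<<$>, _/binary>>, {Count, Total}) -> {Count + 1, Total};
count(Line, {Count, Total}) -> {Count, Total + byte_size(Line)}.
```

For example, fasta_mean:mean(<<">t1\nACGT\nACGT\n>t2\nACGTACGT\n">>) sums 16 bases over 2 headers. Like the original, it fails with badarith on input that contains no header line.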
With -smp enable, but still native, it takes:
$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.103s
user 0m4.944s
sys 0m0.112s
Byte code, but with -smp disable (almost on par with native, because most of the work is done in the port!):
$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m3.565s
user 0m3.436s
sys 0m0.104s
Just for completeness, byte code with smp:
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.433s
user 0m5.236s
sys 0m0.128s
For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:
$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776
real 0m17.569s
user 0m16.749s
sys 0m0.664s
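One plausible cause of the different result, assuming the version being timed still measures each line with string:len(string:strip(L)) as the question's code does: string:strip/1 removes only blanks, so the trailing newline returned by io:get_line/2 is counted as part of the sequence. The gap of about 6.4 between the two results would be consistent with one extra character per sequence line. A minimal sketch (the module strip_demo and the function lengths/1 are hypothetical, for illustration only):

```erlang
-module(strip_demo).
-export([lengths/1]).

%% Returns {LengthAsInQuestion, LengthWithoutNewline} for one input line.
lengths(Line) ->
    %% string:strip/1 strips leading and trailing blanks only,
    %% so a trailing "\n" is still counted here.
    AsInQuestion = string:len(string:strip(Line)),
    %% Stripping the newline character explicitly excludes it.
    WithoutNewline = string:len(string:strip(Line, right, $\n)),
    {AsInQuestion, WithoutNewline}.
```

Calling strip_demo:lengths("ACGT\n") shows the two lengths differing by exactly one.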
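EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I am a little surprised. It has 3824397 lines and is 232 MB. That means the -smp disable version can handle 1.18 million text lines per second (71 MB/s of line-oriented IO).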
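The throughput figures can be reproduced from the file statistics and the wall-clock time of the native -smp disable run (3.241 s real). Using that particular timing is an assumption, since the edit does not say which run the numbers were taken from:

```latex
\[
\frac{3\,824\,397\ \text{lines}}{3.241\ \text{s}} \approx 1.18 \times 10^{6}\ \text{lines/s},
\qquad
\frac{232\ \text{MB}}{3.241\ \text{s}} \approx 71.6\ \text{MB/s}
\]
```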