Tags: python, csv, gzip, bioinformatics
To start off: I'm new to bioinformatics and especially to programming, but I have built a script that goes through a so-called VCF file (only the individuals are included, one column = one individual) and uses a search string to find out, for every variant (line), whether the individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to do this on very large zipped files (even whole genomes), but I don't know how to transform this script into one that does everything line by line (since I want to count whole columns, I just don't see how to solve that).

So the output per individual is five things (total variants, number of homozygotes, number of heterozygotes, and the proportions of homozygotes and heterozygotes). See the code below:
#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'
gz_infile = gzip.GzipFile(subset_cols, "r")
#gz_outfile = gzip.GzipFile(nuc_div, "w")

# make a dictionary of the header line for easy retrieval of elements later on
headers = gz_infile.readline().rstrip().split('\t')
print headers
column_dict = {}
for header in headers:
    column_dict[header] = []
for line in gz_infile:
    columns = line.rstrip().split('\t')
    for i in range(len(columns)):
        c_header = headers[i]
        column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:
    number_homozygotes = 0
    number_heterozygotes = 0
    for values in column_dict[key]:
        # this search string contains the regexp (this regexp was tested)
        SearchStr = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
        Result = re.search(SearchStr, values)
        if Result:
            # here, it will skip the missing genotypes ./.
            variant_one = int(Result.group(1))
            variant_two = int(Result.group(2))
            if variant_one == 0 and variant_two == 0:
                continue
            elif variant_one == variant_two:
                # count +1 in case variant one and two are equal (so 0/0, 1/1, etc.)
                number_homozygotes += 1
            elif variant_one != variant_two:
                # count +1 in case variant one is not equal to variant two (so 1/0, 0/1, etc.)
                number_heterozygotes += 1
    print "%s homozygotes %s" % (number_homozygotes, key)
    print "%s heterozygotes %s" % (number_heterozygotes, key)
    variants = number_homozygotes + number_heterozygotes
    print "%s variants" % variants
    prop_homozygotes = (1.0 * number_homozygotes / variants) * 100
    prop_heterozygotes = (1.0 * number_heterozygotes / variants) * 100
    print "%s %% homozygous %s" % (prop_homozygotes, key)
    print "%s %% heterozygous %s" % (prop_heterozygotes, key)
Any help would be much appreciated, so that I can go on and investigate large datasets. Thanks! :)
By the way, the VCF file looks something like this: INDIVIDUAL_1 INDIVIDUAL_2 INDIVIDUAL_3 0/0:9,0:9:24:0,24,221 1/0:5,4:9:25:25,0,26 1/1:0,13:13:33:347,33,0

That is the header line with the individual ID names (I have 33 individuals in total, with more complicated ID tags; I simplified them here), and then I have a lot of these information lines, all with that same specific pattern. I'm only interested in the first slash-separated part, hence the regular expression.
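For illustration (this snippet is my addition, not part of the script above), the regular expression pulls the two alleles out of a single genotype field like so:

import re

field = '1/0:5,4:9:25:25,0,26'
m = re.search(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+', field)
if m:
    # prints: allele one: 1, allele two: 0
    print('allele one: %s, allele two: %s' % (m.group(1), m.group(2)))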
Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some sort of "streaming" or "line-by-line" algorithm to handle data that is too large to fit in your machine's RAM. Unfortunately, if you are limited to Python without libraries, you have to manually chunk the file and deal with parsing the VCF.
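To make the idea concrete, here is a minimal line-by-line sketch in plain Python 3 (my own illustration, assuming the simplified header and genotype layout from the question): the standard-library gzip module decompresses as it reads, and a pair of running counters per individual replaces the per-column lists, so memory use stays flat no matter how big the file is.

import re
import gzip

pattern = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

with gzip.open('subset_cols_chr18.vcf.gz', 'rt') as infile:  # 'rt': decompressed text, read incrementally
    headers = infile.readline().rstrip().split('\t')
    counts = {h: [0, 0] for h in headers}  # per individual: [homozygotes, heterozygotes]
    for line in infile:  # only one line is held in memory at a time
        for header, field in zip(headers, line.rstrip().split('\t')):
            m = pattern.search(field)
            if m:  # missing genotypes like ./. don't match and are skipped
                a, b = int(m.group(1)), int(m.group(2))
                if (a, b) != (0, 0):  # mirror the original script's skipping of 0/0
                    counts[header][0 if a == b else 1] += 1

for header in headers:
    n_hom, n_het = counts[header]
    variants = n_hom + n_het
    if variants:
        print('%s: %d variants, %.2f%% homozygous, %.2f%% heterozygous'
              % (header, variants, 100.0 * n_hom / variants, 100.0 * n_het / variants))

That said, hand-rolling this for real VCFs gets painful quickly, which is where Hail comes in.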
The Hail project is a free, open-source tool for scientists whose genetic data is too big to fit in RAM, all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical-genetics domain-specific language, which makes your question much shorter to express.

The most faithful translation of your Python code to Hail is this:
/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
I ran the above command on my dual-core laptop on a 2.0 GB file:
# ls -alh profile225.vcf.bgz
-rw-r--r-- 1 dking 1594166068 2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
file:/Users/dking/projects/hail-data/profile225.vcf.bgz import clean
hail: info: timing:
importvcf: 34.211s
annotatesamples expr: 6m52.4s
annotatesamples expr: 21.399ms
exportsamples: 121.786ms
total: 7m26.8s
# head sampleInfo.tsv
sample pHomRef pHet nHom nHet nCalled
HG00096 9.49219e-01 5.07815e-02 212325 11359 223684
HG00097 9.28419e-01 7.15807e-02 214035 16502 230537
HG00099 9.27182e-01 7.28184e-02 211619 16620 228239
HG00100 9.19605e-01 8.03948e-02 214554 18757 233311
HG00101 9.28714e-01 7.12865e-02 214283 16448 230731
HG00102 9.24274e-01 7.57260e-02 212095 17377 229472
HG00103 9.36543e-01 6.34566e-02 209944 14225 224169
HG00105 9.29944e-01 7.00564e-02 214153 16133 230286
HG00106 9.25831e-01 7.41687e-02 213805 17128 230933
Wow! Seven minutes for 2 GB, that's slow! Unfortunately, this is because VCF isn't a great format for data analysis!
Let's convert to Hail's optimized storage format, the VDS, and re-run the command:
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:> (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
read: 2.969s
annotatesamples expr: 1m20.4s
annotatesamples expr: 21.868ms
exportsamples: 151.829ms
total: 1m23.5s
About five times faster! As for larger scale: running the same command on Google Cloud, on the VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, roughly 315 GB compressed), took 3m42s using 328 worker cores.
Hail also has a sampleqc command that computes most of what you want (and more!):
../hail/build/install/hail/bin/hail read -i profile225.vds \
sampleqc \
annotatesamples expr -c \
'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
read: 2.928s
sampleqc: 1m27.0s
annotatesamples expr: 229.653ms
exportsamples: 353.942ms
total: 1m30.5s
Installing Hail is pretty easy, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room, or, if you prefer forums, the Hail discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).
In the near future (less than one month from today), the Hail team will have finished a Python API, which will allow you to express the first snippet as:
result = (importvcf("YOUR_FILE.vcf.gz")
          .annotatesamples('''sa.nCalled = gs.filter(g => g.isCalled).count(),
                              sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                              sa.nHet = gs.filter(g => g.isHet).count()''')
          .annotatesamples('''sa.pHom = sa.nHom / sa.nCalled,
                              sa.pHet = sa.nHet / sa.nCalled'''))

for x in result.sampleannotations:
    print("Sample " + x +
          " nCalled " + x.nCalled +
          " nHom " + x.nHom +
          " nHet " + x.nHet +
          " percent Hom " + x.pHom * 100 +
          " percent Het " + x.pHet * 100)

result.sampleannotations.write("sampleInfo.tsv")
Edit: added the head output of the tsv file.

Edit 2: the latest Hail no longer requires biallelic variants for sampleqc.

Edit 3: added a note about scaling out to the cloud with hundreds of cores.
To be able to handle a dataset bigger than RAM, you need to redesign your algorithm to process the data line by line; right now you are processing each column.

But before that, you need a way to stream rows out of the gzip'd file.

The following Python 3 code does that:
"""/sf/answers/2838399721/"""
#!/usr/bin/env python3
import zlib
from mmap import PAGESIZE
CHUNKSIZE = PAGESIZE
# This is a generator that yields *decompressed* chunks from
# a gzip file. This is also called a stream or lazy list.
# It's done like so to avoid to have the whole file into memory
# Read more about Python generators to understand how it works.
# cf. `yield` keyword.
def gzip_to_chunks(filename):
decompressor = zlib.decompressobj(zlib.MAX_WBITS + 16)
with open(filename, 'rb') as f:
chunk = f.read(CHUNKSIZE)
while chunk:
out = decompressor.decompress(chunk)
yield out
chunk = f.read(CHUNKSIZE)
out = decompressor.flush()
yield out
# Again the following is a generator (see the `yield` keyword).
# What id does is iterate over an *iterable* of strings and yields
# rows from the file
# (hint: `gzip_to_chunks(filename)` returns a generator of strings)
# (hint: a generator is also an iterable)
# You can verify that by calling `chunks_to_rows` with a list of
# strings, where every strings is a chunk of the VCF file.
# (hint: a list is also an iterable)
# inline doc follows
def chunks_to_rows(chunks):
row = b'' # we will add the chars making a single row to this variable
for chunk in chunks: # iterate over the strings/chuncks yielded by gzip_to_chunks
for char in chunk: # iterate over all chars from the string
if char == b'\n'[0]: # hey! this is the end of the row!
yield row.decode('utf8').split('\t') # the row is complete, yield!
row = b'' # start a new row
else:
row += int.to_bytes(char, 1, byteorder='big') # Otherwise we are in the middle of the row
# at this point the program has read all the chunk
# at this point the program has read all the file without loading it fully in memory at once
# That said, there's maybe still something in row
if row:
yield row.decode('utf-8').split('\t') # yield the very last row if any
for e in chunks_to_rows(gzip_to_chunks('conceptnet-assertions-5.6.0.csv.gz')):
uid, relation, start, end, metadata = e
print(start, relation, end)
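To tie this back to your VCF question: once rows arrive one at a time, you never store a column. Instead you keep a small running tally per individual and update it as each row streams past. Here is a rough sketch of that final step, reusing the two generators above and the regex from your script (my addition, assuming the simplified layout from your example):

import re

pattern = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

rows = chunks_to_rows(gzip_to_chunks('subset_cols_chr18.vcf.gz'))
headers = next(rows)                   # the first row is the header line
counts = {h: [0, 0] for h in headers}  # per individual: [homozygotes, heterozygotes]

for row in rows:                       # only one row in memory at a time
    for header, field in zip(headers, row):
        m = pattern.search(field)
        if m:                          # missing genotypes like ./. don't match
            a, b = int(m.group(1)), int(m.group(2))
            if (a, b) != (0, 0):       # skip 0/0, as in your original script
                counts[header][0 if a == b else 1] += 1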
Edit: reworked the answer and made it work with the gzip'd tsv file of ConceptNet.