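I have dug through many threads, but none of them solved this problem.
I want to add the string chr to the beginning of one column on every line. The file is tab-delimited and looks like this: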
re1 1 AGT
re2 1 AGT
re3 2 ACGTCA
re12 3 ACGTACT
What I need is:
re1 chr1 AGT
re2 chr1 AGT
re3 chr2 ACGTCA
re12 chr3 ACGTACT
A bash one-liner would be fine.
Any help is much appreciated. Cheers, Irek
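One possible approach is an awk one-liner that treats the file as tab-separated and rewrites only the second field. This is a minimal sketch: the file names input.tsv and output.tsv are hypothetical, and it assumes the column to prefix is always the second one, as in the sample above.

```shell
# Hypothetical sample file reproducing the input shown in the question.
printf 're1\t1\tAGT\nre2\t1\tAGT\nre3\t2\tACGTCA\nre12\t3\tACGTACT\n' > input.tsv

# Split and join fields on tabs (FS/OFS), then prepend "chr" to field 2 on every line.
awk 'BEGIN { FS = OFS = "\t" } { $2 = "chr" $2; print }' input.tsv > output.tsv

cat output.tsv
```

Setting OFS matters here: assigning to $2 makes awk rebuild the line, and without OFS set to a tab the output fields would be joined with single spaces instead.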
I have the following code:
import HTSeq

reference = open('genome.fa','r')
sequences = dict( (s.name, s) for s in HTSeq.FastaReader(reference))
out = open('homopolymers_in_ref','w')

def find_all(a_str,sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)

homa = 'AAAAAAAAAA'
homc = 'CCCCCCCCCC'
homg = 'GGGGGGGGGG'
homt = 'TTTTTTTTTT'

for key,line in sequences.items():
    seq = str(line)
    a = list(find_all(seq,homa))
    c = list(find_all(seq,homc))
    g = list(find_all(seq,homg))
    t = list(find_all(seq,homt))
    for i in a:
        ## print i,key,'A'
        out.write(str(i)+'\t'+str(key)+'\t'+'A'+'\n')
    for i in …