Nod*_*nin 12 python biopython python-2.7 python-3.x
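I am using Python and regular expressions to find ORFs (open reading frames).
I need to find substrings that consist only of the letters A, T, G and C (no spaces or newlines).
Each one should start with ATG and end with TAG, TAA or TGA, and the search should be run starting from the first character of the sequence, then from the second, and then from the third: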
Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
GCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGG
CGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGT
GATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGAT
CATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTG
AGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTA
CCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTG
CTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAG
CTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAA
TCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCA
TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT
CTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCC
CTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGC
CGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCC
CACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATA
ACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACG
GGTCATTGAGAGAAATAA"
What I have tried:
# finding the stop codon here
def stop_codon(seq_0):
    for i in range(0,len(seq_0),3):
        if (seq_0[i:i+3]== "TAA" and i%3==0) or (seq_0[i:i+3]== "TAG" and i%3==0) or (seq_0[i:i+3]== "TGA" and i%3==0) :
            a =i+3
            break
        else:
            a = None

# finding the start codon here
startcodon_find =[m.start() for m in re.finditer('ATG', seq_0)]
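How can I check for a start codon, find the first stop codon after it, and then go on to the next start codon and the next stop codon?
I want to run this in three frames. As mentioned above, the three frames treat the first, second and third character of the sequence as the starting point.
The sequence also needs to be split into groups of three, so the result should look something like this: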
ATG TTT AAA ACA AAA TAT TTA TCT AAC CCA ATT GTC TTA ATA ACG CTG ATT TGA
Any help would be appreciated.
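The procedure described in the question (find a start codon, read up to the first in-frame stop codon, then look for the next start codon, and repeat for each of the three frames) can be sketched roughly as follows. This is only an illustrative sketch in Python 3 syntax, not the asker's code: the helper names `codons_in_frame` and `orfs_in_frame` and the toy sequence are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def orfs_in_frame(seq, frame):
    """Return ORFs (ATG through the first in-frame stop codon) found in one frame."""
    found = []
    current = None  # codons collected since the last start codon, or None
    for codon in codons_in_frame(seq, frame):
        if current is None:
            if codon == "ATG":
                current = [codon]
        else:
            current.append(codon)
            if codon in STOP_CODONS:
                found.append(" ".join(current))
                current = None  # resume the search for the next start codon
    return found


toy = "ATGTTTAAATGACCATGGGGTAA"  # hypothetical toy sequence
for frame in range(3):
    print(frame, orfs_in_frame(toy, frame))
```

An ORF that has a start codon but no stop codon before the end of the sequence is silently dropped here; whether that is the desired behaviour depends on the use case.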
My final answer is:
def orf_find(st0):
    seq_0=""
    for i in range(0,len(st0),3):
        if len(st0[i:i+3])==3:
            seq_0 = seq_0 + st0[i:i+3]+ " "

    ms_1 =[m.start() for m in re.finditer('ATG', seq_0)]
    ms_2 =[m.start() for m in re.finditer('(TAA)|(TAG)|(TGA)', seq_0)]

    def get_next(arr,value):
        for a in arr:
            if a > value:
                return a
        return -1

    codons = []
    start_codon=ms_1[0]
    while (True):
        stop_codon = get_next(ms_2,start_codon)
        if stop_codon == -1:
            break
        codons.append((start_codon,stop_codon))
        start_codon = get_next(ms_1,stop_codon)
        if start_codon==-1:
            break

    max_val = 0
    selected_tupple = ()
    for i in codons:
        k=i[1]-i[0]
        if k > max_val:
            max_val = k
            selected_tupple = i

    print "selected tupple is ", selected_tupple
    final_seq=seq_0[selected_tupple[0]:selected_tupple[1]+3]
    print final_seq
    print "The longest orf length is " + str(max_val)

output_file = open('Longorf.txt','w')
output_file.write(str(orf_find(st0)))
output_file.close()
The write call above does not put the results into the text file. All I get in the file is None. Why does this happen? Can someone help?
If you want to hand-code it:
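import re
from string import maketrans

# The outer lookahead lets overlapping ORFs be reported; the captured group is
# ATG plus whole codons, extended lazily up to the first in-frame stop codon.
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    # reverse the string, then swap each base for its complement
    return dna_seq[::-1].translate(maketrans("ATGC","TACG"))

def orfs(dna):
    # ORFs from both strands, with duplicates removed
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))

print orfs(Seq)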
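For readers on Python 3, where `string.maketrans` is no longer available, the same lookahead idea might be written as in the sketch below. It uses `str.maketrans` instead and additionally reports the start position and frame of each match. The function name `orfs_with_positions` and the toy sequence are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Same pattern as in the answer: the outer lookahead makes every match zero-width,
# so overlapping and nested candidates are all reported. Group 1 holds ATG plus
# whole codons, extended lazily up to (but not including) the first in-frame stop.
ORF_PATTERN = re.compile(r"(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))")

# Python 3 replacement for string.maketrans
COMPLEMENT = str.maketrans("ATGC", "TACG")


def revcomp(dna_seq):
    """Return the reverse complement of a DNA string."""
    return dna_seq[::-1].translate(COMPLEMENT)


def orfs_with_positions(dna):
    """Yield (start, frame, orf) for each ORF on the given strand."""
    for match in ORF_PATTERN.finditer(dna):
        start = match.start(1)
        yield start, start % 3, match.group(1)


example = "CCATGAAATAGATGCCCTGA"  # hypothetical toy sequence
for start, frame, orf in orfs_with_positions(example):
    print(start, frame, orf)
```

Because `.` does not match a newline, a sequence pasted with line breaks should be joined into a single string first, for example with `"".join(text.split())`.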
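Since you tagged the question with Biopython, I assume you know Biopython. Have you checked the documentation? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231 might help.
I adapted the code from the link above to work with your sequence: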
from Bio.Seq import Seq
seq = Seq("CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGCAAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGATCATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGCTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCATCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGATCTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCCCTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGCCGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCCCACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATAACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACGGGTCATTGAGAGAAATAA")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
    for frame in range(3):
        for pro in nuc[frame:].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print "%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame)
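The ORFs are also translated. You can choose a different translation table. See http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation
Edit: explanation of the code:
At the top I create a sequence object from your string. Note the seq = Seq("ACGT"). The two outer for loops produce the 6 different frames (3 per strand). The inner for loop translates each frame according to the chosen translation table and gets back an amino acid chain in which every stop codon is encoded as *. The split function splits this string at those placeholders and removes them, which gives a list of possible protein sequences. With min_pro_len you set the minimum amino acid chain length for a protein to be reported. Table 1 is the standard table. Check out http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1. There you can see that the start codon is AUG (the same as ATG) and that the stop codons (marked with *) are TAA, TAG and TGA, just as you wanted. You can also use other translation tables. Note that this code splits on stop codons only, so a reported chain does not necessarily begin at a start codon.
When you add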
print nuc[frame:].translate(table)
inside the second for loop, you get something like:
PQRGQQGTSQEGEQKLQNILEIAPRKASSQPGRFCPFHSLAQGATSPSRKDTMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQRDEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGMDLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL*REWVFIHSLPSPHSDPFTLTPLLSTPQSPQSVSF*LRKGIMAQGPTLCSELSTTTQKHKMLGQ*PGLWASHAPPSRTQMGFPNSLEPRMSIPEFCKGRVVRLPLSQNEAG*DLRPSYLQTFPDSSLRCNAQPSSQSQPPSIYICTYYLLFIYYLFICL*MYLFGRPGCPGGPSVGSCLQTDMFSVKTELSCPHLASLPCCLLFCLCLKQNIYLTQLS**R*FGDQAVATSLNLCSPREP*L*SPYGSLREI
(Note that the asterisks sit at the stop codon positions.)
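The six-frame loop structure explained above can also be illustrated without Biopython. The sketch below works on codons rather than amino acids: it walks both strands in all three frames and cuts the codon list at every stop codon, which mirrors what translate(table).split("*") does at the protein level. It is an assumption-laden illustration in Python 3 syntax, not Biopython's implementation; `stop_free_segments` and `min_codons` are hypothetical names, and the stop codons of the standard table are hard-coded.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard table
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq[::-1].translate(COMPLEMENT)


def stop_free_segments(seq, min_codons=1):
    """Yield (strand, frame, codons) for every stop-free run in all six frames."""
    for strand, nuc in [(+1, seq), (-1, reverse_complement(seq))]:
        for frame in range(3):
            run = []
            for i in range(frame, len(nuc) - 2, 3):
                codon = nuc[i:i + 3]
                if codon in STOP_CODONS:
                    # a stop codon closes the current run, like "*" in split("*")
                    if len(run) >= min_codons:
                        yield strand, frame, run
                    run = []
                else:
                    run.append(codon)
            if len(run) >= min_codons:  # trailing run with no stop codon after it
                yield strand, frame, run


for strand, frame, codons in stop_free_segments("ATGAAATAGCCC", min_codons=2):
    print(strand, frame, " ".join(codons))
```

As with the Biopython version, a run is reported whether or not it begins with ATG; trimming each run to its first ATG would be an extra step.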
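Edit: answer to your second question:
Your function only prints its results and has no return statement, so it returns None, and that is what str() turns into the text written to the file. You have to return the string that you want to write. Build an output string and return it at the end of the function: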
output = "selected tupple is " + str(selected_tupple) + "\n"
output += final_seq + "\n"
output += "The longest orf length is " + str(max_val) + "\n"
return output
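The difference between printing and returning can be seen in a small self-contained sketch. The two helper functions and the temporary file location below are made up for the demonstration (Python 3 syntax); only the file name Longorf.txt is taken from the question.

```python
import os
import tempfile


def report_printed(text):
    print(text)  # goes to the console only; the function implicitly returns None


def report_returned(text):
    return text + "\n"  # the caller receives the string and can write it


# write into a temporary directory so the demonstration leaves no stray files behind
path = os.path.join(tempfile.mkdtemp(), "Longorf.txt")
with open(path, "w") as output_file:  # the with block closes the file automatically
    output_file.write(str(report_printed("ATG AAA TAG")))  # writes the text "None"
    output_file.write("\n")
    output_file.write(report_returned("ATG AAA TAG"))      # writes the sequence

with open(path) as handle:
    print(handle.read())
```

The first write reproduces the symptom from the question, and the second shows the effect of returning the string.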