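Hello,
I am currently helping to build a website that aims to bring all papillomavirus information together in one place. As part of this effort, we are curating all known files on public servers (e.g. GenBank). One problem I have run into is that many (about 50%) of the solved structures are not numbered according to the protein. That is, a subdomain was crystallized (amino acids 310-450), yet the crystallographer deposited it as residues 1-140. I was wondering whether anyone knows a way to renumber an entire PDB file. I have found ways to renumber the sequence (identified by SEQRES), but that does not update the helix and sheet information. I would appreciate any suggestions...
Thanks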
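I am the maintainer of pdb-tools, which may be a tool that can help you.
I recently modified the residue-renumber script in my application to provide more flexibility. It can now renumber HETATMs and specific chains, force the residue numbers to be contiguous, or simply add a user-specified offset to all residues.
Let me know if this helps.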
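To illustrate the offset idea for the case in the question (residues 1-140 that should start at 310, so an offset of 309), here is a minimal sketch in plain Python. It is not the pdb-tools implementation; `RESNUM_FIELDS`, `shift_line` and `shift_pdb_text` are hypothetical names made up for this example. It assumes a well-formed fixed-column PDB file, shifts every chain by the same amount, and also touches the HELIX and SHEET records. Other records that carry residue numbers (SSBOND, LINK, SITE, REMARK 465 and so on) are not handled, and numbers that no longer fit in four columns are not guarded against.

```python
# Zero-based slices of the residue-number fields for each record type,
# derived from the fixed-column PDB format (e.g. ATOM resSeq = columns 23-26).
RESNUM_FIELDS = {
    "ATOM": [(22, 26)],
    "HETATM": [(22, 26)],
    "ANISOU": [(22, 26)],
    "TER": [(22, 26)],
    # HELIX: initSeqNum = columns 22-25, endSeqNum = columns 34-37
    "HELIX": [(21, 25), (33, 37)],
    # SHEET: init/end residue plus the two registration residues
    "SHEET": [(22, 26), (33, 37), (50, 54), (65, 69)],
}


def shift_line(line, offset):
    """Return one PDB line with every residue-number field shifted by offset."""
    fields = RESNUM_FIELDS.get(line[:6].strip())
    if fields is None:
        return line  # record type without a known residue-number field
    for start, end in fields:
        text = line[start:end]
        if text.strip():  # blank fields (e.g. first strand of a sheet) stay blank
            line = line[:start] + "%4d" % (int(text) + offset) + line[end:]
    return line


def shift_pdb_text(pdb_text, offset):
    """Apply shift_line to every line of a PDB file given as one string."""
    lines = pdb_text.splitlines()
    return "\n".join(shift_line(line, offset) for line in lines) + "\n"
```

For a file on disk, something like `shift_pdb_text(open("input.pdb").read(), 309)` would return the shifted text (the file name is a placeholder). A per-chain offset would need an extra check on the chain ID column of each field, which is left out here to keep the sketch short.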
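I run into this problem often as well. After giving up on old Perl scripts, I have been experimenting with some Python. This solution assumes you have Biopython, ProDy ( http://www.csb.pitt.edu/ProDy/#prody ) and EMBOSS ( http://emboss.sourceforge.net/ ) installed.
I used one of the papillomavirus PDB entries here.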
from Bio import AlignIO, SeqIO, ExPASy, SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Bio.Alphabet was removed in Biopython 1.78, so sequences are built without an alphabet
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB
import os

# Three-letter to one-letter codes for the 20 standard amino acids
oneletter = {
    'ASP':'D','GLU':'E','ASN':'N','GLN':'Q',
    'ARG':'R','LYS':'K','PRO':'P','GLY':'G',
    'CYS':'C','THR':'T','SER':'S','MET':'M',
    'TRP':'W','PHE':'F','TYR':'Y','HIS':'H',
    'ALA':'A','VAL':'V','LEU':'L','ILE':'I',
}

# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
pdbname = "2kpl"
selection = "chain A"
structure = parsePDB(pdbname)
ca_atoms = structure.select("protein and name CA and %s" % selection)
pdbseq_str = ''.join([oneletter[i] for i in ca_atoms.getResnames()])
alnPDBseq = SeqRecord(Seq(pdbseq_str), id=pdbname)
SeqIO.write(alnPDBseq, "%s.fasta" % pdbname, "fasta")

# Retrieve reference sequence
accession = "Q96QZ7"
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq = SeqRecord(Seq(swissseq.sequence), id=accession)
SeqIO.write(refseq, "%s.fasta" % accession, "fasta")

# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta" % pdbname, bsequence="%s.fasta" % accession,
                               gapopen=10, gapextend=0.5, outfile="needle.out")
needle_cli()
aln = AlignIO.read("needle.out", "emboss")
os.remove("needle.out")
os.remove("%s.fasta" % pdbname)
os.remove("%s.fasta" % accession)
alnPDBseq = aln[0]
alnREFseq = aln[1]

# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"] = [None] * len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1.
# This numbers alignment columns, so it also assumes the aligned reference contains no gaps.
alnREFseq.letter_annotations["resnum"] = list(range(1, len(alnREFseq) + 1))

# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i, alnREFseq.letter_annotations["resnum"][i]]
           for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for i, r in reslist:
    alnPDBseq.letter_annotations["resnum"][i] = r

# Set new residue numbers in the structure
newresnums = [i for i in alnPDBseq.letter_annotations["resnum"] if i is not None]
resindices = ca_atoms.getResindices()
resmatrix = [[newresnums[i], resindices[i]] for i in range(len(newresnums))]
for newresnum, resindex in resmatrix:
    structure.select("resindex %d" % resindex).setResnums(newresnum)
writePDB("%s.renumbered.pdb" % pdbname, structure)
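One caveat with the numbering step above: the reference numbers are assigned per alignment column, so if the PDB chain contains residues that the reference lacks (an expression tag, for example), needle puts gaps in the reference row and every number after them shifts. The following is a small sketch of a gap-aware mapping in plain Python. `map_resnums` and `first_ref_resnum` are hypothetical names introduced for this example and are not part of Biopython or ProDy.

```python
def map_resnums(aln_pdb, aln_ref, first_ref_resnum=1):
    """Map each residue of the aligned PDB sequence to a reference residue number.

    aln_pdb and aln_ref are the two rows of a pairwise alignment (equal length,
    '-' for gaps). Returns one entry per PDB residue: the reference number, or
    None where the PDB residue faces a gap in the reference.
    """
    if len(aln_pdb) != len(aln_ref):
        raise ValueError("aligned sequences must have the same length")
    resnums = []
    ref_resnum = first_ref_resnum - 1
    for pdb_char, ref_char in zip(aln_pdb, aln_ref):
        if ref_char != "-":
            ref_resnum += 1  # only real reference residues advance the counter
        if pdb_char != "-":
            resnums.append(ref_resnum if ref_char != "-" else None)
    return resnums
```

In the script above, the two inputs would be `str(aln[0].seq)` and `str(aln[1].seq)`. Residues that come back as `None` have no counterpart in the reference, so how to number them (negative numbers, insertion codes, or leaving them out) is a decision the sketch deliberately leaves open.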