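I am trying to build a tree with BioPython, using the Phylo module.
What I have so far is this image:
Each name has a four-digit number followed by - and a number: this number indicates how many times the sequence is represented. So for 1578 - 22, that node should represent 22 sequences.
So now I know how to change the size of each node. Each node has a different size, which is easy to do with an array of the different values: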
fh = open(MEDIA_ROOT + "groupsnp.txt")
list_size = {}
for line in fh:
    if '>' in line:
        labels = line.split('>')
        label = labels[-1]
        label = label.split()
        num = line.split('-')
        size = num[-1]
        size = size.split()
        for lab in label:
            for number in size:
                list_size[lab] = int(number)
a = array(list_size.values())
But the array is in arbitrary order, and I want to assign the right size to the right node. I tried this:
for elem in list_size.keys():
    if labels == elem:
        Phylo.draw_graphviz(tree_xml, prog="neato", node_size=a)
But nothing shows up when I use the if statement.
Is there any way to do this?
I would really appreciate it!
Thanks, everyone.
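A minimal sketch of one way to line the sizes up with the nodes, assuming (the question does not confirm this) that the drawing routine expects `node_size` as a list in the same order as the graph's nodes. The helper names `parse_counts` and `sizes_in_node_order`, the `scale` factor, and the sample labels are hypothetical; the real node order would have to be read from the graph being drawn.

```python
import re


def parse_counts(fasta_lines):
    """Map each header label such as '1578-22' to its trailing count (22)."""
    counts = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # only header lines carry the label
        label = line[1:].strip()
        match = re.search(r"-\s*(\d+)$", label)
        if match:
            counts[label] = int(match.group(1))
    return counts


def sizes_in_node_order(node_names, counts, default=1, scale=50):
    """Return one size per node, in the same order as node_names."""
    return [scale * counts.get(name, default) for name in node_names]


lines = [">1578-22", "ACGT", ">0342-3", "ACGA"]
counts = parse_counts(lines)
node_names = ["0342-3", "inner1", "1578-22"]  # hypothetical drawing order
node_sizes = sizes_in_node_order(node_names, counts)
print(node_sizes)  # [150, 50, 1100]
```

Unlabelled internal nodes fall back to `default`, so the list always has exactly one entry per node.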
I am trying to score sequences that are already aligned. Let's say
seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'
with the given parameters
substitution matrix : blosum62
gap open penalty : -5
gap extension penalty : -1
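I did go through the Biopython cookbook, but all I could get from it was the substitution matrix blosum62. I feel someone must already have implemented a library for this kind of scoring.
So can anyone suggest a library, or the shortest code, that would solve my problem?
Thanks in advance.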
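A minimal sketch of scoring an existing alignment column by column, under a stated assumption about the gap convention: the first position of a gap run costs the gap open penalty and every further position costs the gap extension penalty (other tools charge open plus extend for the first position). The names `score_aligned_pair` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for a BLOSUM62 lookup so the example runs without extra packages.

```python
def score_aligned_pair(seq1, seq2, pair_score, gap_open=-5, gap_extend=-1, gap_char="-"):
    """Score two already-aligned, equal-length sequences column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap1 = in_gap2 = False  # inside a run of gaps in seq1 / seq2?
    for a, b in zip(seq1, seq2):
        if a == gap_char and b == gap_char:
            continue  # gap in both sequences: the column is skipped
        if a == gap_char:
            total += gap_extend if in_gap1 else gap_open
            in_gap1, in_gap2 = True, False
        elif b == gap_char:
            total += gap_extend if in_gap2 else gap_open
            in_gap1, in_gap2 = False, True
        else:
            total += pair_score(a, b)
            in_gap1 = in_gap2 = False
    return total


def toy_score(a, b):
    """Stand-in for a substitution matrix: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(score_aligned_pair("AC--GT", "ACTTGA", toy_score))  # -4
```

With a recent Biopython installed, `pair_score` could instead look values up in the matrix returned by `Bio.Align.substitution_matrices.load("BLOSUM62")`; the exact indexing syntax is worth checking against the installed version.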
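I am trying to install Biopython to run with Python 3.3 on a Windows 7 machine.
I have downloaded the Biopython executable biopython-1.61.win32-py3.3-beta.exe. However, when I try to run the executable, I get the message "Python version 3.3 required, which was not found in the registry." Python version 3.3 is present on my computer, and I have been running programs with it for a month or two. It was installed from the file python-3.3.0.amd64.msi and is located in the Program Files (x86) directory. I tried reinstalling Python 3.3 but got the same error message.
Does anyone know how to fix this?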
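One detail in the file names may be worth checking, though the question does not confirm it is the cause: the Biopython installer is a win32 build, while the Python MSI is an amd64 build. A small sketch, using only the standard library, that reports whether the running interpreter is 32-bit or 64-bit:

```python
import platform
import struct
import sys

# Pointer size in bytes times 8 gives the interpreter's word size.
bits = struct.calcsize("P") * 8
print("Python %d.%d, %d-bit" % (sys.version_info[0], sys.version_info[1], bits))

# platform.architecture() reports the same information as a string.
print(platform.architecture()[0])
```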
I currently have the following code to query PubMed:
from Bio import Entrez

Entrez.email = "kuharrw@hiram.edu"  # Always tell NCBI who you are
handle = Entrez.esearch(db="pubmed", term="bacteria")
record = Entrez.read(handle)
id_list = record["IdList"]  # renamed from `list` to avoid shadowing the built-in
print(len(id_list))
for index, pmid in enumerate(id_list):
    # One esummary request per PubMed ID
    handle = Entrez.esummary(db="pubmed", id=pmid)
    record = Entrez.read(handle)
    print(index)
    print(record[0]["Title"])
    print(record[0]["HasAbstract"])
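This code can tell me whether an article has an abstract, but I cannot find any documentation on how to actually return the abstract. Is this possible with Biopython? If not, is there another way?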
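A hedged sketch of one possible route: `Entrez.efetch` with `db="pubmed"` and `retmode="xml"` returns full records, and the abstract text can then be pulled out of the parsed structure. The nesting used below (`PubmedArticle` → `MedlineCitation` → `Article` → `Abstract` → `AbstractText`) is what recent Biopython versions return as far as I know; older versions may return a plain list instead, so treat the layout as an assumption. The function names and the mock record are hypothetical, and the network call is kept in its own function so the extraction step can be tried offline.

```python
def fetch_pubmed_records(id_list, email):
    """Fetch full PubMed records for a list of IDs (needs Biopython and network)."""
    from Bio import Entrez  # imported here so the helper below works without it

    Entrez.email = email
    handle = Entrez.efetch(db="pubmed", id=",".join(id_list), retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def extract_abstracts(records):
    """Return {pmid: abstract text}; articles without an abstract map to ''."""
    abstracts = {}
    for article in records.get("PubmedArticle", []):
        citation = article["MedlineCitation"]
        pmid = str(citation["PMID"])
        parts = citation["Article"].get("Abstract", {}).get("AbstractText", [])
        abstracts[pmid] = " ".join(str(part) for part in parts)
    return abstracts


# Hand-made record mimicking the assumed structure, so no request is sent.
mock = {
    "PubmedArticle": [
        {"MedlineCitation": {"PMID": "12345", "Article": {
            "Abstract": {"AbstractText": ["First part.", "Second part."]}}}},
        {"MedlineCitation": {"PMID": "67890", "Article": {}}},
    ]
}
print(extract_abstracts(mock))
# {'12345': 'First part. Second part.', '67890': ''}
```

Passing all IDs in one `efetch` call also avoids sending one request per article.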
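I am implementing an algorithm in Python using Biopython. I have several alignments (sets of equal-length sequences) stored in FASTA files. Each alignment contains between 500 and 30000 sequences, and each sequence is about 17000 elements long. Each sequence is stored as a Bio.SeqRecord.SeqRecord object (see the SeqRecord API documentation for more information), which holds not only the sequence but also some information about it. I read the alignment from disk using Bio.AlignIO.read() (see the AlignIO module API documentation for more information), which returns a MultipleSeqAlignment object: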
seqs = AlignIO.read(seqs_filename, 'fasta')
len_seqs = seqs.get_alignment_length()
stats = {'-': [0.0] * len_seqs, 'A': [0.0] * len_seqs,
         'G': [0.0] * len_seqs, 'C': [0.0] * len_seqs,
         'T': [0.0] * len_seqs}
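To make the `stats` layout above concrete, here is a small sketch on plain strings instead of a MultipleSeqAlignment. It assumes the goal is the per-column frequency of each symbol, which the question does not state outright; `column_frequencies` and the toy alignment are hypothetical.

```python
def column_frequencies(sequences, alphabet="-AGCT"):
    """Return {symbol: [frequency per column]} for equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences in an alignment must have the same length")
    stats = {symbol: [0.0] * length for symbol in alphabet}
    for seq in sequences:
        for col, symbol in enumerate(seq.upper()):
            if symbol in stats:  # symbols outside the alphabet are ignored
                stats[symbol][col] += 1.0
    n = float(len(sequences))
    return {symbol: [count / n for count in counts]
            for symbol, counts in stats.items()}


toy = ["ACG-", "ACGT", "AAGT", "TCG-"]
stats = column_frequencies(toy)
print(stats["A"])  # [0.75, 0.25, 0.0, 0.0]
print(stats["-"])  # [0.0, 0.0, 0.0, 0.5]
```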
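I have included this sketch for clarity:

Because I want to parallelize the analysis of the alignment, I used the threading module to assign a slice to each available CPU (details on why I made this decision come later):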
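num_cpus = cpu_count()
num_columns = int(ceil(len_seqs / float(num_cpus)))  # columns per thread, as an int
start_column = 0
threads = []
for cpu in range(num_cpus):
    section = (start_column, start_column + num_columns)
    threads.append(CI_Thread(seqs_type, seqs, section, …

I am writing a function that is supposed to go through a .fasta file of DNA sequences and create a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I then store each dictionary in a list called "frequency". Here is the piece of code that behaves oddly: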
for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
            freq[dinucleotide] = twofreq
    frequency.append(freq)
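(By the way, I am using Biopython so that I do not have to load the whole fasta file into memory. This is also for ssDNA, so I do not need to account for antisense dnt.)
The frequencies recorded for the single nt add up to 1.0, but the frequencies for the dnt do not add up to 1.0. Since the methods for computing the two kinds of frequency look identical to me, this is odd.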
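One possible explanation, offered as a reading of the code rather than a confirmed diagnosis: `str.count` counts non-overlapping occurrences, so in a run such as `AAAA` the dinucleotide `AA` is counted twice instead of three times, while the denominator `len(dna) - 1` assumes every overlapping window is counted. A minimal sketch with a sliding window, where `dinucleotide_frequencies` and the sample string are hypothetical:

```python
from collections import Counter


def dinucleotide_frequencies(dna):
    """Frequency of every overlapping 2-letter window in dna."""
    dna = dna.upper()
    pairs = Counter(dna[i:i + 2] for i in range(len(dna) - 1))
    total = float(len(dna) - 1)  # number of overlapping windows
    return {pair: count / total for pair, count in pairs.items()}


dna = "AAAACG"
print(dna.count("AA"))  # 2, because matches may not overlap
freqs = dinucleotide_frequencies(dna)
print(freqs["AA"])  # 0.6, i.e. 3 of the 5 windows
print(sum(freqs.values()))
```

Single nucleotides cannot overlap with themselves, which would explain why only the dnt totals are affected.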
I left the diagnostic print statements and the "check" variables in:
for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    check = 0.0
    check2 = 0.0
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna) …

This question is related to bioinformatics. I did not receive any suggestions on the corresponding forum, so I am posting it here.
I need to remove non-ACTG nucleotides from a fasta file and write the output to a new file using SeqIO from Biopython.
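The filtering step itself can be sketched on plain FASTA text with the standard library alone. This is only an illustration of the character stripping, not the SeqIO-based approach; `strip_non_acgt` and `clean_fasta_lines` are hypothetical names.

```python
import re

NON_ACGT = re.compile(r"[^ACGT]")


def strip_non_acgt(sequence):
    """Upper-case the sequence and drop every character that is not A, C, G or T."""
    return NON_ACGT.sub("", sequence.upper())


def clean_fasta_lines(lines):
    """Yield FASTA lines with header lines kept and sequence lines cleaned."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        elif line:
            yield strip_non_acgt(line)


fasta = [">seq1 test", "ACGTNNRYacgt", "AC-GT"]
print(list(clean_fasta_lines(fasta)))  # ['>seq1 test', 'ACGTACGT', 'ACGT']
```

When working with SeqRecord objects instead, note that `re.sub` returns a plain `str`, so the cleaned text would need to be wrapped in `Seq(...)` before being assigned back to `seq_record.seq`.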
My code is:
import re
import sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
seq_list=[]
for seq_record in SeqIO.parse("test.fasta", "fasta",IUPAC.ambiguous_dna):
    sequence=seq_record.seq
    sequence=sequence.tomutable()
    seq_record.seq = re.sub('[^GATC]',"",str(sequence).upper())
    seq_list.append(seq_record)
SeqIO.write(seq_list,"test_out","fasta")
Running this code produces an error:
Traceback (most recent call last):
  File "remove.py", line 18, in <module>
    SeqIO.write(list,"test_out","fasta")
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 481, in write
    count = writer_class(fp).write_file(sequences)
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 209, in write_file
    count = self.write_records(records)
  File "/home/ghovhannisyan/Software/anaconda2/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 194, …

I am trying to change an earlier script that uses Biopython to get information about the phylum of a species. The script was written to retrieve information for one species at a time. I would like to modify it so that I can do this for 100 organisms at once. Here is the initial code:
import sys
from Bio import Entrez

def get_tax_id(species):
    """to get data from ncbi taxonomy, we need to have the taxid. we can
    get that by passing the species name to esearch, which will return
    the tax id"""
    species = species.replace(" ", "+").strip()
    search = Entrez.esearch(term = species, db = "taxonomy", retmode = "xml")
    record = Entrez.read(search)
    return record['IdList'][0]

def get_tax_data(taxid):
    """once we have the taxid, we can fetch the record"""
    search = Entrez.efetch(id = taxid, …

I am trying to install Biopython on Fedora 21 with Python 2.7. I did the following:
[mike@localhost Downloads](17:32)$ sudo pip2.7 install biopython
You are using pip version 6.1.1, however version 7.1.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting biopython
/usr/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:79: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading biopython-1.65.tar.gz (12.6MB)
    100% |████████████████████████████████| 12.6MB 33kB/s
Installing collected packages: biopython
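  Running setup.py install …

I am a newly self-taught programmer (apart from one very basic class) working in a biology lab. I have a script that iterates over RNAseq data from two different cell types and runs a t-test against another data set. It works for this application, but the code feels very crude, and I know I will be writing many similar scripts.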
How could the following code be written better, to make it more efficient?
Program goals:
import pandas as pd
from scipy.stats import ttest_ind
rnatest = {'Gene symbol':["GeneA","GeneB"],"rnaseq1A":[1,1.5],"rnaseq1B":[1.3,1.2],"rnaseq2A":[2.3,2.7],"rnaseq2B":[2,2.6]}
df = pd.DataFrame(rnatest)
GOIlist = ["GeneA","GeneB"]
GOI = []
mu = []
pval = []
for index, row in df.iterrows():
    if row['Gene symbol'] in GOIlist:
        t, p = ttest_ind([row["rnaseq1A"],row["rnaseq1B"]],[row["rnaseq2A"],row["rnaseq2B"]])
        GOI.append(row['Gene symbol'])
        mu.append(t)
        pval.append(p)
df2 = {'Gene symbol':GOI,"tVAL":mu, "pVAL":pval}
df2 = pd.DataFrame(df2)
print(df2)
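One possible tidier shape for the script, offered as a sketch rather than the definitive answer: filter the genes of interest with `isin` and let `ttest_ind` work across all rows at once through its `axis` argument, instead of looping with `iterrows`. The function name `ttest_genes` and its parameters are hypothetical; the column names and data are the ones from the question.

```python
import pandas as pd
from scipy.stats import ttest_ind


def ttest_genes(df, genes, group1_cols, group2_cols, gene_col="Gene symbol"):
    """Run one t-test per selected gene, comparing two groups of columns."""
    subset = df[df[gene_col].isin(genes)]
    # axis=1 compares the two groups row by row in a single call.
    t, p = ttest_ind(subset[group1_cols].to_numpy(dtype=float),
                     subset[group2_cols].to_numpy(dtype=float),
                     axis=1)
    return pd.DataFrame({gene_col: subset[gene_col].to_numpy(),
                         "tVAL": t, "pVAL": p})


rnatest = {'Gene symbol': ["GeneA", "GeneB"],
           "rnaseq1A": [1, 1.5], "rnaseq1B": [1.3, 1.2],
           "rnaseq2A": [2.3, 2.7], "rnaseq2B": [2, 2.6]}
df = pd.DataFrame(rnatest)
result = ttest_genes(df, ["GeneA", "GeneB"],
                     ["rnaseq1A", "rnaseq1B"], ["rnaseq2A", "rnaseq2B"])
print(result)
```

Wrapping the logic in a function with the column groups as arguments is a design choice aimed at the "many similar scripts" mentioned above: only the call changes from one data set to the next.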