Is it possible with scikit-bio to extract genomic features stored in a GFF3-formatted file from a genome FASTA file?
Example:
genome.fasta
>sequence1
ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA
annotation.gff3
##gff-version 3
sequence1 source gene 1 78 . + . ID=gene1
sequence1 source mRNA 1 78 . + . ID=transcript1;parent=gene1
sequence1 source CDS 1 6 . + 0 ID=CDS1;parent=transcript1
sequence1 source CDS 73 78 . + 0 ID=CDS2;parent=transcript1
The desired sequence for the mRNA feature (transcript1) would be the concatenation of its two child CDS features. In this case, that would be 'ATGGAGCTATGA'.
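The extraction described above can be sketched in plain Python, without relying on any scikit-bio API. The function names (parse_attributes, spliced_cds) are made up for illustration. The sketch assumes whitespace-separated columns as shown in the example, plus-strand features only, and it matches the parent attribute case-insensitively because the example writes it in lower case.

```python
GENOME = {
    "sequence1": "ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA",
}

GFF3_TEXT = """\
##gff-version 3
sequence1 source gene 1 78 . + . ID=gene1
sequence1 source mRNA 1 78 . + . ID=transcript1;parent=gene1
sequence1 source CDS 1 6 . + 0 ID=CDS1;parent=transcript1
sequence1 source CDS 73 78 . + 0 ID=CDS2;parent=transcript1
"""


def parse_attributes(field):
    """Parse 'ID=x;parent=y' into a dict with lower-cased keys."""
    attrs = {}
    for pair in field.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip().lower()] = value.strip()
    return attrs


def spliced_cds(gff3_text, sequences, transcript_id):
    """Join the plus-strand CDS segments whose parent is transcript_id."""
    segments = []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split(None, 8)
        if len(fields) != 9:
            continue  # not a complete feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields
        if ftype != "CDS" or strand != "+":
            continue  # minus-strand features are not handled in this sketch
        if parse_attributes(attrs).get("parent") == transcript_id:
            segments.append((int(start), int(end), seqid))
    segments.sort()  # order segments by start coordinate
    # GFF3 coordinates are 1-based and inclusive, hence start - 1
    return "".join(sequences[seqid][start - 1:end] for start, end, seqid in segments)


print(spliced_cds(GFF3_TEXT, GENOME, "transcript1"))
```

For transcript1 this joins bases 1-6 and 73-78 of sequence1, the two CDS segments listed in the annotation.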
I just installed numpy and scikit-bio using pip3. If I import DNASequence in an interactive session, I get an error message:
>>> from skbio.sequence import DNASequence
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/skbio/__init__.py", line 64, in <module>
    from skbio.stats.distance import DistanceMatrix
  File "/usr/local/lib/python3.4/site-packages/skbio/stats/distance/__init__.py", line 293, in <module>
    from ._base import (DissimilarityMatrixError, DistanceMatrixError,
  File "/usr/local/lib/python3.4/site-packages/skbio/stats/distance/_base.py", line 11, in <module>
    from future.utils.six import StringIO, string_types
ImportError: No module named 'future.utils.six'
Running 'pip3 list' shows me that six 1.8.0 is installed. Even stranger, if I repeat the import statement, DNASequence loads correctly. Any idea what causes this behavior?
I'm running Mac OS X 10.9.5 (Mavericks) with Python 3.4.1 (installed via Homebrew).
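One way to narrow this down is to check which of the modules named in the traceback can actually be imported by the interpreter that pip3 installed into. The sketch below uses only the standard library. The helper name can_import is made up for illustration, and it diagnoses the problem rather than fixing it.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Module names taken from the traceback, plus the standalone six package.
for name in ("six", "future", "future.utils", "future.utils.six"):
    print(name, "->", "ok" if can_import(name) else "missing")
```

If six imports but future.utils.six does not, the mismatch would be in the installed future package rather than in six itself.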
When trying to install the scikit-bio toolkit via pip on Windows XP with Python 2.7.8 and Visual C++ 2008 Express Edition, the process is interrupted by VC with the following message:
cl : Command line error D8021 : invalid numeric argument '/Wno-error=declaration-after-statement'
About this error, the Microsoft Developer Network site only says:
invalid numeric argument 'number'
A number greater than 65,534 was specified as a numeric argument.
I haven't tried installing scikit-bio under Linux yet (Ubuntu 12.04 Precise), but my impression is that it just works (as things tend to on Linux).
Has anyone successfully installed scikit-bio under Windows (XP, 7, 8)? Any hints?
Thanks in advance!
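The option in the error message, -Wno-error=declaration-after-statement, is a GCC/Clang-style warning flag. MSVC's cl treats options beginning with /W as warning-level settings, which would explain why it complains about a numeric argument. Assuming the flag comes from the package's extension build settings (not confirmed here), a build script could choose flags per platform. The names below (GCC_ONLY_FLAGS, extra_compile_args_for) are hypothetical and not part of scikit-bio.

```python
import sys

# Flag copied from the error message; it is only meaningful to GCC-like compilers.
GCC_ONLY_FLAGS = ["-Wno-error=declaration-after-statement"]


def extra_compile_args_for(platform=sys.platform):
    """Return compiler flags considered safe for the given platform string."""
    if platform.startswith("win"):
        # Assumption: Windows builds use MSVC, which rejects GCC-style -W flags.
        return []
    return list(GCC_ONLY_FLAGS)


print(extra_compile_args_for())
```

Checking the platform is only a proxy for checking the compiler: a MinGW build on Windows would still accept the GCC flag, so a real build script may need to inspect the compiler type instead.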
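I'm trying to figure out how to implement Principal Coordinate Analysis with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives a different result every time while skbio's is always the same. Is there some degree of randomness in Multidimensional Scaling, and in Principal Coordinate Analysis in particular? I see that all the clusters are very similar, but why are they different? Am I implementing this correctly?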
Running Principal Coordinate Analysis with Scikit-bio (i.e. Skbio) always gives the same result:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
index = ["iris_%d" …
machine-learning linear-algebra multi-dimensional-scaling scikit-learn skbio
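On the randomness question: classical PCoA is an eigendecomposition of the double-centered squared distance matrix, so repeated runs on the same input give the same coordinates. sklearn.manifold.MDS instead runs the iterative SMACOF algorithm from a random starting configuration, so its output varies between runs unless random_state is fixed, and even then it can differ from PCoA by a rotation or reflection. The NumPy sketch below illustrates the classical procedure. The function name classical_pcoa is made up, and this is not skbio's implementation.

```python
import numpy as np


def classical_pcoa(dist, n_components=2):
    """Classical PCoA (Torgerson scaling) of a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh handles symmetric matrices and involves no random initialisation.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each eigenvector by the square root of its (non-negative) eigenvalue.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return coords, eigvals


# Four corners of a 3 x 4 rectangle as a toy data set.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
coords, eigvals = classical_pcoa(dist)
print(coords)
```

Because the toy distances are Euclidean distances between 2-D points, the two returned axes reproduce the pairwise distances exactly (up to floating-point error), although the orientation of the axes is arbitrary.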
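I'm looking at the attributes of skbio's PCoA method (listed below). I'm new to this API, and I want to get the eigenvectors and the original points projected onto the new axes, similar to .fit_transform of sklearn.decomposition.PCA, so I can create some PC_1 vs PC_2-style plots. I figured out how to get eigvals and proportion_explained, but features comes back as None.
Is this because it's in beta?
If there are any tutorials that use it, that would be much appreciated. I'm a huge fan of scikit-learn and would like to start using more of the scikit's products.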
| Attributes
| ----------
| short_method_name : str
| Abbreviated ordination method name.
| long_method_name : str
| Ordination method name.
| eigvals : pd.Series
| The resulting eigenvalues. The index corresponds to the ordination
| axis labels
| samples : pd.DataFrame
| The position of the samples …
machine-learning linear-algebra dimensionality-reduction scikits skbio
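Going by the attribute list above, the projected points appear to live in the samples DataFrame rather than in features, which plausibly stays None because a distance matrix carries no per-feature information. A minimal sketch of pulling out the first two axes follows. The helper names first_two_axes and run_pcoa are made up, and run_pcoa assumes scikit-bio's DistanceMatrix class and pcoa function are available in the installed version.

```python
import pandas as pd


def first_two_axes(ordination):
    """Return sample coordinates on the first two ordination axes."""
    coords = ordination.samples.iloc[:, :2].copy()
    coords.columns = ["PC_1", "PC_2"]
    return coords


def run_pcoa(distances, ids):
    """Run PCoA on a square distance array (requires scikit-bio)."""
    # Imported lazily so the helper above can be used without scikit-bio.
    from skbio import DistanceMatrix
    from skbio.stats.ordination import pcoa
    return pcoa(DistanceMatrix(distances, ids=ids))


# Example use, assuming `dist` is a symmetric distance array with zeros on
# the diagonal and `ids` is a list of sample names:
# first_two_axes(run_pcoa(dist, ids)).plot.scatter(x="PC_1", y="PC_2")
```

The returned DataFrame plays the role of the array that sklearn.decomposition.PCA.fit_transform would give, restricted to two columns.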
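I'm trying to read a fastq-formatted text file with scikit-bio.
Given that it's a fairly large file, the operations are very slow.
Ultimately, I'm trying to unpack the fastq file into a dictionary: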
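import skbio

f = 'Undetermined_S0_L001_I1_001.fastq'
# skbio.io.read yields the sequences one at a time
seqs = skbio.io.read(f, format='fastq')
seq_dic = {}
for seq in seqs:
    seq = str(seq)
    # count how often each distinct sequence occurs
    if seq in seq_dic:
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1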
Most of the time here is spent reading the file:
%%time
import itertools
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s
My understanding is that not verifying the sequences should improve the run time, but that doesn't seem to be the case:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
for seq in itertools.islice(seqs, 100000):
    seq
CPU …
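If only the counts of identical sequences are needed, one option is to skip the per-record parsing and read the file as plain text. The sketch below is not a scikit-bio API: it uses only the standard library, the function name count_fastq_sequences is made up, and it assumes strict four-line FASTQ records (no wrapped sequence or quality lines) with no validation at all.

```python
import io
from collections import Counter


def count_fastq_sequences(handle):
    """Count identical sequence lines, assuming strict 4-line FASTQ records."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        # The second line of every 4-line record holds the bases.
        if line_number % 4 == 1:
            counts[line.strip()] += 1
    return counts


# Small in-memory example; for a real file, pass an open file handle instead.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
print(count_fastq_sequences(example))
```

The trade-off is that quality scores and malformed records are ignored, so this suits counting barcodes or index reads but not general FASTQ processing.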