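I want to generate a one-hot encoding for a set of DNA sequences. For example, the sequence ACGTCCA can be represented in transposed form as shown below. However, the code below produces the one-hot encoding in horizontal form, and I want it in vertical form. Can anyone help me?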
ACGTCCA
1000001 - A
0100110 - C
0010000 - G
0001000 - T
Example code:
from sklearn.preprocessing import OneHotEncoder
import itertools
# two example sequences
seqs = ["ACGTCCA","CGGATTG"]
# split sequences to tokens
tokens_seqs = [list(seq) for seq in seqs]  # one token per nucleotide
# convert list of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_seqs)
word_to_id = {token: idx for …
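One possible way to get the vertical layout asked about above, as a minimal sketch in plain Python rather than OneHotEncoder. The helper name one_hot_vertical and the fixed row order A, C, G, T are assumptions made for illustration, not part of the original code.

```python
ALPHABET = "ACGT"  # assumed row order: A, C, G, T


def one_hot_vertical(seq, alphabet=ALPHABET):
    """Return one row per alphabet letter, one column per sequence position."""
    # Row i holds a 1 wherever alphabet[i] occurs in seq, and 0 elsewhere.
    return [[1 if base == letter else 0 for base in seq] for letter in alphabet]


encoded = one_hot_vertical("ACGTCCA")
for letter, row in zip(ALPHABET, encoded):
    print("".join(map(str, row)), "-", letter)
# 1000001 - A
# 0100110 - C
# 0010000 - G
# 0001000 - T
```

For several sequences, the same helper can be applied to each one in turn, for example [one_hot_vertical(s) for s in seqs].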
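I want to normalize the values below horizontally (per row) rather than vertically (per column). The code reads the CSV file shown after the code and writes a new CSV file with the normalized values. How can I make it normalize horizontally? The code is given below:
Code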
# norm_code.py
# normalization = (x - min) / (max - min)
import numpy as np
from sklearn import preprocessing
all_data = np.loadtxt(open("c:/Python27/test.csv", "r"),
                      delimiter=",",
                      skiprows=0,
                      dtype=np.float64)
x = all_data[:]
print('total number of samples (rows):', x.shape[0])
print('total number of features (columns):', x.shape[1])
# MinMaxScaler scales each column (feature) independently
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(x)
X_minmax = minmax_scale.transform(x)
with open('test_norm.csv', "w") as f:
    f.write("\n".join(",".join(map(str, row)) for row in X_minmax))
test.csv
1 2 0 4 3
3 2 1 1 0
2 1 1 0 1
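A minimal sketch of row-wise min-max scaling done directly in NumPy, using the sample values from test.csv. Mapping a constant row to all zeros is an assumption chosen here to avoid dividing by zero, and the variable names are illustrative.

```python
import numpy as np

# Sample values from test.csv
x = np.array([[1, 2, 0, 4, 3],
              [3, 2, 1, 1, 0],
              [2, 1, 1, 0, 1]], dtype=np.float64)

# axis=1 reduces along each row; keepdims keeps shape (n_rows, 1) for broadcasting
row_min = x.min(axis=1, keepdims=True)
row_max = x.max(axis=1, keepdims=True)
row_range = row_max - row_min

# Assumption: a constant row (max == min) becomes all zeros instead of NaN
row_range[row_range == 0] = 1.0

x_rownorm = (x - row_min) / row_range
print(x_rownorm)
```

Because MinMaxScaler works per column, another option is to transpose the data before scaling and transpose the result back.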
I want to divide the items of one nested list by the corresponding items of another.
a = [[1, 0, 2], [0, 0, 0], [1], [1]]
b = [[5, 6, 4], [6, 6, 6], [3], [3]]
How can I divide a by b to get this output:
c = [[0.2, 0, 0.5], [0, 0, 0], [0.333], [0.333]]
Can anyone help me?
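A minimal sketch of the element-wise division using nested zip, assuming Python 3 (where / is true division) and assuming b contains no zeros. The rounding step is only there to match the 0.333 shown in the question.

```python
a = [[1, 0, 2], [0, 0, 0], [1], [1]]
b = [[5, 6, 4], [6, 6, 6], [3], [3]]

# Pair up the sublists, then pair up the items inside each pair of sublists
c = [[x / y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Round to three decimals for display only
c_rounded = [[round(v, 3) for v in row] for row in c]
print(c_rounded)
# [[0.2, 0.0, 0.5], [0.0, 0.0, 0.0], [0.333], [0.333]]
```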
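I have some strings and I want to remove the last character of each one. When I try the code below, it removes my second line instead of the last character of each string. Here is my code:
Code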
with open('test.txt') as file:
    seqs = file.read().splitlines()
seqs = seqs[:-1]
test.txt
ABCABC
XYZXYZ
Output
ABCABC
Desired output
ABCAB
XYZXY
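A minimal sketch of the difference: slicing the list with [:-1] drops its last element (the last line), whereas slicing each string drops that string's last character. The list literal below stands in for file.read().splitlines() so the sketch runs without test.txt.

```python
# Stand-in for the lines read from test.txt
seqs = ["ABCABC", "XYZXYZ"]

# Slice each string, not the list that holds them
trimmed = [s[:-1] for s in seqs]
print("\n".join(trimmed))
# ABCAB
# XYZXY
```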
How can I count the occurrences of a given integer in a list that contains several arrays? For example, I want to count the occurrences of the value 2.
import numpy as np
a = [np.array([2, 2, 1, 2]), np.array([1, 3])]
Expected output:
[3, 0]
Can anyone help me?
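A minimal sketch of one way to count the value per array: compare each array against the target and count the True entries with np.count_nonzero. The name target is an illustrative assumption.

```python
import numpy as np

a = [np.array([2, 2, 1, 2]), np.array([1, 3])]
target = 2

# arr == target yields a boolean array; count_nonzero counts the True entries
counts = [int(np.count_nonzero(arr == target)) for arr in a]
print(counts)
# [3, 0]
```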
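I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I extract sequences at random? For example, I want to randomly select 2 sequences from the full set. The tools available extract by percentage rather than by number of sequences. Can anyone help me?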
A.fasta
>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
Expected output
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
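A minimal sketch of picking a fixed number of records with random.sample from the standard library. The helper read_fasta is a hypothetical parser written for this example, and the FASTA content is held in a string so the sketch runs without A.fasta. The selection is random, so the records printed will differ between runs.

```python
import random

fasta_text = """>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
"""


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])   # start a new record
        elif records:
            records[-1][1] += line       # join multi-line sequences
    return [(header, seq) for header, seq in records]


records = read_fasta(fasta_text)
picked = random.sample(records, 2)  # 2 distinct records, without replacement

for header, seq in picked:
    print(header)
    print(seq)
```

For a real file, the text could come from open('A.fasta').read(). Calling random.seed with a fixed value beforehand makes the selection repeatable.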
I have this code, which generates the following output:
result = []
with open('fileA.txt') as f:
    for line in f:
        if line.startswith('chr'):
            label = line.strip()
        elif line[0] == ' ':
            # short sequence
            length = len(line.strip())
            # find the index of the beginning of the short sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    short_index = i
                    break
        elif line[0].isdigit():
            # long sequence
            n = line.split(' ')[0]
            # find the index of the beginning of the long sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    long_index = i …
I would like to know whether we can read an array row by row. For example:
array([[ 0.28, 0.22, 0.23, 0.27],
[ 0.12, 0.29, 0.34, 0.21],
[ 0.44, 0.56, 0.51, 0.65]])
Read the first row as an array to perform some operation on it, then move on to the second row:
array([0.28,0.22,0.23,0.27])
The array above is produced by these two lines of code:
from numpy import genfromtxt
single=genfromtxt('single.csv',delimiter=',')
single.csv
0.28, 0.22, 0.23, 0.27
0.12, 0.29, 0.34, 0.21
0.44, 0.56, 0.51, 0.65
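Using readlines() seems to produce a list rather than an array. In my case, I am working with a CSV file. I am trying to use the values row by row rather than all at once, to avoid a memory error. Can anyone help me?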
with open('single.csv') as single:
    single = single.readlines()
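A minimal sketch of reading one row at a time as a NumPy array, so that only a single row is held in memory. The generator iter_rows is a hypothetical helper, the io.StringIO object stands in for the open single.csv file, and summing each row is just a placeholder for the real operation. Python 3 is assumed.

```python
import io
import numpy as np

# In-memory stand-in for single.csv, so the sketch runs without the file
csv_file = io.StringIO(
    "0.28, 0.22, 0.23, 0.27\n"
    "0.12, 0.29, 0.34, 0.21\n"
    "0.44, 0.56, 0.51, 0.65\n"
)


def iter_rows(handle, delimiter=","):
    """Yield one 1-D float array per non-empty line of an open text file."""
    for line in handle:              # file objects are read lazily, line by line
        if line.strip():
            yield np.array([float(v) for v in line.split(delimiter)])


row_sums = []
for row in iter_rows(csv_file):
    row_sums.append(row.sum())       # placeholder for "some operation"

print(row_sums)
```

With the real file, the loop would sit inside with open('single.csv') as handle: and iterate over iter_rows(handle). If the whole array already fits in memory, iterating directly over the 2-D array returned by genfromtxt also yields one row array at a time.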