I wrote a confusion matrix calculation in Python:
def conf_mat(prob_arr, input_arr):
    # confusion matrix
    conf_arr = [[0, 0], [0, 0]]
    for i in range(len(prob_arr)):
        if int(input_arr[i]) == 1:
            if float(prob_arr[i]) < 0.5:
                conf_arr[0][1] = conf_arr[0][1] + 1
            else:
                conf_arr[0][0] = conf_arr[0][0] + 1
        elif int(input_arr[i]) == 2:
            if float(prob_arr[i]) >= 0.5:
                conf_arr[1][0] = conf_arr[1][0] + 1
            else:
                conf_arr[1][1] = conf_arr[1][1] + 1
    accuracy = float(conf_arr[0][0] + conf_arr[1][1])/(len(input_arr))
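For comparison, here is a minimal sketch of the same counting logic written with zip and a single index computation. It assumes, as the code above implies, that each probability is the score for class 1 and that labels are coded 1 and 2. The function names, the threshold parameter, and the sample values are hypothetical. Unlike the original, the accuracy here is divided by the number of counted samples rather than by len(input_arr), so labels outside the two classes do not affect it.

```python
def confusion_matrix(probs, labels, threshold=0.5, positive=1, negative=2):
    """Return [[a, b], [c, d]] using the same cell layout as conf_arr above.

    Row 0 holds samples labelled `positive`, row 1 samples labelled `negative`.
    Column 0 holds samples scored >= threshold, column 1 the rest.
    """
    matrix = [[0, 0], [0, 0]]
    for prob, label in zip(probs, labels):
        if int(label) == positive:
            row = 0
        elif int(label) == negative:
            row = 1
        else:
            continue  # labels outside the two classes are skipped
        col = 0 if float(prob) >= threshold else 1
        matrix[row][col] += 1
    return matrix


def accuracy(matrix):
    """Fraction of counted samples that lie on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total if total else 0.0


# Hypothetical sample values, not taken from the data set in the question.
example = confusion_matrix([0.9, 0.2, 0.7, 0.1], [1, 1, 2, 2])
print(example, accuracy(example))
```

Returning the matrix instead of computing everything inside one function makes it easier to check individual cells.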
prob_arr is the array returned by my classification code; a sample array looks like this:
[1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975, 4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.70943426182688163, 1.0, 1.0, 1.0, 1.0]
input_arr holds the original class labels of the data set and looks like this:
[2, 1, …

I have a number of CSV files containing columns such as gender, age, and diagnosis.
Currently, they are encoded as follows:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
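The goal is to convert this data into the target format below, with one row per ID and one 0/1 column per distinct value.
Side note: if possible, it would also be nice to prepend the original column name to each new column name, e.g. "age_42" or "gender_female".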
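The transformation described here is a one-hot encoding followed by a merge of rows that share an ID. Below is a minimal sketch using only the standard library, assuming the CSV layout shown above (comma-separated, with a space after each comma). The function name one_hot_by_id is hypothetical, and the new columns use the prefixed names from the side note.

```python
import csv
import io


def one_hot_by_id(rows, id_column="ID"):
    """Merge rows that share an ID into one record of 0/1 indicator columns."""
    values = {}   # source column -> distinct values, in first-seen order
    present = {}  # ID -> set of (column, value) pairs seen for that ID
    for row in rows:
        pairs = present.setdefault(row[id_column], set())
        for name, value in row.items():
            if name == id_column:
                continue
            if value not in values.setdefault(name, []):
                values[name].append(value)
            pairs.add((name, value))
    # Indicator columns are grouped by source column, e.g. gender_male.
    keys = [(name, value) for name, vals in values.items() for value in vals]
    header = [id_column] + [f"{name}_{value}" for name, value in keys]
    table = [[row_id] + [int(key in pairs) for key in keys]
             for row_id, pairs in present.items()]
    return header, table


raw = """ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
"""

# skipinitialspace drops the blank that follows each comma in this layout.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
header, table = one_hot_by_id(reader)
print(", ".join(header))
for record in table:
    print(", ".join(str(cell) for cell in record))
```

With pandas, pandas.get_dummies followed by a group-by on ID is a common alternative. The version above keeps IDs as strings and orders columns by first appearance.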
Target format:
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
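4, 0, …

I am running a SARIMAX model but have run into a problem specifying exogenous variables. In the first code block (below), I specify a single exogenous variable, lesdata['LESpost'], and the model runs without problems. However, when I add a second exogenous variable, I get an error message (see the stack trace).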
ar = (1,0,1) # AR(1 3)
ma = (0) # No MA terms
mod1 = sm.tsa.statespace.SARIMAX(lesdata['emadm'], exog= (lesdata['LESpost'],lesdata['QOF']), trend='c', order=(ar,0,ma), mle_regression=True)
Traceback (most recent call last):
  File "<ipython-input-129-d1300aeaeffc>", line 4, in <module>
    mle_regression=True)
  File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\statespace\sarimax.py", line 510, in __init__
    endog, exog=exog, k_states=k_states, k_posdef=k_posdef, **kwargs
  File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py", line 84, in __init__
    missing='none')
  File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 43, in __init__
    super(TimeSeriesModel, self).__init__(endog, exog, missing=missing)
  File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 212, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
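  File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 63, …

I have a script that takes a file listing query IDs and pulls the organism and sequence for each one from UniProt. The code works, but it is very slow. I want to run roughly 4 million sequences through it, yet parsing 100 sequences takes about 5 minutes: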
real 5m32.452s
user 0m0.651s
sys 0m0.135s
The code uses Python's requests module. I read online that I could use the .session() attribute, but when I try this I get the following error:
Traceback (most recent call last):
  File "retrieve.py", line 14, in <module>
    result = session.get(baseURL, payload)
TypeError: get() takes exactly 2 arguments (3 given)
The code is listed here:
import requests
baseURL = 'http://www.uniprot.org/uniprot/'
sample = open('sample.txt','r')
out = open('out','w')
for line in sample:
    query = line.strip()
    payload = {
        'query': query,
        'format':'tab',
        'columns': 'id, entry_name, organism, sequence'
    }
    result = requests.get(baseURL, payload)
    if result.ok:
        out.write(query + '\t' + result.text[41:] + '\n')
Example of the input format:
EDP09046
ONI31767
ENSFALT00000002630
EAS32469 …
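
The timing above shows about 5.5 minutes of wall-clock time against under a second of CPU time, which suggests the script spends nearly all of its time waiting on the network. One way to overlap that waiting is a thread pool, sketched below. The names fetch_all and fake_fetch and the max_workers value are hypothetical. In real use the stand-in would be replaced by a function that performs the HTTP request, for example through a requests.Session. Session.get accepts only the URL positionally, so the query parameters have to be passed by keyword, as session.get(baseURL, params=payload).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(ids, fetch_one, max_workers=8):
    """Call fetch_one(query_id) for every ID concurrently, keeping input order."""
    ids = list(ids)  # materialise so the IDs can be paired with results below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(zip(ids, pool.map(fetch_one, ids)))


def fake_fetch(query_id):
    # Stand-in for the network call so the sketch runs offline (assumption).
    # A real version might return session.get(baseURL, params=payload).text
    return "organism-for-" + query_id


results = fetch_all(["EDP09046", "ONI31767", "ENSFALT00000002630"], fake_fetch)
for query_id, text in results:
    print(query_id + "\t" + text)
```

Passing the fetch function in as an argument keeps the concurrency logic testable without network access. The worker count should stay modest to respect the server's limits.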