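I have a dataset of about 300 points with 32 distinct labels, and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.
My code looks like this: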
import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
...
#get data (x, y, labels)
...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)
svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    # Pipeline steps need estimator instances, not the class itself
    ("svr", LinearSVR()),
])
search_params = dict(
    svr__C = C_space,
    svr__epsilon = epsilon_space
)
kfold = LabelKFold(labels, 5)
svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)
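train_space = …

I have data with a different weight for each sample. In my application, it is important to take these weights into account both when estimating the model and when comparing alternative models.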
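I am using sklearn to estimate the models and to compare alternative hyperparameter choices. However, this unit test shows that GridSearchCV does not apply sample_weights when estimating scores.
Is there a way to have sklearn use sample_weight when scoring the model?
Unit test: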
from __future__ import division
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, RepeatedKFold
def grid_cv(X_in, y_in, w_in, cv, max_features_grid, use_weighting):
    out_results = dict()
    for k in max_features_grid:
        clf = RandomForestClassifier(n_estimators=256,
                                     criterion="entropy",
                                     warm_start=False,
                                     n_jobs=-1,
                                     random_state=RANDOM_STATE,
                                     max_features=k)
        for train_ndx, test_ndx in cv.split(X=X_in, y=y_in):
            X_train = X_in[train_ndx, :]
            y_train = y_in[train_ndx]
            w_train = w_in[train_ndx]
            y_test = y_in[test_ndx] …

I consider myself quite familiar with data.table, but I have run into a strange error with the setkeyv function that I cannot resolve.
The error is very simple:
keycols<-c("A", "B")
DT <- data.table(A=1:10, B=91:90)
setkeyv(DT, keycols)
# Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
# 4 arguments passed to .Internal(nchar) which requires 3
For reference, here is my sessionInfo():
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached …

I found some puzzling behavior in the kernlab package: estimating mathematically identical SVMs produces different results in software.
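For simplicity, this code snippet just takes the iris data and turns it into a binary classification problem. As you can see, I use a linear kernel in both SVMs.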
library(kernlab)
library(e1071)
data(iris)
x <- as.matrix(iris[, 1:4])
y <- as.factor(ifelse(iris[, 5] == 'versicolor', 1, -1))
C <- 5.278031643091578
svm1 <- ksvm(x = x, y = y, scaled = FALSE, kernel = 'vanilladot', C = C)
K <- kernelMatrix(vanilladot(), x)
svm2 <- ksvm(x = K, y = y, C = C, kernel = 'matrix')
svm3 <- svm(x = x, y = y, scale = FALSE, kernel = 'linear', cost = C)
However, svm1 and …

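In my table1 I have a varchar field in which I store a comma-separated list of ids from another table, table2 (id is INT UNSIGNED AUTO_INCREMENT).
For example: 1,3,5,12,90
The IDs must also not be repeated.
I need to check whether a string (coming from outside) conforms to this rule. For example, I need to check $_POST['id_list'].
Data consistency is not important for now (for example, the value may be inserted without checking whether these IDs actually exist in table2).
Any suggestions would be helpful.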
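
One way to express the rule above is a format check followed by a duplicate check. The sketch below is in Python rather than PHP, purely to illustrate the logic. The helper name is_valid_id_list and the constant names are hypothetical, and it assumes that ids start at 1 (as AUTO_INCREMENT values normally do), that leading zeros and whitespace should be rejected, and that the upper bound is the usual 32-bit unsigned maximum.

```python
import re

# Assumption: ids are positive integers written without leading zeros,
# separated by single commas with no surrounding whitespace.
ID_LIST_RE = re.compile(r"[1-9][0-9]*(?:,[1-9][0-9]*)*")

# Assumption: largest value that fits a 32-bit unsigned integer column.
MAX_UINT32 = 4294967295


def is_valid_id_list(raw: str) -> bool:
    """Return True if raw looks like '1,3,5,12,90' with no repeated ids."""
    # Step 1: the whole string must match the comma-separated format.
    if not ID_LIST_RE.fullmatch(raw):
        return False
    ids = [int(part) for part in raw.split(",")]
    # Step 2: every id must fit in the assumed column range.
    if any(value > MAX_UINT32 for value in ids):
        return False
    # Step 3: no id may appear more than once.
    return len(ids) == len(set(ids))
```

Under these assumptions, is_valid_id_list("1,3,5,12,90") accepts the example from the question, while "1,3,3", "1,,3" and "1, 3" are rejected. The same two steps (a full-string pattern match, then comparing the count of ids with the count of unique ids) should carry over to PHP, though that translation is not shown here.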