I want to use cross_val_score to validate my OneClassSVM on the training set. Doing so produces the error message shown below.
Is it because OneClassSVM is an unsupervised algorithm and no "y" vector is passed to cross_val_score that the call fails?
clf = svm.OneClassSVM(nu=_nu, kernel=_kernel, gamma=_gamma, random_state=_random_state, cache_size=_cache_size)
scores = cross_val_score(estimator=clf, X=X_scaled, scoring='accuracy', cv=5)
PS: I realize that the "y" vector is optional in cross_val_score. But the error still makes me assume that the missing "y" vector is the cause.
File "/usr/local/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
for train, test in cv_iter)
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
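File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__ …
I am trying to do feature selection with RFECV, but it gives different results every time. Does cross-validation split the samples X into random chunks or into sequential, deterministic chunks?
Also, why do the scores in grid_scores_ differ from score(X, y)? Why are the scores sometimes negative?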
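How do I get the coefficients from a cross-validated model? When I run cross-validation I get the scores of the CV models, but how can I get the coefficients?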
#Split into training and testing
x_train, x_test, y_train, y_test = train_test_split(samples, scores, test_size = 0.30, train_size = 0.70)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, x_train, y_train, cv=5)
scores
I want to print the coefficient associated with each feature:
#Print coefficients of features
for i in range(0, nFeatures):
    print(samples.columns[i], ":", coef[0][i])
This version, without cross-validation, does provide the coefficients:
#Create SVM model using a linear kernel
model = svm.SVC(kernel='linear', C=C).fit(x_train, y_train)
coef = model.coef_
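One way this might be approached, as a sketch: cross_validate accepts return_estimator=True, which returns the model fitted on each fold, and each linear SVC then exposes its own coef_. The synthetic data from make_classification and the names X_demo, y_demo and fold_coefs are assumptions for illustration; x_train and y_train from the question would take their place.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic binary data (assumed); replace with x_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

clf = svm.SVC(kernel='linear', C=1)
cv_results = cross_validate(clf, X_demo, y_demo, cv=5, return_estimator=True)

# One fitted model per fold; for binary data each coef_ has shape (1, n_features).
fold_coefs = np.vstack([est.coef_ for est in cv_results['estimator']])

print(cv_results['test_score'])   # per-fold accuracy
print(fold_coefs.shape)           # one row of coefficients per fold
print(fold_coefs.mean(axis=0))    # average coefficient per feature across folds
```

There is no single "cross-validated" coefficient vector: each fold yields its own model, so the per-fold values can be inspected or averaged, or a final model can be refitted on all the training data.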
I can't manually reproduce LightGBM's CV score.
Here is an MCVE:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
folds = KFold(5, random_state=42)
params = {'random_state': 42}
results = lgb.cv(params, lgb.Dataset(X_train, y_train), folds=folds, num_boost_round=1000, early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])
clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))
val_scores = []
for train_idx, val_idx …
python machine-learning scikit-learn cross-validation lightgbm
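I recently learned about LSTMs for time-series prediction from https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: instead of training the recurrent neural network on the complete sequence of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training data.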
def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)
        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)
        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random …
I am new to implementing machine learning in Python and am currently trying KNN classification by following a YouTube tutorial. Here is the code.
import numpy as np
#from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
import pandas as pd
df=pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
X=np.array(df.drop(['class'], axis=1))
y=np.array(df['class'])
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
I get the following error:
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
AttributeError: 'function' object has no attribute 'train_test_split'
I tried importing train_test_split with
from sklearn.model_selection import train_test_split
But then I got the same error. Any help is appreciated. Thanks!
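A minimal sketch of the import that is expected to work here, assuming a scikit-learn version that ships sklearn.model_selection: train_test_split is itself a function in that module, while cross_validate is a different function and so has no train_test_split attribute. The random arrays below stand in for the breast-cancer data and are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumed): 100 samples, 9 features, class labels 2 or 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = rng.choice([2, 4], size=100)

# Call the function directly, not as an attribute of cross_validate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)
```

If the same AttributeError still appears after changing the import, the old call cross_validate.train_test_split(...) is probably still present in the script and needs to be replaced by the direct call shown here.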
I want to cross-validate my time-series data, splitting by the year of the timestamp.
This is the data, in a pandas DataFrame:
mock_data
timestamp                counts
'2015-01-01 03:45:14'    4
...
'2016-01-01 13:02:14'    12
...
'2017-01-01 09:56:54'    6
...
'2018-01-01 13:02:14'    8
...
'2019-01-01 11:39:40'    24
...
'2020-01-01 04:02:03'    30
mock_data.dtypes
timestamp    object
counts       int64
Looking at scikit-learn's TimeSeriesSplit(), it does not seem possible to specify the n_splits part by year. Is there another way to create consecutive training sets that results in the following train-test splits?
tscv = newTimeSeriesSplit(n_splits=5, by='year')
>>> print(tscv)
newTimeSeriesSplit(max_train_size=None, n_splits=5, by='year')
>>> for train_index, test_index in tscv.split(mock_data):
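...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index] …
I separate my features into X and y, then preprocess my train and test data after splitting them with k-fold cross-validation. After that, I fit the training data to my random forest regression model and compute the confidence score. Why do I preprocess after splitting? Because people tell me it is more correct to do it that way, and I have kept to that principle for the sake of my model's performance.
This is my first time using KFold cross-validation. My model's score looked overfitted, and I thought I could fix that with cross-validation. I am still confused about how to use it: I have read the documentation and some articles, but I don't really understand how to apply it to my model. I tried anyway, and my model still overfits. Whether I use a train-test split or cross-validation, my model's score is still 0.999. I don't know what my mistake is, since I am new to this method, but I suspect I did it wrong and that is why it doesn't fix the overfitting. Please tell me what is wrong with my code and how to fix it.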
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'], axis=1))
y = np.array(avo_sales.TotalBags)
# X_train, X_test, …
python machine-learning python-3.x scikit-learn cross-validation
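For the k-fold question above, one common way to keep preprocessing inside each fold is sketched here under assumptions: wrap the preprocessing steps and the regressor in a Pipeline and pass it to cross_val_score, so the imputer is fitted on each training fold only. The synthetic data from make_regression, the injected missing values and the choice of SimpleImputer are illustrative and not taken from the avocado dataset; the sketch does not diagnose why the original score is 0.999.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumed) with a few missing values injected.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X[::25, 0] = np.nan

# Preprocessing and model in one estimator, refitted from scratch on every fold.
model = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # R^2 per fold for a regressor

print(scores)
print(scores.mean())
```

Cross-validation measures how well a model generalises; it does not remove overfitting by itself. A score close to 1.0 on held-out folds can also be a sign that some feature nearly determines the target, which is worth checking before changing the validation scheme.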
I have a classification problem, and I want to use the roc_auc value with cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Note that I chose roc_auc specifically because it supports both binary and multiclass classification, as described at: https://scikit-learn.org/stable/modules/model_evaluation.html
I also have a binary classification dataset. Please let me know how to resolve this error.
I am happy to provide more details if needed.
python classification machine-learning scikit-learn cross-validation
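A possible workaround for the multiclass case, as a sketch: recent scikit-learn releases (0.22 and later, to my knowledge) register the scorer names 'roc_auc_ovr' and 'roc_auc_ovo', which average one-vs-rest or one-vs-one AUC over the classes, whereas plain 'roc_auc' targets binary labels. The code reuses the iris setup from the question; n_estimators=50 is an assumed value chosen only to keep the run short.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features, as in the question
y = iris.target       # three classes

clf = RandomForestClassifier(n_estimators=50, random_state=0,
                             class_weight="balanced")

# 'roc_auc_ovr' uses predict_proba and averages one-vs-rest AUC over classes.
results = cross_validate(clf, X, y, cv=10,
                         scoring=('accuracy', 'roc_auc_ovr'))

print(results['test_accuracy'].mean())
print(results['test_roc_auc_ovr'].mean())
```

For the binary dataset mentioned in the question, plain scoring='roc_auc' should work as it is.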
I am using the following code:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X=X,y=y, cv =10)
accuracies.mean()
Is this mean the RMSE or the MSE?
Edit: I am using random forest regression. The scikit-learn documentation describes it as accuracy. How do I relate it to RMSE or MSE?
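As a sketch of how the two relate: when scoring is not given, cross_val_score falls back to the estimator's own score method, which for scikit-learn regressors such as RandomForestRegressor is R² rather than MSE or RMSE. To get MSE, the 'neg_mean_squared_error' scorer can be requested and its sign flipped. The synthetic data and n_estimators=50 below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed); replace with the real X and y.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
regressor = RandomForestRegressor(n_estimators=50, random_state=0)

# Default scoring for a regressor: R^2 per fold.
r2_scores = cross_val_score(estimator=regressor, X=X, y=y, cv=10)

# Explicit MSE: the scorer returns negated values so that higher is better.
neg_mse = cross_val_score(estimator=regressor, X=X, y=y, cv=10,
                          scoring='neg_mean_squared_error')
mse_scores = -neg_mse
rmse_scores = np.sqrt(mse_scores)  # per-fold RMSE

print(r2_scores.mean())
print(mse_scores.mean(), rmse_scores.mean())
```

Note that the mean of the per-fold RMSE values is generally not equal to the square root of the mean MSE, so it is worth stating which of the two is being reported.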