kan*_*aba 5 python knn pca scikit-learn kaggle
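I am trying to improve on the 96.5% sklearn kNN prediction score for the Kaggle digit recognizer tutorial by adding PCA to the algorithm, but the new kNN predictions based on the PCA output are terrible, around 23%.
The full code is below. I would be grateful if you could point out where I am going wrong.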
import pandas as pd
import numpy as np
import pylab as pl
import os as os
from sklearn import metrics
%pylab inline
os.chdir("/users/******/desktop/python")
traindata=pd.read_csv("train.csv")
traindata=np.array(traindata)
traindata=traindata.astype(float)
X,y=traindata[:,1:],traindata[:,0]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.25, random_state=33)
#scale & PCA train data
from sklearn import preprocessing
from sklearn.decomposition import PCA
X_train_scaled = preprocessing.scale(X_train)
estimator = PCA(n_components=350)
X_train_pca = estimator.fit_transform(X_train_scaled)
# sum(estimator.explained_variance_ratio_) = 0.96
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(X_train_pca,y_train)
# scale & PCA test data
X_test_scaled=preprocessing.scale(X_test)
X_test_pca=estimator.fit_transform(X_test_scaled)
y_test_pred=neigh.predict(X_test_pca)
# print metrics.accuracy_score(y_test, y_test_pred) = 0.23
# print metrics.classification_report(y_test, y_test_pred)
YS-*_*S-L 19
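When processing the test data, you call fit_transform(X_test), which actually recomputes a separate PCA transformation on the test data. You should call transform(X_test) instead, so that the test data goes through the same transformation as the training data.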
That part of the code would look like this (thanks to ogrisel for the whiten tip):
estimator = PCA(n_components=350, whiten=True)
X_train_pca = estimator.fit_transform(X_train)
X_test_pca = estimator.transform(X_test)
Give it a try and see if it helps.
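One way to make this mistake hard to repeat is to wrap PCA and the classifier in a scikit-learn pipeline, so the PCA is fitted only on the training data and merely applied to the test data. The following is a minimal sketch, not the asker's exact setup: it assumes scikit-learn's bundled 8x8 digits dataset as a stand-in for Kaggle's train.csv, and n_components=30 is an illustrative choice because that dataset has only 64 features.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: the bundled digits set stands in for Kaggle's train.csv.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)

# Assumption: 30 components, chosen for this 64-feature stand-in dataset.
model = make_pipeline(
    PCA(n_components=30, whiten=True),
    KNeighborsClassifier(n_neighbors=6),
)

# fit() learns the PCA projection from the training data only.
model.fit(X_train, y_train)

# predict() reuses that projection via transform(); nothing is refitted.
y_test_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
print(accuracy)
```

Because the pipeline only ever calls transform() on the PCA step at prediction time, the test data is projected onto the components learned from the training data.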