My code runs fine on smaller test samples, e.g. 10000 rows of X_train, y_train. When I call it on millions of rows, I get the error below. Is this a bug in the package, or is there something I can do differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's externals package on my Dropbox.
The test script is:
import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp

def main():
    print("Started.")
    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000
    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)
    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, …
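For anyone hitting the same wall: a commonly suggested workaround for memory errors when grid-searching over a very large array is to store the features as float32 instead of float64 (halving memory) and to cap joblib's `pre_dispatch` so the workers are not all fed batches up front. The sketch below is an assumption-laden adaptation, not the original script: it uses the modern `sklearn.model_selection.GridSearchCV` API (the `grid_search` module from the question was later removed), `max_iter` instead of the old `n_iter` parameter, and a deliberately tiny dataset so it runs quickly.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Tiny stand-in for the real data; the real fix is the dtype and pre_dispatch,
# not the size used here.
n_samples, n_features = 2000, 50
rng = np.random.RandomState(0)
X_train = rng.randn(n_samples, n_features).astype(np.float32)  # float32 halves memory
y_train = rng.randint(0, 2, size=n_samples)

model = SGDClassifier(penalty='elasticnet', max_iter=10, shuffle=True)

# Reduced grid for the demo; the question's grid sweeps alpha over 10^-1..10^-6
# and several l1_ratio values.
param_grid = {'alpha': 10.0 ** -np.arange(1, 4),
              'l1_ratio': [.15, .5, .9]}

# pre_dispatch='2*n_jobs' limits how many parameter-fit jobs joblib dispatches
# at once, which bounds the number of in-flight data copies.
gs = GridSearchCV(model, param_grid, cv=3, n_jobs=2, pre_dispatch='2*n_jobs')
gs.fit(X_train, y_train)
print(gs.best_params_)
```

On genuinely huge data, memory-mapping the array (e.g. `np.memmap` or joblib's automatic memmapping for large inputs) avoids pickling a full copy to every worker process.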