NaNs suddenly appearing for sklearn KFolds

Question

Whi*_*hia 1 python machine-learning scikit-learn cross-validation

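I'm trying to run cross-validation on my data set. The data appears to be clean, but when I run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?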

y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X,y,test_size=.30, random_state=4444)

This is what my X data looks like before KFold:

   variation       length       tempo
0   0.005144  1183.148118  135.999178
1   0.002595   720.165442  117.453835
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
5   0.001620   560.365714  151.999081
6   0.002513   763.377778  107.666016
7   0.009262   502.083628   99.384014
8   0.000610   500.017052  143.554688
9   0.000733   269.001723  117.453835

My y data looks like this: array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)

Now when I try to do the cross-validation:

kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)

for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)

When I tried to run this, I got the following error, so I added the print statement.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is what the print statement shows. Some of the data has become NaN:

   variation       length       tempo
0        NaN          NaN         NaN
1        NaN          NaN         NaN
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178

I'm sure I'm doing something wrong. Any ideas? As always, thank you very much!

Answer 1

lej*_*lot 15

To fix it, index your pandas DataFrame with .iloc instead of .ix:

for train_index, val_index in kf:
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)

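Indexing with .ix is usually equivalent to using .loc, which is label-based indexing, not position-based. .loc works on X, which has a regular integer index (labels 0 to 9), but after the split (train_test_split shuffles the rows) this regularity no longer holds, and you get something like: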

        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733
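
To make the label/position distinction concrete, here is a minimal sketch. It assumes a small hypothetical frame built from the first three rows shown above. It is not the asker's actual data pipeline.

```python
import pandas as pd

# Hypothetical frame mirroring the shuffled X_train above: the index labels
# are the original row numbers, so they are no longer 0..n-1.
X_train = pd.DataFrame(
    {
        "length": [509.931973, 397.500952, 502.083628],
        "tempo": [135.999178, 112.347147, 99.384014],
        "variation": [0.001631, 0.008146, 0.009262],
    },
    index=[4, 2, 7],
)

# Position-based lookup: the row at position 0, whatever its label is.
first_by_position = X_train.iloc[0]

# Label-based lookup: the row whose index label is 4 (the same row here).
row_with_label_4 = X_train.loc[4]

print(first_by_position.name)  # label of the row at position 0
print(X_train.iloc[1].name)    # label of the row at position 1
print(0 in X_train.index)      # is there a row labelled 0?
```

Positions always run from 0 to n-1, while the labels are whatever survived the split. The indices produced by KFold are positions, which is why they have to go through .iloc.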

Now you no longer have the labels 0 or 1, so if you do

X_train.loc[1]

you will get an exception

KeyError: 'the label [1] is not in the [index]'

However, if you request multiple labels and at least one of them exists, pandas fails silently. So if you do

X_train.loc[[1, 4]]

you will get

       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631

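As expected, 1 returns NaN (because it was not found) and 4 returns the actual row, because it is in X_train. To fix it, just switch to .iloc or manually rebuild the index of X_train.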
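
Both fixes can be sketched as follows. This is a hedged illustration, not the asker's code: the data and the train_index array are hypothetical stand-ins for what one KFold split would yield. It also assumes a recent pandas release, where .ix no longer exists and .loc with a list containing a missing label raises KeyError instead of returning NaN rows. For that reason the old silent-NaN behaviour is reproduced here with reindex.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 7 rows whose labels are left over from a shuffled split.
X_train = pd.DataFrame(
    {"length": [509.9, 397.5, 502.1, 763.4, 560.4, 1109.8, 269.0]},
    index=[4, 2, 7, 6, 5, 3, 9],
)
y_train = np.array([True, False, False, True, True, True, False])

# Hypothetical positions such as one KFold split would yield (0..n-1).
train_index = np.array([0, 1, 3, 5])

# What the old .ix/.loc lookup effectively did: treat positions as labels.
# reindex fills labels that do not exist (0 and 1 here) with NaN.
as_labels = X_train.reindex(train_index)
print(as_labels["length"].isna().tolist())

# Fix 1: index by position.
by_position = X_train.iloc[train_index]

# Fix 2: rebuild the index so labels and positions coincide again.
X_reset = X_train.reset_index(drop=True)
by_label = X_reset.loc[train_index]

# y_train is a NumPy array, so it is always indexed by position.
cv_train_y = y_train[train_index]
```

Either fix selects the same four rows. .iloc leaves the original labels in place, which helps if rows need to be traced back to the source frame, while reset_index(drop=True) discards them.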

Archived:	9 years, 2 months ago
Views:	1740
Last activity:	9 years, 2 months ago