KeyError "not in index" during cross-validation

Question

sar*_*iii 6 python scikit-learn cross-validation

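I applied an SVM to my dataset. The dataset is multi-label, meaning each observation has more than one label.

During KFold cross-validation it raises the error "not in index".

It reports that the indices from 601 to 6007 are not in index (I have 1 ... 6008 data samples).

Here is my code: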

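import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Note: stop_words (used below) is not defined in the code as posted.
df = pd.read_csv("finalupdatedothers.csv")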
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
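    # Score against the true labels in y_test (X_test only holds the sentences)
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(y_test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(y_test[category], prediction, average='weighted')))
    # Pair each test sentence with its 0/1 prediction for this category
    print([{X_test['sentences'].iloc[i]: prediction[i]} for i in range(len(prediction))])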


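Actually, I don't know how to apply KFold cross-validation so that I get the F1 score and accuracy for each label separately. I looked at this, but it did not help me. How can I apply it successfully to my case?

To reproduce, here is a small sample of the data frame; the last seven columns are my labels, including ADR, WD, ...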

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1


Update

When I did what Vivek Kumar said, it raised this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]


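It happens in the classifier part. Do you know how to solve it?

There are a few links on Stack Overflow about this error saying that I need to reshape the training data. I tried that too, without success (link). Thanks :)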

Answer 1

Viv*_*mar 20

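train_index and test_index are integer positions based on the number of rows, but pandas label-based indexing does not work that way. Newer versions of pandas are stricter about how you slice or select data from a DataFrame.

You need to use .iloc to access the data by position. More information is available here.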
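
To illustrate the difference, here is a minimal sketch on a made-up four-row DataFrame (the data and the train_index values are hypothetical, chosen only for the demonstration). Indexing the DataFrame with square brackets treats the integers as column labels and raises a KeyError, while .iloc selects rows by position:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'sentences' column name comes from the question.
X = pd.DataFrame({"sentences": ["a", "b", "c", "d"]})

# Integer positions of the kind KFold.split() yields.
train_index = np.array([0, 2, 3])

# X[train_index] looks for COLUMNS named 0, 2 and 3, which do not exist.
try:
    X[train_index]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# .iloc selects ROWS by integer position, which is what the split indices mean.
X_train = X.iloc[train_index]
print(X_train)
```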

This is what you need:

for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ...
    ...

    # TfidfVectorizer doesn't work with a DataFrame directly,
    # because iterating a DataFrame gives the column names, not the actual data
    # So specify explicitly the column name, to get the sentences

    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
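
The point made in the comment above can be checked directly. The following is a small sketch, assuming a three-row DataFrame built from sentences in the question's sample: passing the whole DataFrame makes the vectorizer see a single "document" (the column name), which is likely where the 1 in the ValueError "[1, 5408]" comes from, whereas passing the column gives one row per sentence.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Three sentences taken from the sample data in the question.
X = pd.DataFrame({
    "sentences": [
        "I am detoxing from Lexapro now.",
        "I have no idea when this will end.",
        "I am now 10 days completely off and OMG is it rough.",
    ]
})

# Iterating a DataFrame yields its column names, not its rows.
column_names = list(X)
print(column_names)

# Whole DataFrame: the vectorizer sees one document, the string 'sentences'.
frame_matrix = TfidfVectorizer().fit_transform(X)

# Single column (a Series): the vectorizer sees one document per row.
series_matrix = TfidfVectorizer().fit_transform(X["sentences"])

print(frame_matrix.shape[0], series_matrix.shape[0])
```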

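
As for getting accuracy and F1 per label, one possible arrangement is to fit and score inside the fold loop and average afterwards. The sketch below is not the asker's exact setup: the toy data is invented, only two of the seven labels are used, and each label is fit as a separate binary problem with a plain LinearSVC instead of the OneVsRestClassifier wrapper. The names rows, scores and mean_scores are introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 12 rows alternating between two labels.
df = pd.DataFrame({
    "sentences": ["weight gain and hair loss",
                  "withdrawal dizziness and anxiety"] * 6,
    "ADR": [1, 0] * 6,
    "WD": [0, 1] * 6,
})
categories = ["ADR", "WD"]
X = df["sentences"]   # a Series, so the vectorizer sees one document per row
y = df[categories]

kf = KFold(n_splits=3)
rows = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    # Positional selection with .iloc, as in the answer above.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    for category in categories:
        # A fresh pipeline per fold and label, so nothing leaks between fits.
        model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
        model.fit(X_train, y_train[category])
        prediction = model.predict(X_test)
        rows.append({
            "fold": fold,
            "category": category,
            "accuracy": accuracy_score(y_test[category], prediction),
            "f1": f1_score(y_test[category], prediction, average="weighted"),
        })

# One row per (fold, label); average over folds to get per-label scores.
scores = pd.DataFrame(rows)
mean_scores = scores.groupby("category")[["accuracy", "f1"]].mean()
print(mean_scores)
```

On real data it may be worth passing shuffle=True and a random_state to KFold, since an unshuffled split of a sorted file can leave some folds without any positive examples for a rare label.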
