How to fix "IndexError: too many indices for array"

Suj*_* De 6 python arrays machine-learning indices data-science

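The code below raises the error "IndexError: too many indices for array". I am very new to machine learning, so I have no idea how to solve this. Any kind of help would be greatly appreciated.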

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

train = pandas.read_csv("D:/...input/train.csv")

# Columns 0-53 are the features; everything from column 54 onward is the target
xTrain = train.iloc[:,0:54]
yTrain = train.iloc[:,54:]

clf = LogisticRegression(multi_class='multinomial')
scores = cross_val_score(clf, xTrain, yTrain, cv=10, scoring='accuracy')
print('****Results****')
print(scores.mean())

Vet*_* PS 6

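Step-by-step walkthrough of the ML code, using a pandas DataFrame:

  1. Separate the predictor columns and the target column into X and y, respectively.

  2. Split the data into training data (X_train, y_train) and test data (X_test, y_test).

  3. Compute the cross-validated AUC (area under the curve). This raised "IndexError: too many indices for array" because of y_train: the call expects a 1-D array, but the extracted y_train was a 2-D array, so the shapes did not match. After replacing y_train with y_train['y'], the code worked like a charm.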


   # Importing packages:

   import pandas as pd

   from sklearn.model_selection import cross_val_score

   from sklearn.model_selection import StratifiedShuffleSplit

   # Separating predictor and target columns into X and y respectively:
   # df -> DataFrame extracted from the CSV file

   data_X = df.drop(['y'], axis=1)
   data_y = pd.DataFrame(df['y'])

   # Making a stratified shuffle split of train and test data
   # (test_size=0.3 denotes 30% test data and the remaining 70% train data):

   rs = StratifiedShuffleSplit(n_splits=2, test_size=0.3, random_state=2)
   rs.get_n_splits(data_X, data_y)

   for train_index, test_index in rs.split(data_X, data_y):

       # Splitting training and testing data based on index values:

       X_train, X_test = data_X.iloc[train_index], data_X.iloc[test_index]
       y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]

       # Calculating 5-fold cross-validated AUC (cv=5).
       # classify -> the estimator being evaluated (its definition is not shown in this answer).
       # The call below raised the IndexError because y_train is 2-D, so it is commented out:

       # classify_cross_val_score = cross_val_score(classify, X_train, y_train, cv=5, scoring='roc_auc').mean()

       # Worked after replacing y_train with y_train['y'] in the call above,
       # where 'y' is the ONLY column (Series) present in the pandas DataFrame,
       # i.e. the target variable for prediction:

       classify_cross_val_score = cross_val_score(classify, X_train, y_train['y'], cv=5, scoring='roc_auc').mean()

       print("Classify_Cross_Val_Score ", classify_cross_val_score)

       print(y_train.shape)

       print(y_train['y'].shape)

Output:

    Classify_Cross_Val_Score  0.7021433588790991
    (31647, 1) # 2-D
    (31647,)   # 1-D
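The shape difference shown in the output can be reproduced on a tiny frame. The following is only a minimal sketch: the toy data and the column name `x1` are made up for illustration, and the last two conversions are alternatives that were not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data; only the column name 'y' mirrors the answer above.
df = pd.DataFrame({"x1": [0.1, 0.4, 0.35, 0.8], "y": [0, 1, 0, 1]})

y_frame = pd.DataFrame(df["y"])            # one-column DataFrame: 2-D
y_series = y_frame["y"]                    # selecting the column gives a 1-D Series
y_squeezed = y_frame.squeeze("columns")    # alternative: collapse the single column
y_raveled = y_frame.values.ravel()         # alternative: flat 1-D NumPy array

print(y_frame.shape)
print(y_series.shape)
print(y_squeezed.shape)
print(y_raveled.shape)
```

Any of the three 1-D forms can be passed as the target; selecting the column by name is the one the answer actually used.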

Note: import cross_val_score from sklearn.model_selection, not from sklearn.cross_validation. The sklearn.cross_validation module is deprecated and has been removed from recent scikit-learn releases.
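Applied to the code in the question, the same idea might look like the sketch below. It is not the asker's actual data: `train.csv` is not available, so a synthetic frame with 54 made-up feature columns and one `label` column stands in for it. The `multi_class` argument is left out and `max_iter` is raised, both as assumptions to keep the sketch quiet on current scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train.csv (hypothetical): 54 features + 1 label column.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(120, 54)),
                     columns=[f"f{i}" for i in range(54)])
train["label"] = rng.integers(0, 3, size=120)

xTrain = train.iloc[:, 0:54]
yTrain_2d = train.iloc[:, 54:]   # slice -> DataFrame of shape (120, 1), as in the question
yTrain = train.iloc[:, 54]       # integer position -> 1-D Series of shape (120,)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, xTrain, yTrain, cv=10, scoring="accuracy")
print("****Results****")
print(scores.mean())
```

The only change that matters for the error is `54` instead of `54:` when selecting the target column.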


小智 4

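The error you are getting basically says that you are using the array in a way that does not fit how it was declared. I can't see the declaration of your array, but I assume it is one-dimensional and the program is objecting to you treating it as a two-dimensional one.

Just check that your declarations are correct, and test the code by printing the values after setting them, to double-check that they are what you expect.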
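As a minimal sketch of that kind of check (the array here is a made-up example, not the asker's data), printing `ndim` and `shape` shows how many indices an array accepts, and indexing a 1-D array with two indices reproduces the error:

```python
import numpy as np

arr = np.array([10, 20, 30])   # hypothetical 1-D array
print(arr.ndim, arr.shape)     # inspect the dimensions before indexing

try:
    arr[0, 1]                  # two indices on a 1-D array
except IndexError as exc:
    message = str(exc)
    print(message)
```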

There are already some questions on this topic, so I'll link one here that may be helpful: IndexError: too many indices. Numpy Array with 1 row and 2 columns