TypeError when using train_test_split inside a custom class (Singleton array ...)

hun*_*der 6 python-3.x scikit-learn

TypeError: Singleton array array(<__main__.AZHU_EmailClassifier_2 object at 0x000001D6E7A680D0>, dtype=object) cannot be considered a valid collection.

I get this error when I try to run the train_test_split function inside my custom AZHU_EmailClassifier_2 class.

My class:

class AZHU_EmailClassifier_2:
    import os
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    def __init__(self):
        pass
    
    def retrain_model(self, csv_file):
        
        MIN_ROW_NUMBER = 500
        TEST_SIZE = 0.25
        RANDOM_STATE = 42
        
        self.os.chdir(r"c:\LORI\PROJECTS\ALLIANZ\INCOMING_CHANNELS") # <---- folder of the retraining file
        
        df=self.pd.read_excel(csv_file,error_bad_lines=False, header=None)
        
        df.dropna(axis=0,how='any', inplace=True)
             
        rows_no=df.shape[0]
        if rows_no<MIN_ROW_NUMBER:
            print("Insufficient number of rows (<35.000)! RETRAINING ABORTED")
            return None

        X=df[0]
        y=df[1]
        
        X_train, X_test, y_train, y_test=self.train_test_split(X,y)
        #X_train, X_test, y_train, y_test=self.train_test_split(X,y,test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)
        
        return X_train
                

The error is triggered when I call the train_test_split function.

The full error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
      1 instance = AZHU_EmailClassifier_2()
      2 
----> 3 instance.retrain_model("retraining_dummy.xlsx")

<ipython-input-...> in retrain_model(self, csv_file)
     28         y=df[1]
     29 
---> 30         X_train, X_test, y_train, y_test=self.train_test_split(X,y)
     31         #X_train, X_test, y_train, y_test=self.train_test_split(X,y,test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)
     32 

~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
   2125         raise TypeError("Invalid parameters passed: %s" % str(options))
   2126 
-> 2127     arrays = indexable(*arrays)
   2128 
   2129     n_samples = _num_samples(arrays[0])

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    291     """
    292     result = [_make_indexable(X) for X in iterables]
--> 293     check_consistent_length(*result)
    294     return result
    295 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    251     """
    252 
--> 253     lengths = [_num_samples(X) for X in arrays if X is not None]
    254     uniques = np.unique(lengths)
    255     if len(uniques) > 1:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in <listcomp>(.0)
    251     """
    252 
--> 253     lengths = [_num_samples(X) for X in arrays if X is not None]
    254     uniques = np.unique(lengths)
    255     if len(uniques) > 1:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _num_samples(x)
    194     if hasattr(x, 'shape') and x.shape is not None:
    195         if len(x.shape) == 0:
--> 196             raise TypeError("Singleton array %r cannot be considered"
    197                             " a valid collection." % x)
    198         # Check that shape is returning an integer or default to len

TypeError: Singleton array array(<__main__.AZHU_EmailClassifier_2 object at 0x000001D6E7A68F10>, dtype=object) cannot be considered a valid collection.

I have no idea why this error is thrown. Could you point me in the right direction? Any help is appreciated!

meT*_*sky 6

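You get this error because you import train_test_split inside the class body. That makes train_test_split a class attribute, so self.train_test_split is a bound method rather than a plain function, and every call through self passes the instance as the first argument. scikit-learn then tries to treat the instance as the first array to split. Here is a minimal example that reproduces the situation: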

class test():
    
    from sklearn.model_selection import train_test_split
    
    def retrain_model(self):
        print(self.train_test_split)
        print(self.train_test_split())
    
test_instance = test()
test_instance.retrain_model()

Running this script raises the same TypeError:

TypeError: Singleton array array(<__main__.test object at 0x7ffa473ae438>, dtype=object) cannot be considered a valid collection.

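The first print shows the same address, 0x7ffa473ae438: self.train_test_split is a bound method of that very instance, and the instance is the object that ends up being passed to train_test_split.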
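The binding itself has nothing to do with scikit-learn. As a minimal sketch, with a hypothetical describe function standing in for train_test_split and a hypothetical Demo class, the following shows that any plain function stored as a class attribute receives the instance as its first argument when it is called through self:

```python
import inspect


def describe(*args):
    """Hypothetical stand-in for train_test_split: report the argument types."""
    return [type(a).__name__ for a in args]


class Demo:
    # Assigning (or importing) a plain function in the class body makes it a
    # class attribute, so accessing it on an instance yields a bound method.
    helper = describe

    def run(self, X, y):
        # Equivalent to describe(self, X, y): the instance is slipped in first.
        return self.helper(X, y)


demo = Demo()
received = demo.run([1, 2], [0, 1])
is_bound = inspect.ismethod(demo.helper)

print(received)   # three entries: the instance comes before X and y
print(is_bound)
```

In the question's code the extra first argument is the AZHU_EmailClassifier_2 instance, which is why it shows up inside the "Singleton array" message.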

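According to PEP8:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

So the simplest fix is to move all imports outside the class and call train_test_split directly: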

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

class AZHU_EmailClassifier_2():

    def __init__(self):
        pass
    
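    def retrain_model(self):
        
        MIN_ROW_NUMBER = 20
        TEST_SIZE = 0.25
        RANDOM_STATE = 42
                
        # Dummy data standing in for the Excel file from the question
        df = pd.DataFrame({0:np.linspace(1,100,100),1:np.random.rand(100)})
        X = df[0]
        y = df[1]
        # train_test_split is a plain module-level function here, so no instance is passed in
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=TEST_SIZE,random_state=RANDOM_STATE)
        
        return X_train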

test = AZHU_EmailClassifier_2()
test.retrain_model()
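If the function really has to stay reachable through self, one possible alternative (not part of the original answer) is to wrap it in staticmethod, which prevents Python from binding it to the instance. The sketch below uses a hypothetical split_in_half function and Classifier class as stand-ins so that it runs without scikit-learn; module-level imports remain the more conventional fix.

```python
def split_in_half(X, y):
    """Hypothetical stand-in for train_test_split: cut both sequences in half."""
    mid = len(X) // 2
    return X[:mid], X[mid:], y[:mid], y[mid:]


class Classifier:
    # staticmethod() stops the function from being bound to the instance,
    # so self.split(X, y) receives exactly the two arguments given.
    split = staticmethod(split_in_half)

    def retrain_model(self, X, y):
        X_train, X_test, y_train, y_test = self.split(X, y)
        return X_train, X_test


X_train, X_test = Classifier().retrain_model([1, 2, 3, 4], ["a", "b", "c", "d"])
print(X_train, X_test)
```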