TypeError: Expected sequence or array-like, got estimator

Dee*_*aya 6 python-2.7 pandas scikit-learn

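I am working on a project that analyzes user reviews of products. I use TfidfVectorizer to extract features from my dataset, in addition to some other features that I extracted manually.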

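import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.cross_validation matches the traceback below; newer releases use sklearn.model_selection
from sklearn.cross_validation import train_test_split

df = pd.read_csv('reviews.csv', header=0)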

FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)

X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse =  vectorizer.inverse_transform(X)

for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
    for inv in inverse[i]:
        df.set_value(i, inv, True)

train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)

The code above works fine. But when I change max_features from 45 to something higher, I get an error on the train_test_split line.

The error is:

Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator

I'm not sure what exactly changes when I increase max_features.

Let me know if you need more data or if I have missed something.

elz*_*elz 8

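I know this is old, but I had the same problem, and while @shahins's answer works, I wanted something that keeps the DataFrame object so that I can still index into the train/test splits.

Solution:

Rename the DataFrame column fit to something other than fit: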

df = df.rename(columns = {'fit': 'fit_feature'})
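If you would rather not rename columns one at a time, one option is to give every generated text-feature column a prefix, so that no vocabulary term can end up as a column named fit. The sketch below only builds the rename mapping; the helper name prefixed_names, the tfidf_ prefix, and the sample terms are all made up for illustration.

```python
def prefixed_names(terms, prefix="tfidf_"):
    """Map each vocabulary term to a prefixed column name (hypothetical helper)."""
    return {term: prefix + term for term in terms}


# Made-up sample vocabulary; in the question this would be the vectorizer's feature names.
terms = ["fit", "great", "too small"]
mapping = prefixed_names(terms)

print(mapping["fit"])  # tfidf_fit
```

The mapping can then be passed to df.rename(columns=mapping), or used directly when the columns are first created, so a bare fit column never exists.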

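Why this works:

It isn't actually the number of features that is the problem; one feature in particular causes it. I'm guessing you are getting the word "fit" as one of your text features (and it doesn't show up at the lower max_features threshold).

Looking at the sklearn source code, it makes sure you are not passing an sklearn estimator by testing whether any of your objects has a "fit" attribute. The code is looking for the fit method of an sklearn estimator, but it also raises the exception when your DataFrame has a fit column (remember that df.fit and df['fit'] both select the "fit" column).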
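The mechanism can be reproduced without pandas or sklearn. The following is a minimal sketch, not the library's code: Frame is a hypothetical stand-in for a DataFrame that exposes its columns as attributes, and num_samples_old mimics the attribute check described above. The second function, which only rejects a callable fit, is an assumption about how such a check could be tightened, not a quote from sklearn.

```python
class Frame(object):
    """Hypothetical stand-in for a DataFrame: columns are reachable as attributes."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # Only called when normal lookup fails, similar to pandas column access.
        try:
            return self.__dict__["_columns"][name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        # Number of rows, taken from the first column (0 if there are no columns).
        return len(next(iter(self._columns.values()), []))


def num_samples_old(x):
    """Reject anything that has a fit attribute, as in the check described above."""
    if hasattr(x, "fit"):
        raise TypeError("Expected sequence or array-like, got estimator")
    return len(x)


def num_samples_strict(x):
    """Assumed variant: reject only objects whose fit is callable, i.e. a method."""
    if callable(getattr(x, "fit", None)):
        raise TypeError("Expected sequence or array-like, got estimator")
    return len(x)


frame = Frame({"review": ["good fit", "too small"], "fit": [True, False]})
renamed = Frame({"review": ["good fit", "too small"], "fit_feature": [True, False]})

print(hasattr(frame, "fit"))      # True: the column looks like an estimator's fit
print(num_samples_old(renamed))   # 2: renaming the column avoids the check
print(num_samples_strict(frame))  # 2: a list is not callable, so it passes
```

Calling num_samples_old(frame) raises the same TypeError as in the question, which is why renaming the column is enough to make the split work.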


sha*_*ins 3

I ran into this problem and tried something like the following, which worked for me:

train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)
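Note that as_matrix() was deprecated and later removed from pandas (it is gone as of pandas 1.0), so on a current install the equivalent would be to_numpy() or .values. The sketch below, using a made-up two-column frame, shows why converting helps: the resulting NumPy array has no fit attribute, whereas the DataFrame with a fit column does. The trade-off is that the column names are lost, so the split can no longer be indexed by column.

```python
import pandas as pd

# Made-up frame with a column literally named "fit".
df = pd.DataFrame({"feature1": [1, 2, 3], "fit": [True, False, True]})

# to_numpy() (pandas 0.24+) replaces the removed as_matrix(); df.values also works.
arr = df.to_numpy()

print(hasattr(df, "fit"))   # True
print(hasattr(arr, "fit"))  # False
```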