Using multiple features with scikit-learn

Jam*_*ily 7 python machine-learning pandas scikit-learn

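I am using scikit-learn for text classification. Things work well with a single feature, but introducing multiple features gives me errors. I think the problem is that I am not formatting the data the way the classifier expects.

For example, this works fine: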

data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

But this:

data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

dies with:

Traceback (most recent call last):
  File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
    classifier.fit(X_train, Y_train)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

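This happens during the preprocessing stage after classifier.fit() is called. I think the problem is the way I am formatting the data, but I can't figure out how to get it right.

feature1 and feature2 are both English text strings, as is the target. I am using LabelEncoder() to encode the target, which seems to work fine.

Here is an example of what print data returns, to give you an idea of how it is currently formatted.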

[['some short english text'
  'a paragraph of english text']
 ['some more short english text'
  'a second paragraph of english text']
 ['some more short english text'
  'a third paragraph of english text']]

ely*_*ely 3

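The specific error message makes it look like somewhere your code expects something to be a str (so that .lower can be called), but it is instead receiving a whole array (probably a whole array of strs).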
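To illustrate that diagnosis, here is a minimal sketch that uses only NumPy. The helper lower_all is hypothetical: it just stands in for the per-document .lower() call seen in the traceback and is not scikit-learn's actual code. The sample strings are taken from the question's printed data.

```python
import numpy as np

# One text column: iterating yields one str per document.
one_col = np.array(
    ["some short english text", "some more short english text"],
    dtype=object,
)

# Two text columns: iterating yields one ndarray (a whole row) per "document".
two_col = np.array(
    [
        ["some short english text", "a paragraph of english text"],
        ["some more short english text", "a second paragraph of english text"],
    ],
    dtype=object,
)


def lower_all(docs):
    """Hypothetical stand-in for a text preprocessor: lowercases each document."""
    return [doc.lower() for doc in docs]


# Works: every element is a str.
one_col_result = lower_all(one_col)

# Fails: every element is a 1-D ndarray, which has no .lower method.
try:
    lower_all(two_col)
    two_col_error = None
except AttributeError as exc:
    two_col_error = str(exc)

print(two_col.shape)
print(two_col_error)
```

If this matches what is happening in the pipeline, the text vectorizer is being handed rows of the 2-D array rather than individual strings.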

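Can you edit the question to better describe the data, and post the full traceback rather than just the small part with the named error?

In the meantime, you could also try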

data = df[['feature1', 'feature2']].values

and

df['target'].values

instead of explicitly casting to np.ndarray yourself.

To me it looks as though a 1x1 array is being created, and the single element in that "array" is itself an ndarray.
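For completeness, here is one possible way to feed two text columns into a single pipeline. It is a sketch and not part of the original answer. It assumes a scikit-learn version that provides sklearn.compose.ColumnTransformer. The toy DataFrame, the target labels "grant" and "deny", the step names, and the choice of CountVectorizer and LogisticRegression are all made up for illustration, because the asker's Pipeline(...) contents are not shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the asker's DataFrame (contents are hypothetical).
df = pd.DataFrame({
    "feature1": [
        "some short english text",
        "some more short english text",
        "another short english text",
        "yet more short text",
    ],
    "feature2": [
        "a paragraph of english text",
        "a second paragraph of english text",
        "a third paragraph of english text",
        "a fourth paragraph of english text",
    ],
    "target": ["grant", "deny", "grant", "deny"],
})

# Encode the string targets as integers, as in the question.
label_encoder = LabelEncoder()
classes = label_encoder.fit_transform(df["target"])

# Passing a column name as a plain string (not a list) gives each
# vectorizer a 1-D column of str, which is what it expects.
features = ColumnTransformer([
    ("f1", CountVectorizer(), "feature1"),
    ("f2", CountVectorizer(), "feature2"),
])

classifier = Pipeline([
    ("features", features),
    ("model", LogisticRegression()),
])

# Fit on the DataFrame itself so columns can be selected by name.
X = df[["feature1", "feature2"]]
classifier.fit(X, classes)
predictions = classifier.predict(X)
print(label_encoder.inverse_transform(predictions))
```

Each column gets its own vocabulary here, and the two sparse matrices are stacked side by side before reaching the model. A simpler alternative, if separate vocabularies are not needed, is to join the two columns into one string per row and keep the original single-feature pipeline.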