CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

ash*_*shu 11 python numpy scikit-learn text-classification

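I have a one-dimensional array in which each element is a large string. I am trying to use a CountVectorizer to convert the text data into numerical vectors. However, I get an error saying: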

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray contains a large string in each element. There are 5000 such samples. I am trying to vectorize it as follows:

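from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  # ngram_range=(1, 1) is the default
    dtype='double',
)
# mealarray is the array of 5000 text samples described above
data = vectorizer.fit_transform(mealarray)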

Full stack trace:

File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

War*_*ser 14

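Check the shape of mealarray. If the argument to fit_transform is an array of strings, it must be a one-dimensional array. (That is, mealarray.shape must be of the form (n,).) For example, you will get the "no attribute" error if mealarray has a shape such as (n, 1).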

You could try something like:

data = vectorizer.fit_transform(mealarray.ravel())
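To make the shape issue concrete, here is a minimal sketch that reproduces the error and then applies ravel(). The three short documents are made up for illustration and only stand in for the real mealarray; the assumption is that the real array has shape (n, 1) rather than (n,).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for mealarray: three short documents stored
# as a column, so the shape is (3, 1) instead of (3,).
mealarray = np.array([
    ["grilled chicken with rice"],
    ["rice and beans"],
    ["chicken soup"],
])

vectorizer = CountVectorizer(stop_words="english")

# Iterating over a 2-D array yields 1-D rows (ndarray objects), not
# strings, so the vectorizer's lowercasing step fails on each "document".
try:
    vectorizer.fit_transform(mealarray)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Flattening to shape (3,) makes every element a single string.
flat = mealarray.ravel()
data = vectorizer.fit_transform(flat)

print(mealarray.shape, "->", flat.shape)
print(error_message)
print(data.shape)
```

Printing mealarray.shape on the real data before vectorizing is a quick way to confirm whether this is the cause.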


ash*_*shu 7

I found the answer to my problem. Basically, CountVectorizer takes a list (with string contents) as its argument rather than an array. That solved my problem.
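For reference, a rough sketch of the list conversion, assuming the original array was column-shaped as the other answer suggests. The two documents are invented placeholders. Note that calling tolist() directly on an (n, 1) array gives nested lists, so the sketch flattens first to get one string per item.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column-shaped array standing in for mealarray, shape (2, 1).
mealarray = np.array([["rice and beans"], ["chicken soup"]])

# tolist() on the 2-D array gives a list of one-item lists, which is
# still not a sequence of strings.
nested = mealarray.tolist()

# Flatten first so that each list item is a single document string.
documents = mealarray.ravel().tolist()

vectorizer = CountVectorizer(stop_words="english")
data = vectorizer.fit_transform(documents)

print(type(nested[0]).__name__, type(documents[0]).__name__)
print(sorted(vectorizer.vocabulary_))
```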