
AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn

I have a corpus like this:

X_train = [ ['this is an dummy example'],
      ['in reality this line is very long'],
      ...,
      ['here is a last text in the training set']
    ]


and some labels:

y_train = [1, 5, ... , 3]


I would like to use Pipeline and GridSearch as follows:

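# Imports assumed for a recent scikit-learn (GridSearchCV lives in model_selection)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([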
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])


parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X_train, y_train)


When I run this, I get an error saying AttributeError: lower not found.

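I searched here and found a question about this error, which led me to believe the problem was that my text was not being tokenized. That sounded like it hit the nail on the head, since my input data is a list of lists, where each inner list contains a single unbroken string.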
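For illustration, here is a minimal sketch of why a list-of-lists input could produce this kind of error. CountVectorizer lowercases each document by default (lowercase=True), which means calling .lower() on it, and a Python list has no such method. The names X_nested and flatten_documents are hypothetical, and the three documents are only a stand-in for the elided corpus above; the sketch uses plain Python and does not call scikit-learn.

```python
# Hypothetical stand-in for the corpus in the question: each document is
# wrapped in its own one-element list.
X_nested = [
    ['this is an dummy example'],
    ['in reality this line is very long'],
    ['here is a last text in the training set'],
]


def flatten_documents(nested_docs):
    """Unwrap one-element lists so each document is a plain string."""
    return [doc[0] for doc in nested_docs]


# A list has no .lower() method, which is what a lowercasing step would call.
print(hasattr(X_nested[0], 'lower'))

# After flattening, every document is a str, which does have .lower().
X_flat = flatten_documents(X_nested)
print(hasattr(X_flat[0], 'lower'))
```

If this is indeed the cause, passing the flattened list of strings (rather than the nested one) to grid_search.fit would be the thing to try before writing a custom tokenizer.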

I cooked up a quick-and-dirty tokenizer to test this theory: