Merging CountVectorizers in Scikit-Learn feature extraction

Question

Arc*_*kla 5 python feature-extraction scikit-learn

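I am new to scikit-learn and need some help with something I have been working on.

I am trying to classify two types of documents (say, type A and type B) using multinomial Naive Bayes classification. To get the term counts for these documents, I am using the CountVectorizer class in sklearn.feature_extraction.text.

The problem is that the two types of documents require different regular expressions to extract tokens (the token_pattern parameter of CountVectorizer). I can't seem to find a way to first load the training documents of type A and then the training documents of type B. Is it possible to do something like the following: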

vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit(list_of_type_A_document_content)
...
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit(list_of_type_B_document_content)
...
# Somehow merge the two vectorizers results and get the final sparse matrix


Answer 1

小智 4

You could try:

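from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Fit one vectorizer per document type (other CountVectorizer arguments omitted)
vecA = CountVectorizer(token_pattern="[a-zA-Z]+")
vecA.fit(list_of_type_A_document_content)
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+")
vecB.fit(list_of_type_B_document_content)
# Concatenate the columns produced by the two fitted vectorizers
combined_features = FeatureUnion([('CountVectorizer', vecA), ('CountVect', vecB)])
combined_features.transform(test_data)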


You can read more about FeatureUnion at http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html

It is available since version 0.13.1.

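If both "vecA" and "vecB" contain the word "happy", the features will be "vecA_name_happy" and "vecB_name_happy". That is probably not what was intended; a single "happy" feature is needed. (3 upvotes)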
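
One possible way to address this concern is to give both vectorizers the same fixed vocabulary, so that each document type keeps its own token_pattern while a shared word maps to a single column. The following is only a sketch under assumptions: the toy corpora and all variable names are hypothetical, and it relies on the vocabulary parameter of CountVectorizer and on scipy.sparse.vstack.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the two document types.
docs_a = ["happy day", "sad day"]
docs_b = ["happy 2day", "route 66"]

pattern_a = r"[a-zA-Z]+"
pattern_b = r"[a-zA-Z0-9]+"

# Pass 1: learn each type's vocabulary with its own token pattern.
vocab_a = CountVectorizer(token_pattern=pattern_a).fit(docs_a).vocabulary_
vocab_b = CountVectorizer(token_pattern=pattern_b).fit(docs_b).vocabulary_

# Merge into one sorted vocabulary so a shared word gets exactly one column.
shared_vocab = sorted(set(vocab_a) | set(vocab_b))

# Pass 2: different tokenization per type, identical column layout.
vec_a = CountVectorizer(token_pattern=pattern_a, vocabulary=shared_vocab)
vec_b = CountVectorizer(token_pattern=pattern_b, vocabulary=shared_vocab)

# Stack the rows: type A documents first, then type B documents.
X = vstack([vec_a.transform(docs_a), vec_b.transform(docs_b)]).tocsr()
print(shared_vocab)
print(X.toarray())
```

The stacked matrix X, together with labels in the same row order, could then be passed to MultinomialNB. New documents would need to be transformed with the vectorizer that matches their type.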
