Can you add to a CountVectorizer in scikit-learn?

Question

I'd like to create a CountVectorizer in scikit-learn from a corpus of text, and then later add more text to that CountVectorizer (extending the original vocabulary).

If I use transform(), it keeps the original vocabulary but does not add new words. If I use fit_transform(), it simply regenerates the vocabulary from scratch. See below:

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.transform(["This not is a test"])
Out[5]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}

In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}


I would like the equivalent of an update() function. I'd like it to work like this:

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.update(["This not is a test"])
Out[5]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}


Is there a way to do this?

Answer 1

pim*_*314 5

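The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.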

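You can still get what you want by approaching it slightly differently: refit on the original text together with the new text. See the following code: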

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# Fit on the original corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)
# Refit on the original corpus plus the new text to pick up the new words.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)


which outputs

{u'this': 2, u'test': 1, u'is': 0}
{u'this': 3, u'test': 2, u'is': 0, u'not': 1}
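If you want something that feels like the update() from the question, the refitting idea above can be wrapped in a small helper. This is only a minimal sketch: the class name RefittingCountVectorizer and its update() method are hypothetical and not part of scikit-learn, and it assumes you can afford to keep every document in memory and refit each time.

```python
from sklearn.feature_extraction.text import CountVectorizer


class RefittingCountVectorizer:
    """Hypothetical helper (not a scikit-learn class): remembers every
    document it has seen and refits a CountVectorizer on all of them."""

    def __init__(self, **kwargs):
        self.documents = []                      # all text seen so far
        self.vectorizer = CountVectorizer(**kwargs)

    def update(self, new_documents):
        # Add the new text, then rebuild the vocabulary from the full corpus.
        self.documents.extend(new_documents)
        return self.vectorizer.fit_transform(self.documents)

    @property
    def vocabulary_(self):
        return self.vectorizer.vocabulary_


cv = RefittingCountVectorizer()
cv.update(["This is a test"])
X = cv.update(["This not is a test"])
print(cv.vocabulary_)   # contains 'is', 'not', 'test', 'this'
print(X.shape)          # (2, 4): both documents, four terms
```

Note that update() returns the count matrix for the whole accumulated corpus, not just for the new documents, because every call is a full refit.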

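One side effect of refitting, visible in the output above, is that existing terms can move to different columns ('this' goes from 2 to 3). If that matters, a different approach (not the one shown above) is to build a new CountVectorizer with a fixed vocabulary that keeps the old indices and appends unseen terms at the end. The sketch below relies on the vocabulary constructor parameter and on build_analyzer(), both of which exist in scikit-learn; the function name extend_vocabulary is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(fitted, new_documents):
    """Hypothetical helper: return a new CountVectorizer whose vocabulary is
    the fitted one plus any unseen terms, with old column indices unchanged."""
    analyze = fitted.build_analyzer()        # same tokenization as `fitted`
    vocabulary = dict(fitted.vocabulary_)    # copy; the original is untouched
    for doc in new_documents:
        for term in analyze(doc):
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)   # append as a new column
    params = fitted.get_params()             # reuse the original settings
    params["vocabulary"] = vocabulary
    return CountVectorizer(**params)


count_vect = CountVectorizer()
count_vect.fit(["This is a test"])

extended = extend_vocabulary(count_vect, ["This not is a test"])
# With a fixed vocabulary, transform() can be called without fitting.
X = extended.transform(["This not is a test"])
print(extended.vocabulary_)   # 'is': 0, 'test': 1, 'this': 2, 'not': 3
print(X.toarray())            # [[1 1 1 1]]
```

Because old terms keep their columns, count matrices produced earlier stay aligned with the new ones and only need zero columns appended for the new terms. Settings such as min_df and max_df have no effect once the vocabulary is fixed.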
