Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Question

Mat*_*ien 27 python n-gram feature-selection scikit-learn

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically how the ngram_range argument works in a CountVectorizer.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
# Note: recent scikit-learn releases only populate vocabulary_ after fit()/transform()
print(cv.vocabulary_)

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

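I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Obviously there is something badly wrong with my understanding of how to use ngrams. Perhaps the argument has no effect, or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that ngram_range would affect the vocabulary, not the corpus.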

Answer 1

Fre*_*Foo 31

Setting the vocabulary explicitly means no vocabulary is learned from the data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{'an': 0,
 'an apple': 1,
 'apple': 2,
 'apple day': 3,
 'away': 4,
 'day': 5,
 'day keeps': 6,
 'doctor': 7,
 'doctor away': 8,
 'keeps': 9,
 'keeps the': 10,
 'the': 11,
 'the doctor': 12}

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary itself is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found
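
To make the mechanics concrete, here is a rough pure-Python sketch of what happens conceptually: n-grams are generated from the text according to ngram_range, and a fixed vocabulary only decides which of them get counted. This is not scikit-learn's actual implementation; the helper names `word_ngrams` and `count_with_fixed_vocabulary` are made up for illustration, and the regex is assumed to mirror the default token pattern.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# only words of two or more characters are kept.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text for n in the inclusive ngram_range."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


def count_with_fixed_vocabulary(text, vocabulary, ngram_range=(1, 2)):
    """Count only the n-grams that appear in a fixed vocabulary.

    Terms are sorted here just to get a stable column order.
    """
    counts = Counter(word_ngrams(text, ngram_range))
    return [counts[term] for term in sorted(vocabulary)]


text = "an apple a day keeps the doctor away"
print(word_ngrams(text))
print(count_with_fixed_vocabulary(text, {"keeps", "keeps the"}))  # [1, 1]
# The vocabulary from the question never matches this sentence:
print(count_with_fixed_vocabulary(text, ["hi ", "bye", "run away"]))  # [0, 0, 0]
```

So ngram_range controls which candidate terms are pulled out of the documents; it never adds entries to a vocabulary you passed in yourself.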

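(Note that token filtering is applied before n-gram extraction, hence "apple day". With the default settings no stop word list is active; the "a" is dropped because the default token pattern ignores single-character tokens.)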
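
The "apple day" bigram can be reproduced with the analyzer alone. A small sketch, assuming a reasonably recent scikit-learn; the looser `token_pattern` in the second half is my own choice for illustration, picked so that single-character tokens survive:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "an apple a day"

# Default settings: the token pattern only keeps words of 2+ characters,
# so "a" is gone before any bigram is built.
default_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
default_bigrams = default_analyzer(sentence)
print(default_bigrams)  # ['an apple', 'apple day']

# Assumption for illustration: a looser pattern that keeps 1-character tokens.
loose_analyzer = CountVectorizer(
    ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b"
).build_analyzer()
loose_bigrams = loose_analyzer(sentence)
print(loose_bigrams)  # ['an apple', 'apple a', 'a day']
```

`build_analyzer()` returns the full preprocessing, tokenizing and n-gram callable, which makes it handy for checking what a given configuration will extract before fitting anything.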
