Adding words to the stop_words list of TfidfVectorizer in sklearn

Question

ac1*_*c11 13 python classification stop-words scikit-learn text-classification

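I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution from "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both the 'english' stop words and the stop words I specified. However, TfidfVectorizer still does not seem to accept my stop word list, and I can still see those words in my feature list. Here is my code: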

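from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words: the list of extra stop words (defined elsewhere)
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer=u'word',max_df=0.95,lowercase=True,stop_words=set(my_stop_words),max_features=15000)
# documents: the corpus (defined elsewhere), named so that it does not shadow the imported text module
X = vectorizer.fit_transform(documents)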

I also tried setting stop_words=my_stop_words in TfidfVectorizer, but it still does not work. Please help.

Answer 1

Ped*_*ram 6

Here is an example:

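from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Extend the built-in English stop words with "book". Recent scikit-learn
# releases expect stop_words to be a list, so the frozenset is converted.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)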

In this example, I create tf-idf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list. In addition, I added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

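As we can see, the word book has also been removed from the feature list, because we listed it as a stop word. So TfidfVectorizer does accept manually added words as stop words and ignores them when building the vectors.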
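
If the custom words still appear in the features, one way to narrow it down is to ask the fitted vectorizer which stop words it is actually using. The following is only a sketch, assuming a reasonably recent scikit-learn and reusing the two sample documents from this answer; the variable names `effective` and `leaked` are made up for illustration.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# A list is passed because recent scikit-learn releases validate that
# stop_words is 'english', a list, or None.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

vectorizer = TfidfVectorizer(stop_words=my_stop_words)
vectorizer.fit(["This is a green apple.", "This is a machine learning book."])

# The effective stop word set that the analyzer filters against.
effective = vectorizer.get_stop_words()
print("book" in effective)

# Any stop word that still shows up in the vocabulary points to a mismatch
# between the stop word list and the tokens the vectorizer produces.
leaked = sorted(set(vectorizer.vocabulary_) & effective)
print(leaked)
```

An empty `leaked` list suggests the stop words are being applied, in which case the remaining suspects are usually spelling or casing differences between the list and the tokens.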

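@StamatisTiniakos There should be. ENGLISH_STOP_WORDS is of type `<class 'frozenset'>`, so, for example, you can create a new list from this set, add words to or remove words from that list, and then pass it to the vectorizer. (2 upvotes)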
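
What this comment describes could look roughly like the sketch below. The choice of "not" as the word to remove and the sample sentence are assumptions made purely for illustration.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Copy the immutable frozenset into a list that can be edited.
stop_words = list(text.ENGLISH_STOP_WORDS)

# Add a custom stop word (the word used in the answer's example).
stop_words.append("book")

# Remove a built-in stop word so that it is kept as a feature.
# "not" is only an illustrative choice.
if "not" in stop_words:
    stop_words.remove("not")

vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorizer.fit(["This is not a machine learning book."])
print(sorted(vectorizer.vocabulary_))
```

Guarding the removal with an `in` check avoids a ValueError if the word is not part of the built-in list in a given scikit-learn version.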

Answer 2

yan*_*han 5

Answered here: https://stackoverflow.com/a/24386751/732396

Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and then pass that variable as the stop_words argument.
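
A minimal sketch of the copy-and-extend approach follows. The helper name `extend_english_stop_words` is hypothetical and not part of scikit-learn. It also lowercases the custom words, on the assumption that the vectorizer runs with its default `lowercase=True`, where tokens are lowercased before stop words are filtered, so an entry such as "Book" would otherwise never match.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def extend_english_stop_words(extra_words):
    """Hypothetical helper: built-in English stop words plus normalized extras."""
    combined = set(ENGLISH_STOP_WORDS)  # mutable copy of the frozenset
    # Tokens are lowercased before the stop word filter runs (lowercase=True),
    # so the custom entries are stripped and lowercased to match them.
    combined.update(word.strip().lower() for word in extra_words)
    return sorted(combined)  # a list, which the stop_words parameter accepts


my_stop_words = extend_english_stop_words(["Book", " Apple "])
vectorizer = TfidfVectorizer(stop_words=my_stop_words)
vectorizer.fit(["This is a green apple.", "This is a machine learning book."])
print(sorted(vectorizer.vocabulary_))
```

Returning a sorted list rather than a set keeps the result deterministic and compatible with scikit-learn versions that only accept a list for `stop_words`.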

Archived:	11 years ago
Views:	20099
Last activity:	8 years, 4 months ago