TfidfVectorizer: ValueError: not a built-in stop list: russian

Edw*_*ard 1 python tf-idf

I am trying to apply TfidfVectorizer with Russian stop words:

Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
Z = Tfidf.fit_transform(X)

I get:

ValueError: not a built-in stop list: russian

It works fine when I use the English stop words:

Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='english' )
Z = Tfidf.fit_transform(X)

How can I fix this? Full traceback:

<ipython-input-118-e787bf15d612> in <module>()
      1 Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
----> 2 Z = Tfidf.fit_transform(X)

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1303             Tf-idf-weighted document-term matrix.
   1304         """
-> 1305         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1306         self._tfidf.fit(X)
   1307         # X is already a transformed view of raw_documents so

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
    815 
    816         vocabulary, X = self._count_vocab(raw_documents,
--> 817                                           self.fixed_vocabulary_)
    818 
    819         if self.binary:

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    745             vocabulary.default_factory = vocabulary.__len__
    746 
--> 747         analyze = self.build_analyzer()
    748         j_indices = _make_int_array()
    749         indptr = _make_int_array()

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self)
    232 
    233         elif self.analyzer == 'word':
--> 234             stop_words = self.get_stop_words()
    235             tokenize = self.build_tokenizer()
    236 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in get_stop_words(self)
    215     def get_stop_words(self):
    216         """Build or fetch the effective stop words list"""
--> 217         return _check_stop_list(self.stop_words)
    218 
    219     def build_analyzer(self):

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_stop_list(stop)
     88         return ENGLISH_STOP_WORDS
     89     elif isinstance(stop, six.string_types):
---> 90         raise ValueError("not a built-in stop list: %s" % stop)
     91     elif stop is None:
     92         return None

ValueError: not a built-in stop list: russian

小智 7

Could you read the documentation before posting?

stop_words : string {'english'}, list, or None (default)

If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
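
Since 'english' is the only supported string, one workaround is to supply the Russian stop words yourself as a list, for example from NLTK's stopwords corpus. A minimal sketch (it assumes nltk and its stopwords data are available; the corpus X below is just a placeholder):

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')                       # fetch the stop word lists once
russian_stop_words = stopwords.words('russian')  # plain Python list of strings

X = ["это пример текста", "ещё один пример"]     # placeholder corpus
Tfidf = TfidfVectorizer(stop_words=russian_stop_words)
Z = Tfidf.fit_transform(X)
print(Z.shape)

Passing a list has the same effect as a built-in name: every word in it is removed from the tokens before the tf-idf weights are computed.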