How do I fix the error: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None?

Question

Seb*_*anS 2 nlp topic-modeling tfidfvectorizer

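I am trying to do topic modeling (with German stop words and German text), following the explanation in Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications. First edition. Sebastopol, CA: O'Reilly Media, Inc, 2020, p. 209 ff.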

# Load Data
import pandas as pd
# Load the Excel workbook
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# Convert Anforderungsbeschreibung to string
df=df.astype({'Anforderungsbeschreibung':'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text:re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))


# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I get this error message:

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

InvalidParameterError                     Traceback (most recent call last)
Cell In[8], line 4
  1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
  3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
  5 tfidf_text_vectors.shape

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

Any hints would be appreciated.
Sebastian

Answer 1

alv*_*vas 7

The stop words you imported from spaCy are not a list.

from spacy.lang.de.stop_words import STOP_WORDS

type(STOP_WORDS)

[out]:

set

Cast the stop words to a list and it should work as expected.

from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS


tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
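If the stop words can arrive as a set, a frozenset, a list, or None depending on where they come from, a small helper keeps the conversion in one place. This is only a sketch: `as_stop_word_list`, `build_vectorizer`, and `GERMAN_SAMPLE` are hypothetical names, and the three-word set is a stand-in for spaCy's `STOP_WORDS` so the snippet runs without spaCy installed. Sorting is a choice made here to get a reproducible order; plain `list(...)` works just as well.

```python
def as_stop_word_list(stop_words):
    """Return stop words in a form the stop_words parameter accepts."""
    if stop_words is None or isinstance(stop_words, str):
        # None, or a built-in name such as "english", is passed through unchanged.
        return stop_words
    # sorted() accepts any iterable (set, frozenset, list, tuple)
    # and always returns a new list.
    return sorted(stop_words)


def build_vectorizer(stop_words, **kwargs):
    """Hypothetical convenience wrapper around TfidfVectorizer."""
    # Imported lazily so the helper above works without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer(stop_words=as_stop_word_list(stop_words), **kwargs)


# Stand-in for spacy.lang.de.stop_words.STOP_WORDS (assumption for this demo).
GERMAN_SAMPLE = {"und", "der", "die"}

print(type(GERMAN_SAMPLE).__name__)      # set
print(as_stop_word_list(GERMAN_SAMPLE))  # ['der', 'die', 'und']
```

With the question's code, that would be `build_vectorizer(stopwords, min_df=5, max_df=0.7)`.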

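Is this a change that sklearn implemented recently? In my own experience, passing a set used to work fine. Older answers on Stack Overflow also show that it used to accept a frozenset, e.g. /sf/ask/1707054261/ (2 upvotes)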
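One way to check how an installed version behaves is to try it. The sketch below is an assumption-laden probe, not an official API: `set_stop_words_accepted` is a hypothetical helper, and the two sample sentences are made up. As far as I know the stricter check came with the parameter validation added around scikit-learn 1.2, but treat that version number as an assumption and rely on what the probe reports for your environment.

```python
import importlib.util


def set_stop_words_accepted():
    """Return True/False for the installed scikit-learn, or None if it is missing."""
    if importlib.util.find_spec("sklearn") is None:
        return None

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["der Hund bellt", "die Katze schläft"]  # made-up sample text
    try:
        # Deliberately pass a set, as in the question.
        TfidfVectorizer(stop_words={"der", "die"}).fit(docs)
    except (TypeError, ValueError) as exc:
        # Versions that validate parameters reject the set here.
        print(type(exc).__name__, exc)
        return False
    return True


print(set_stop_words_accepted())
```

If it prints `False`, wrapping the set in `list(...)` as in the answer is the fix.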

Archived:	2 years, 10 months ago
Views:	5,267
Last recorded:	2 years, 10 months ago