How do I resolve the error: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None?

Seb*_*anS 2 nlp topic-modeling tfidfvectorizer

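I am trying to do topic modeling (with German stop words and German text), following Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications. First edition. Sebastopol, CA: O'Reilly Media, Inc, 2020, p. 209 ff.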

# Load Data
import pandas as pd
# Load the Excel file
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# Convert Anforderungsbeschreibung to string
df=df.astype({'Anforderungsbeschreibung':'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text:re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))


# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I get this error message:

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

InvalidParameterError                     Traceback (most recent call last)
Cell In[8], line 4
  1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
  3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
  5 tfidf_text_vectors.shape

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

Any hints would be appreciated.
Sebastian


alv*_*vas 7

The stop words you imported from spaCy are not a list.

from spacy.lang.de.stop_words import STOP_WORDS

type(STOP_WORDS)

[out]:

set

Put the stop words into a list and it should work as expected.

from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS


tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
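If the stop words may arrive in different container types (a set from spaCy, a frozenset, a tuple), the conversion can be wrapped in a small helper. This is only a sketch: `to_stop_word_list` is a hypothetical name, not part of scikit-learn or spaCy, and the sample set below is made up to stand in for `STOP_WORDS` so the snippet runs without either library installed.

```python
from collections.abc import Iterable


def to_stop_word_list(stop_words):
    """Coerce stop words into one of the types named in the error message.

    Hypothetical helper for illustration; not a scikit-learn or spaCy API.
    """
    # None and plain strings (such as 'english') are passed through unchanged.
    if stop_words is None or isinstance(stop_words, str):
        return stop_words
    # Sets, frozensets, tuples and other iterables become a sorted list,
    # which also makes the order deterministic across runs.
    if isinstance(stop_words, Iterable):
        return sorted(stop_words)
    raise TypeError(f"Unsupported stop_words type: {type(stop_words).__name__}")


# Made-up sample standing in for spaCy's STOP_WORDS set (assumption).
german_stop_words = {"und", "der", "die", "das"}
stop_word_list = to_stop_word_list(german_stop_words)
print(type(stop_word_list).__name__, stop_word_list)
```

The result could then be passed on as `TfidfVectorizer(stop_words=stop_word_list, min_df=5, max_df=0.7)`, which is the same idea as `list(STOP_WORDS)` above.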

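  • Is this a change that sklearn implemented recently? In my own experience, passing a set used to work fine. Older answers on Stack Overflow also indicate that it used to accept a frozenset, e.g. /sf/ask/1707054261/ (2 upvotes)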