Seb*_*anS 2 nlp topic-modeling tfidfvectorizer
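I am trying to do topic modeling (with German stop words and German text), following the explanation in Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications. First edition. Sebastopol, CA: O'Reilly Media, Inc, 2020, p. 209 ff.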
# Load Data
import pandas as pd
# Load the Excel file
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# Convert Anforderungsbeschreibung to string
df = df.astype({'Anforderungsbeschreibung': 'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text: re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))

# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I get this error message:
InvalidParameterError Traceback (most recent call last)
Cell In[8], line 4
 1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
 3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
 5 tfidf_text_vectors.shape

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

I would appreciate any hints.
Sebastian

The stop words you imported from spaCy are not a list:
from spacy.lang.de.stop_words import STOP_WORDS
type(STOP_WORDS)
[Out]:
set
Convert the stop words to a list and it should work as expected.
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS
tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
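As a small sketch of the same idea, the conversion can be wrapped in a helper that turns any collection of stop words into a sorted list, so the order is the same on every run. The helper name to_stop_word_list and the five-word set below are assumptions made for illustration; the set only stands in for spaCy's STOP_WORDS so the snippet runs without spaCy installed.

```python
from typing import Iterable, List, Optional


def to_stop_word_list(stop_words: Optional[Iterable[str]]) -> Optional[List[str]]:
    """Return the stop words as a sorted list, or None if none were given.

    Hypothetical helper, not part of scikit-learn or spaCy.
    """
    if stop_words is None:
        return None
    if isinstance(stop_words, str):
        # list("und") would yield single characters, so reject a bare string.
        raise TypeError("expected a collection of words, not a single string")
    # set() drops duplicates; sorted() always returns a list.
    return sorted(set(stop_words))


# Illustrative stand-in for spacy.lang.de.stop_words.STOP_WORDS (a set).
german_stop_words = {"und", "der", "die", "das", "nicht"}

stop_word_list = to_stop_word_list(german_stop_words)
print(type(stop_word_list).__name__)  # list
print(stop_word_list)  # ['das', 'der', 'die', 'nicht', 'und']
```

The resulting list can then be passed in place of the set, for example TfidfVectorizer(stop_words=stop_word_list, min_df=5, max_df=0.7) with the parameters from the question.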