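I am following this document clustering tutorial. As input, I provide a txt file, which can be downloaded here. It is a combination of 3 other txt files, separated with \n. After creating the tf-idf matrix, I received this warning: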
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words. 'stop_words.' % sorted(inconsistent))
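I guess it has something to do with the order of lemmatization and stop-word removal, but since this is my first project in text processing, I am a bit lost and I don't know how to fix this...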
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os …
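
To make my guess about the ordering concrete, here is a minimal sketch of what I think the warning is checking. It is not the tutorial's code: `toy_stem`, `tokenize_and_stem`, `inconsistent_tokens` and `preprocess_stop_words` are hypothetical names, and the toy stemmer is only a stand-in for a real one (such as an NLTK stemmer) so that the example runs with the standard library alone. The idea is that the stop words are raw ('above') while the tokenizer emits stems ('abov'), so the stems never match the list unless the stop words go through the same tokenizer first.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few suffixes only.
    for suffix in ("ing", "ly", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize_and_stem(text):
    # Lowercase, keep alphabetic runs, then stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens]


def inconsistent_tokens(stop_words, tokenizer):
    # Tokens produced from the stop words that are not themselves stop words.
    stop_set = set(stop_words)
    produced = {tok for word in stop_words for tok in tokenizer(word)}
    return sorted(produced - stop_set)


def preprocess_stop_words(stop_words, tokenizer):
    # Run the stop words through the same tokenizer used for the documents.
    return sorted({tok for word in stop_words for tok in tokenizer(word)})


raw_stop_words = ["above", "alone", "because", "the", "and"]

# Raw stop words vs. a stemming tokenizer: the stems are missing from the list.
print(inconsistent_tokens(raw_stop_words, tokenize_and_stem))

# Stem the stop words the same way, and the mismatch disappears.
stemmed_stop_words = preprocess_stop_words(raw_stop_words, tokenize_and_stem)
print(inconsistent_tokens(stemmed_stop_words, tokenize_and_stem))
```

If this reading is right, the stemmed list could be passed to `TfidfVectorizer` as `stop_words=stemmed_stop_words` together with `tokenizer=tokenize_and_stem`. With a real stemmer that is not idempotent, a few tokens might still be reported.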