Posts by Faw*_*waz

Removing overly common words (occurring in more than 80% of documents) in R

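I am using the 'tm' package to build a corpus, and I have completed most of the preprocessing steps. The one remaining step is to remove overly common words, meaning terms that occur in more than 80% of the documents. Can anyone help me with this?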

dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords, otherWords3)
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, stemDocument)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf,
                                              stopwords = FALSE))

dtm <- removeSparseTerms(dtm, 0.99)
# ^- Removes overly rare terms: those with sparsity above 0.99,
#    i.e. terms that occur in fewer than 1% of the documents
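
One possible way to handle the remaining step is to count, for each term, the number of documents that contain it and keep only the terms at or below the 80% mark. The sketch below is an illustration rather than a built-in 'tm' feature: `remove_common_terms` and `max_doc_prop` are hypothetical names, and the function assumes the input can be converted with `as.matrix()` and subset by column index (true for a plain matrix, and assumed here for a `DocumentTermMatrix`). It is demonstrated on a small hand-made matrix so that it runs without a corpus.

```r
# Hypothetical helper (not part of tm): drop terms that occur in more than
# `max_doc_prop` of the documents. `x` is a documents-by-terms matrix,
# or (by assumption) a DocumentTermMatrix.
remove_common_terms <- function(x, max_doc_prop = 0.8) {
  stopifnot(is.numeric(max_doc_prop), max_doc_prop > 0, max_doc_prop <= 1)
  counts <- as.matrix(x)            # dense copy; may be costly for very large corpora
  doc_freq <- colSums(counts > 0)   # number of documents containing each term
  keep <- which(doc_freq <= max_doc_prop * nrow(counts))
  x[, keep, drop = FALSE]
}

# Toy data: 5 documents, 3 terms.
# "the" occurs in 5/5 documents, "java" in 4/5, "rare" in 1/5.
m <- matrix(c(1, 1, 0,
              2, 1, 0,
              1, 3, 0,
              1, 1, 0,
              1, 0, 4),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("the", "java", "rare")))

filtered <- remove_common_terms(m, max_doc_prop = 0.8)
print(colnames(filtered))
```

With the matrix from the question, the call would presumably be `dtm <- remove_common_terms(dtm, 0.8)`. Alternatively, 'tm' documents a `bounds` entry in the `control` list of `DocumentTermMatrix`, for example `bounds = list(global = c(1, floor(0.8 * length(dsc))))`, which discards terms that appear in more documents than the upper bound at construction time.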

r text-mining tm

5 votes, 2 answers, 9218 views
