5 r text-analysis text-mining lda topic-modeling
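I am using the tm and lda packages in R to topic-model a corpus of news articles. However, I am getting a "non-character" problem, represented as "", that is messing up my topics. Here is my workflow: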
text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <-lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
compute.log.likelihood = TRUE, trace = 2L)
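Unfortunately, when I train the lda model, everything looks great except that the most frequently occurring word is "". I tried to remedy this by removing it from the vocabulary, as shown below, and re-estimating the model exactly as above: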
newtext <- lapply(newtext, function(x) removeWords(x, ""))
But it is still there, as this shows:
str_split(newtext[[1]], " ")
[[1]]
[1] "" "body" "mohamed" "hassan"
[5] "cook" "found" "turkish" "search"
[9] "rescue" "teams" "rescued" "hospital"
[13] "rescue" "teams" "continued" "search"
[17] "missing" "body" "cook" "crew"
[21] "wereegyptians" "sudanese" "syrians" "hassan"
[25] "cook" "cargo" "ship" "sea"
[29] "bright" "crashed" "thursday" "port"
[33] "antalya" "southern" "turkey" "vessel"
[37] "collided" "rocks" "port" "thursday"
[41] "night" "result" "heavy" "winds"
[45] "waves" "crew" ""
Any suggestions on how to go about removing this? Adding "" to my list of stopwords doesn't help either.
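I deal with text a lot, but not with tm, so here are 2 approaches to get rid of the "" you have. The extra "" elements are likely caused by a double space between sentences. You can treat this condition before or after you turn the text into a bag of words: either replace every run of two or more spaces with a single space before the strsplit, or drop the empty strings afterward (you have to unlist after strsplit).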
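# Note the double space after "bicycle."
x <- "I like to ride my bicycle.  Do you like to ride too?"

# OPTION 1: treat before splitting (collapse runs of spaces)
a <- gsub(" +", " ", x)
strsplit(a, " ")

# OPTION 2: treat after splitting (drop the empty strings)
y <- unlist(strsplit(x, " "))
y[!y %in% ""]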
You might also try:
newtext <- lapply(newtext, function(x) gsub(" +", " ", x))
Again, I don't use tm, so this may not be of help, but this post hadn't seen any action, so I figured I'd share some possibilities.
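One more possibility, offered as a guess from the output in the question: the "" entries sit at the first and last positions, which points to leading and trailing spaces rather than only doubled spaces. Collapsing whitespace leaves a single space at each end, so a split on " " can still produce an empty token there. Below is a minimal base R sketch; the `doc` string is made up for illustration, and `trimws` needs R 3.2.0 or later.

```r
# Hypothetical sample with a leading space, a trailing space, and a double space
doc <- " body mohamed hassan  cook found "

# Splitting as-is produces empty tokens
raw_tokens <- strsplit(doc, " ")[[1]]

# Collapse runs of whitespace to one space, then trim both ends
clean <- trimws(gsub("\\s+", " ", doc))
clean_tokens <- strsplit(clean, " ")[[1]]

print(raw_tokens)
print(clean_tokens)
```

If the elements of `newtext` are plain character strings, as they appear to be in the question, something like `newtext <- lapply(newtext, trimws)` just before the `unlist` step might be enough, though I have not tested that against tm.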