Text analysis in R using LDA and tm

Question

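Hey guys, I'm having a bit of trouble running LDA: for some reason, I get an error as soon as I'm ready to run the analysis. I'll do my best to walk through what I'm doing. Unfortunately I can't provide the data, because the data I'm using is proprietary.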

dataset <- read.csv("proprietarydata.csv")

First, I do some cleanup; data$text and post are of class character.

dataset$text <- as.character(dataset$text) 
post <- gsub("[^[:print:]]"," ",data$Post.Content)
post <- gsub("[^[:alnum:]]", " ",post)

post ends up looking like this:

[1] "here is a string"
[2] "here is another string"
etc....

I then created the following function, which does some more cleaning:

createdtm <- function(x) {
  myCorpus <- Corpus(VectorSource(x))
  myCorpus <- tm_map(myCorpus, PlainTextDocument)
  docs <- tm_map(myCorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind = "SMART"))
  docs <- tm_map(docs, removeWords, c("the", " the", "will", "can", "regards", "need", "thanks", "please", "http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}

predtm <- createdtm(post)

This ends up returning a corpus that gives something like this for each document:

[[1]]
<<PlainTextDocument (metadata: 7)>>
Here text string


[[2]]
<<PlainTextDocument (metadata: 7)>>
Here another string

I then get ready for LDA by creating a DocumentTermMatrix:

dtm <- DocumentTermMatrix(predtm)
inspect(dtm)


<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity           : 100%
Maximal term length: 86
Weighting          : term frequency (tf)

Docs           truclientrre truddy trudi trudy true truebegin truecontrol
              Terms
Docs           truecrypt truecryptas trueimage truely truethis trulibraryref
              Terms
Docs           trumored truncate truncated truncatememory truncates
              Terms
Docs           truncatetableinautonomoustrx truncating trunk trunkhyper
              Terms
Docs           trunking trunkread trunks trunkswitch truss trust trustashtml
              Terms
Docs           trusted trustedbat trustedclient trustedclients
              Terms
Docs           trustedclientsjks trustedclientspwd trustedpublisher
              Terms
Docs           trustedreviews trustedsignon trusting trustiv trustlearn
              Terms
Docs           trustmanager trustpoint trusts truststorefile truststorepass
              Terms
Docs           trusty truth truthfully truths tryd tryed tryig tryin tryng

This looks really odd, but it's the way I've always done it. So I went ahead and did the following:

run.lda <- LDA(dtm,4)

This returns my first error:

  Error in LDA(dtm, 4) : 
  Each row of the input matrix needs to contain at least one non-zero entry

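After researching this error, I found the post "Remove empty documents from DocumentTermMatrix in R topicmodels?". I assumed I had everything under control and got excited, so I followed the steps in the link.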

This runs:

rowTotals <- apply(dtm , 1, sum)

This does not:

dtm.new   <- dtm[rowTotals> 0]

It returns:

  Error in `[.simple_triplet_matrix`(dtm, rowTotals > 0) : 
  Logical vector subscripting disabled for this object.

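I know I might take some heat for this, since some people will say it isn't a reproducible example. Please feel free to ask anything about this problem. This is the best I can do.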

Answer 1

MrF*_*ick 5

It really shouldn't be that hard to create a minimal reproducible example. For example:

library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))

dtm <- DocumentTermMatrix(tm)

LDA(dtm,4)

# Error in LDA(dtm, 4) : 
#   Each row of the input matrix needs to contain at least one non-zero entry

It also shouldn't be hard to remember how to properly subset a matrix (by specifying [row, col], not just [index]).

rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals>0,]
LDA(dtm, 4)

#A LDA_VEM topic model with 4 topics.
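
To see why the comma matters, here is a small sketch on a plain base-R matrix. The toy matrix `m` and the names `keep`, `by_row` and `by_index` are made up for illustration and are not from the question. With the comma, the logical vector selects rows; without it, R treats the matrix as one long vector. A DocumentTermMatrix refuses that second form outright, which is the "Logical vector subscripting disabled" error shown in the question.

```r
# Toy count matrix (illustrative only): 3 documents (rows) x 2 terms (columns).
# matrix() fills column by column, so doc2 ends up with no terms at all.
m <- matrix(c(1, 0, 2,
              0, 0, 3),
            nrow = 3,
            dimnames = list(c("doc1", "doc2", "doc3"), c("hello", "goodbye")))

# Logical vector marking the documents that contain at least one term
keep <- rowSums(m) > 0

# [row, col] form: selects whole rows, result is still a matrix
by_row <- m[keep, ]

# [index] form: the logical vector is recycled over every cell,
# so the result is a plain vector with no row/column structure
by_index <- m[keep]

dim(by_row)
dim(by_index)
```

On a base matrix the second form silently returns the wrong shape, so the error raised for a DocumentTermMatrix is arguably the friendlier outcome.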

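Please take the time to create reproducible examples. Often, in doing so, you will discover your own mistake and can easily fix it. At the very least, it will help others see the problem more clearly and strip out unnecessary information.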
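
One more note for a matrix as large as the one in the question (14640 x 39972): apply() converts its input to an ordinary dense matrix before summing, which can use a lot of memory. A possible alternative, sketched here on the assumption that the slam package (which tm builds on) is installed, is slam::row_sums(), which works on the sparse representation directly. The variable names `totals`, `empty_docs` and `dtm_nonempty` are illustrative.

```r
library(tm)

# Same minimal corpus as above: the second document is empty
raw <- c("hello", "", "goodbye")
dtm <- DocumentTermMatrix(Corpus(VectorSource(raw)))

# Per-document term totals, computed on the sparse matrix
totals <- slam::row_sums(dtm)

# Positions of the documents with no terms (the ones LDA complains about)
empty_docs <- which(totals == 0)

# Row subsetting with [row, col] keeps only the non-empty documents
dtm_nonempty <- dtm[totals > 0, ]

nDocs(dtm_nonempty)
```

Keeping `empty_docs` around can also be handy if the topic assignments later need to be matched back to the original rows of the data.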
