The math behind tm::findAssocs: how does this function work?

use*_*507 3 r text-mining

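I have been using findAssocs() from the text-mining package (tm), but I realized that something seems off with my dataset.

My dataset is 1500 open-ended answers saved in one column of a csv file. So I loaded the dataset like this and used the typical tm_map calls to turn it into a corpus.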

library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
# content_transformer() keeps the documents intact when applying a base R function
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

findAssocs(dtm, "like", .2)
> cousin  fill  ....
  0.28    0.20      

Q1. When I look up the terms associated with like, I don't see like = 1 as part of the output. However,

dtm.df <-as.data.frame(inspect(dtm))

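This data frame consists of 1500 obs. of 1689 variables. (Or is it because the data is saved in one row of the csv file?)

Q2. Even though cousin and fill each appeared once when the target term like appeared once, their scores differ as shown above. Shouldn't they be the same?

I have been trying to find the math behind findAssocs() but without success so far. Any advice is highly appreciated!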

Bru*_*.Ca 9

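I don't think anyone has answered your final question:

"I have been trying to find the math behind findAssocs() but without success so far. Any advice is highly appreciated!"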

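The math of findAssocs() is based on the standard function cor() in R's stats package. Given two numeric vectors, cor() computes their covariance divided by the product of their two standard deviations.

So, given a DocumentTermMatrix dtm containing the terms "word1" and "word2", if findAssocs(dtm, "word1", 0) returns "word2" with a value of x, then x is the correlation between the term vectors of "word1" and "word2", rounded to two decimal places.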
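Written out, the quantity cor() computes by default is the Pearson correlation. The symbols below are notation introduced here for illustration: x_i and y_i are the counts of the two terms in document i, n is the number of documents, and the bars denote means over documents.

```latex
r_{xy}
  = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The factors of 1/(n-1) in the sample covariance and the sample standard deviations cancel, which is why they do not appear in the right-hand form.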

For a longer example:

> data <-  c("", "word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5") 
> dtm <- DocumentTermMatrix(VCorpus(VectorSource(data)))
> as.matrix(dtm)
    Terms
Docs word1 word2 word3 word4 word5
   1     0     0     0     0     0
   2     1     0     0     0     0
   3     1     1     0     0     0
   4     1     1     1     0     0
   5     1     1     1     1     0
   6     1     1     1     1     1
> findAssocs(dtm, "word1", 0) 
$word1
word2 word3 word4 word5 
 0.63  0.45  0.32  0.20 

> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word2"])
[1] 0.6324555
> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word3"])
[1] 0.4472136

And so on for words 4 and 5.
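The remaining values can be checked without tm at all. The following is a minimal sketch in base R, assuming the term counts are typed in by hand from the as.matrix(dtm) output above; manual_cor is a hypothetical helper name introduced here, not a tm or stats function.

```r
# Term counts copied from the as.matrix(dtm) output above:
# rows are documents 1-6, columns are word1..word5.
m <- matrix(c(0, 0, 0, 0, 0,
              1, 0, 0, 0, 0,
              1, 1, 0, 0, 0,
              1, 1, 1, 0, 0,
              1, 1, 1, 1, 0,
              1, 1, 1, 1, 1),
            nrow = 6, byrow = TRUE,
            dimnames = list(Docs = as.character(1:6),
                            Terms = paste0("word", 1:5)))

# Hypothetical helper: Pearson correlation written out as
# covariance divided by the product of the two standard deviations.
manual_cor <- function(x, y) cov(x, y) / (sd(x) * sd(y))

# Correlate word1 with each of the other terms.
others <- setdiff(colnames(m), "word1")
assoc <- sapply(others, function(term) manual_cor(m[, "word1"], m[, term]))

round(assoc, 2)
# word2 word3 word4 word5
#  0.63  0.45  0.32  0.20
```

The rounded values agree with the findAssocs(dtm, "word1", 0) output shown above.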

See also http://r.789695.n4.nabble.com/findAssocs-tt3845751.html#a4637248


42-*_*42- 7

 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs )
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))

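That is where the self-reference gets removed: the target term's column (ind) is correlated only against the other columns (!ind), so the term never appears in its own results.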

    findAssocs(x.cor, term, corlimit)
}
<environment: namespace:tm>
#-------------
 getAnywhere(findAssocs.matrix)
#-------------
A single object matching ‘findAssocs.matrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
sort(round(x[term, which(x[term, ] > corlimit)], 2), decreasing = TRUE)
<environment: namespace:tm>
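To see both steps together, here is a rough sketch that mirrors the two methods printed above on a plain numeric matrix (documents in rows, terms in columns). It is not the tm implementation itself: find_assocs_sketch is a hypothetical name, and the matrix reuses the toy word1..word5 data from the other answer rather than a real DocumentTermMatrix.

```r
# Toy data: document i contains word1..word(i-1), as in the other answer.
m <- outer(1:6, 1:5, ">") * 1
dimnames(m) <- list(Docs = as.character(1:6), Terms = paste0("word", 1:5))

# Hypothetical helper (not part of tm) that follows the two printed methods.
find_assocs_sketch <- function(m, term, corlimit) {
  ind <- colnames(m) == term
  # Step 1: correlate the target column with every *other* column,
  # which is why the target term never shows up with a score of 1.
  x.cor <- suppressWarnings(
    cor(m[, ind, drop = FALSE], m[, !ind, drop = FALSE])
  )
  # Step 2: keep correlations strictly above the limit, round, and sort.
  keep <- which(x.cor[term, ] > corlimit)
  sort(round(x.cor[term, keep], 2), decreasing = TRUE)
}

res <- find_assocs_sketch(m, "word1", 0.3)
res
# word2 word3 word4
#  0.63  0.45  0.32
```

Note that in the printed source the comparison against corlimit is strict (>) and happens before rounding, so a term can be displayed with a value that looks equal to the limit, like fill at 0.20 in the question.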