Gro*_*ote 2 r text-mining term-document-matrix
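In R, I use the [tm package][1] to build a term-document matrix from a corpus of documents.
My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. I am therefore looking for a variable that holds all row.names of the matrix, so that the function findAssocs() can do its job.
This is my code so far: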
library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = TRUE, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))
...further preprocessing
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))
#term argument holds two words since the BigramTokenizer extracted all pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi cat scratch ...
0.96 0.91
I tried to define a variable holding all row.names of txtTdmBi and pass it to the findAssocs() function. However, the result was:
allRows <- c(row.names(txtTdmBi))
findAssocs(txtTdmBi, allRows, 0.5)
Error in which(x[term, ] > corlimit) : subscript out of bounds
In addition: Warning message:
In term == Terms(x) :
longer object length is not a multiple of shorter object length
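The warning above suggests that this version of findAssocs() compares `term` with `Terms(x)` using `==`, which only makes sense for a single term. A minimal base-R sketch of why a vector of terms goes wrong there; `terms` and `query` are made-up stand-ins, not values from the data:

```r
# Toy term vector standing in for Terms(txtTdmBi); the values are made up.
terms <- c("cat shop", "cat cabi", "cat scratch")
query <- c("cat shop", "cat scratch")

# `==` pairs elements by position and recycles the shorter vector,
# so it does not test membership (R also warns about the lengths).
recycled <- suppressWarnings(query == terms)

# `%in%` tests each term for membership in the query.
member <- terms %in% query

print(recycled)
print(member)
```

The two results differ, which is consistent with the "subscript out of bounds" error that follows the warning.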
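Since extracting associations for a term across multiple term-document matrices is already explained here, I assume it is possible to find the associations for multiple terms in a single term-document matrix. But how?
I hope someone can tell me how to solve this. Thanks in advance for any support.
If I understand correctly, an lapply solution is probably the way to answer your question. This is the same approach as the answer you link to, but here is a self-contained example that may be closer to your use case:
Load the libraries and reproducible data (please include these in your questions here)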
library(tm)
library(RWeka)
data(crude)
Your bigram tokenizer...
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Check that it worked by inspecting a random sample...
inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)

Non-/sparse entries: 1/35
Sparsity           : 97%
Maximal term length: 18
Weighting          : term frequency (tf)

                    Docs
Terms                248 273 349 352 353 368
  for their            0   0   0   0   0   0
  for west             0   0   0   0   0   0
  forced it            0   0   0   0   0   0
  forced to            0   0   0   0   0   0
  forces trying        1   0   0   0   0   0
  foreign investment   0   0   0   0   0   0
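Here is the answer to your question:
Now use the lapply function to compute the associated words for every item in the vector of terms in the term-document matrix. The simplest way to access the vector of terms is txtTdmBi$dimnames$Terms. For example, txtTdmBi$dimnames$Terms[[1005]] is "foreign investment".
Here I have used llply from the plyr package so we can have a progress bar (comforting for big jobs), but it is basically the same as the base lapply function.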
library(plyr)
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )
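For readers without tm and RWeka installed, the same one-call-per-term pattern can be sketched in base R. This is only an illustration under stated assumptions: the matrix `m` is made up, and `assoc_for_term` is a hypothetical stand-in for findAssocs() built on `cor()`, not tm's implementation.

```r
# Toy term-document matrix (rows = terms, columns = documents); values are made up.
m <- rbind(
  "cat shop"    = c(1, 0, 2, 0),
  "cat scratch" = c(2, 0, 4, 0),
  "dog park"    = c(0, 3, 0, 1)
)

# Hypothetical stand-in for findAssocs(): correlate one term's row with every
# other row and keep the correlations above `corlimit`, largest first.
assoc_for_term <- function(m, term, corlimit) {
  others <- m[rownames(m) != term, , drop = FALSE]
  r <- cor(t(others), m[term, ])[, 1]
  sort(round(r[r > corlimit], 2), decreasing = TRUE)
}

# One call per term, with the result list named by term.
assocs <- lapply(rownames(m), function(i) assoc_for_term(m, i, 0.5))
names(assocs) <- rownames(m)

print(assocs[["cat shop"]])
```

Naming the list by term means a result can be looked up by name rather than by position. With the real data, something like `names(dat) <- txtTdmBi$dimnames$Terms` should allow `dat[["foreign investment"]]`.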
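The output is a list in which each item is a named numeric vector: the names are the terms and the numbers are the correlation values. For example, to see the terms associated with "foreign investment", we can access the list like so: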
dat[[1005]]
Here are the terms associated with that term (I have pasted in only the first few):
168 million          1986 was             1987 early           300 mln              31 pct
1.00                 1.00                 1.00                 1.00                 1.00
a bit                a crossroads         a leading            a political          a population
1.00                 1.00                 1.00                 1.00                 1.00
a reduced            a series             a slightly           about zero           activity continues
1.00                 1.00                 1.00                 1.00                 1.00
advisers are         agricultural sector  agriculture the      all such             also reviews
1.00                 1.00                 1.00                 1.00                 1.00
and advisers         and attract          and imports          and liberalised      and steel
1.00                 1.00                 1.00                 1.00                 1.00
and trade            and virtual          announced since      appears to           are equally
1.00                 1.00                 1.00                 1.00                 1.00
are recommending     areas for            areas of             as it                as steps
1.00                 1.00                 1.00                 1.00                 1.00
asia with            asian member         assesses indonesia   attract new          balance of
1.00                 1.00                 1.00                 1.00                 1.00
Is that what you wanted to do?
Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:
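The question also asks for only the top three or so associations per term. A minimal sketch of that last step, assuming the result is a list of named numeric vectors as described above (the shape may differ in other versions of tm); `dat_example` and `top_n_assocs` are hypothetical names, and apart from the two values quoted in the question the numbers are made up:

```r
# Made-up association vectors in the shape described above:
# one named numeric vector per term.
dat_example <- list(
  "cat shop" = c("cat cabi" = 0.96, "cat scratch" = 0.91,
                 "cat food" = 0.74, "cat toy" = 0.62),
  "dog park" = c("dog walk" = 0.88)
)

# Keep at most the n strongest associations per term.
# Sorting first makes this independent of the input order.
top_n_assocs <- function(x, n = 3) head(sort(x, decreasing = TRUE), n)
top3 <- lapply(dat_example, top_n_assocs)

print(top3)
```

Terms with fewer than three associations simply keep what they have.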
# u is a term document matrix
# term is your term
# corlimit is a value -1 to 1
findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <- gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),
                                        as.matrix(t(u[ u$dimnames$Terms == term, ])) ))
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}
It can be used like this:
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )
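The advantage of this is that it uses a different method of converting the TDM to a matrix than tm::findAssocs does. This different method uses memory more efficiently, and so prevents messages like Error: cannot allocate vector of size 1.9 Gb from occurring.
A quick benchmark shows that the two findAssocs functions are about the same speed, so the main difference is in memory use: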
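To get a rough feel for when a dense conversion runs out of memory, the size of a fully dense numeric matrix can be estimated as rows times columns times 8 bytes per double. A small sketch under that assumption; `dense_size_gb` is a hypothetical helper and the dimensions are invented for illustration:

```r
# Rough size of a dense numeric matrix: rows * cols * 8 bytes per double.
dense_size_gb <- function(n_terms, n_docs) {
  n_terms * n_docs * 8 / 2^30
}

# Hypothetical dimensions, for illustration only.
print(dense_size_gb(250000, 1000))
```

With a real matrix, `dense_size_gb(nrow(txtTdmBi), ncol(txtTdmBi))` should give the corresponding estimate before any conversion is attempted.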
library(microbenchmark)
microbenchmark(
  dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
  dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
  times = 10)

Unit: seconds
                                                                                expr      min       lq   median       uq      max neval
 dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)) 10.82369 11.03968 11.25492 11.39326 11.89754    10
     dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)) 10.70980 10.85640 11.14156 11.18877 11.97978    10