Posts by Gro*_*ote

findAssocs for multiple terms in R

In R, I am using the [tm package][1] to build a term-document matrix from a corpus of documents.

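My goal is to extract word associations for all bigrams in the term-document matrix and to return the top three or so for each bigram. I am therefore looking for a variable that holds all row.names of the matrix, so that findAssocs() can do its job on every term.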

This is my code so far:

library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
txtCorpus <- Corpus(VectorSource(txtData$text))

# ... further preprocessing

# Bigram tokenizer, passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))

# The term argument holds two words because BigramTokenizer extracted all word pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi  cat scratch  ...
    0.96 …
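
One possible way to get associations for every bigram is to take the terms from the matrix with Terms() (equivalent to rownames()) and call findAssocs() once per term, keeping the first three entries of each result. The sketch below is only an illustration under assumptions: the toy documents are made up, the tokenizer uses ngrams() from the NLP package (as in the tm FAQ) instead of RWeka so it runs without Java, and VCorpus is used because newer tm versions appear to ignore custom tokenizers on the default simple corpus. The per-term loop is a deliberate choice: as far as I can tell from the tm source, when several terms are passed in a single call, correlations are computed only against terms that are not part of the query, so passing all terms at once would leave nothing to compare against.

```r
library(tm)

# Hypothetical toy documents standing in for txtData$text
docs <- c("cat shop sells cat scratch posts",
          "cat shop sells cat cabinet",
          "dog park near the river",
          "cat scratch posts and cat cabinet")
txtCorpus <- VCorpus(VectorSource(docs))

# Bigram tokenizer based on NLP::ngrams (NLP is attached together with tm)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

txtTdmBi <- TermDocumentMatrix(txtCorpus,
                               control = list(tokenize = BigramTokenizer))

# All bigrams in the matrix; same values as rownames(txtTdmBi)
allTerms <- Terms(txtTdmBi)

# One findAssocs() call per term; results are sorted by decreasing
# correlation, so head() keeps the strongest (at most three) associations
top3 <- lapply(setNames(allTerms, allTerms), function(term) {
  head(findAssocs(txtTdmBi, term, 0.5)[[1]], 3)
})

print(top3[["cat shop"]])
```

With a large vocabulary this loop can be slow, since each call computes the correlations of one term against all others. Restricting allTerms to frequent bigrams first, for example with findFreqTerms(), may help.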

r text-mining term-document-matrix

2 votes
1 answer
7879 views
