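In R, I use the [tm package][1] to build a term-document matrix from a corpus of documents.
My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. So I am looking for a variable that holds all the row.names of the matrix, so that the function findAssocs() can do its job.
This is my code so far: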
library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = TRUE, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))
# ... further preprocessing
# Tokenizer for bigrams, passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))
# The term argument holds two words because BigramTokenizer extracted all word pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi cat scratch ...
0.96 …
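
For illustration, here is a minimal sketch of what I am after, under some assumptions. Terms() from tm (or rownames()) should return every bigram in the matrix, and the loop calls findAssocs() one term at a time and keeps the strongest few hits. The toy corpus, the helper name topAssocs, and the NLP-based tokenizer (swapped in for RWeka so the sketch runs without Java) are all hypothetical and not part of my real code. Depending on the tm version, findAssocs() may return either a list or a plain named vector, so the helper handles both.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical toy corpus standing in for txtData$text
docs <- c("the cat shop sells cat scratch posts",
          "cat shop owners like cat scratch posts",
          "dog park visitors like long walks",
          "the dog park is open")
corpus <- VCorpus(VectorSource(docs))

# Pure-R bigram tokenizer (assumption: used here instead of RWeka's NGramTokenizer)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# All row names of the matrix, i.e. every bigram
allTerms <- Terms(tdm)

# Hypothetical helper: top n associations for a single term
topAssocs <- function(tdm, term, corlimit = 0.5, n = 3) {
  res <- findAssocs(tdm, term, corlimit)
  if (is.list(res)) res <- res[[1]]   # newer tm versions return a named list
  head(sort(res, decreasing = TRUE), n)
}

# One entry per bigram, each holding at most n associated bigrams
assocList <- setNames(lapply(allTerms, function(term) topAssocs(tdm, term)),
                      allTerms)
```

Calling assocList[["cat shop"]] would then give the strongest associations for that bigram. On a large corpus this loop is likely to be slow, since every call recomputes correlations against all other terms.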