Building a correlation matrix of all word combinations in R using findAssocs

DrP*_*lla 1 text r correlation tm

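I am trying to write code that builds a table showing the correlations between all pairs of words in a corpus.

I know I can use findAssocs from the tm package to find all word correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) would give me all words whose correlation with "quick" is above 0.5, but I don't want to do this manually for every word in my text.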

# Load the .csv file into R (no header row)
library(tm)
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE)
corp <- Corpus(DataframeSource(x))

# Clean up the text, then build the document-term matrix from the cleaned corpus
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)

From here I can find the word correlations for an individual word:

findAssocs(dtm, "quick", 0.4)

But I want to find all of the correlations at once, like this:

       quick  easy   the   and 
quick   1.00  0.54  0.72  0.92     
 easy   0.54  1.00  0.98  0.54   
  the   0.72  0.98  1.00  0.05  
  and   0.92  0.54  0.05  1.00

Any suggestions?

Sample of the "TESTER.csv" data file (starting from cell A1):

[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly

luk*_*keA 5

You could probably use as.matrix together with cor. For comparison, findAssocs with a lower limit of 0:

(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#               all along
#  there       1.00  1.00
#  information 0.65  0.65
#  needed      0.65  0.65
#  the         0.47  0.47
#  was         0.47  0.47

cor gives you all the Pearson correlations, for what it's worth:

cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
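If a long table of word pairs is easier to work with than the full square matrix, something along these lines might help. It is only a sketch in base R: the small matrix m is a made-up stand-in for as.matrix(dtm), and the names pairs, strong and min_cor are hypothetical, with min_cor playing the role of the corlimit argument of findAssocs.

```r
# Toy document-term counts standing in for as.matrix(dtm);
# rows are documents, columns are terms (values are made up for illustration).
m <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    0, 1, 1,
    1, 1, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("quick", "easy", "the"))
)

# Full term-by-term Pearson correlation matrix (symmetric, 1 on the diagonal).
cor_all <- cor(m)

# Keep each unordered pair once by taking the upper triangle only.
idx <- which(upper.tri(cor_all), arr.ind = TRUE)
pairs <- data.frame(
  term1 = rownames(cor_all)[idx[, "row"]],
  term2 = colnames(cor_all)[idx[, "col"]],
  cor   = cor_all[idx],
  stringsAsFactors = FALSE
)

# Hypothetical threshold, mirroring the corlimit argument of findAssocs.
min_cor <- 0.1
strong <- pairs[pairs$cor >= min_cor, ]
strong[order(-strong$cor), ]
```

With a real corpus you would replace m with as.matrix(dtm). Note that the dense matrix grows with the square of the vocabulary size, so this may not be practical for very large corpora.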

Preceding code:

x <- readLines(n = 7)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
library(tm)
corp <- Corpus(VectorSource(x))
# Clean the text first, then build the document-term matrix from the cleaned corpus
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)