leo*_*oce 5 r text-mining network-analysis
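I have a large set of sentences, and I want to build an undirected edge list of word co-occurrences and see the frequency of each edge. I looked at the tm package but could not find a function for this. Is there a package or script I can use? Many thanks!
Note: A word does not co-occur with itself. A word that appears two or more times in a sentence co-occurs with each other word in that sentence only once.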
DF:
sentence_id text
1 a b c d e
2 a b b e
3 b c d
4 a e
5 a
6 a a a
OUTPUT
word1 word2 freq
a b 2
a c 1
a d 1
a e 3
b c 2
b d 2
b e 2
c d 2
c e 1
d e 1
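The counting rule in the note can be sketched directly in base R: de-duplicate the words of each sentence, enumerate the unordered pairs with `combn()`, and count how many sentences contain each pair. This is only a minimal sketch, not a tuned solution for a large corpus; the helper name `sentence_pairs` and the variable names are hypothetical.

```r
# Sample data from the question (sentences 5 and 6 contribute no pairs).
df <- data.frame(
  sentence_id = 1:6,
  text = c("a b c d e", "a b b e", "b c d", "a e", "a", "a a a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered pairs of distinct words in one sentence.
sentence_pairs <- function(txt) {
  # unique() makes a repeated word count once; sort() fixes the pair order
  words <- sort(unique(strsplit(trimws(txt), "\\s+")[[1]]))
  if (length(words) < 2) return(NULL)  # a lone word has no partner
  pairs <- t(combn(words, 2))
  data.frame(word1 = pairs[, 1], word2 = pairs[, 2], stringsAsFactors = FALSE)
}

# Collect the pairs of every sentence, dropping sentences without pairs.
pair_list <- Filter(Negate(is.null), lapply(df$text, sentence_pairs))
pairs <- do.call(rbind, pair_list)

# Count the number of sentences in which each pair occurs.
edges <- aggregate(freq ~ word1 + word2,
                   data = cbind(pairs, freq = 1L), FUN = sum)
edges <- edges[order(edges$word1, edges$word2), ]
rownames(edges) <- NULL
edges
```

Because each sentence is expanded into all of its word pairs, the intermediate table grows quadratically with sentence length, so this form is mainly useful for checking results on small inputs.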
This is convoluted, so there must be a better way:
dat <- read.csv(text="sentence_id, text
1, a b c d e
2, a b b e
3, b c d
4, a e", header=TRUE)
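library(qdap)       # bag_o_words()
library(qdapTools)  # mtabulate(), matrix2df()
library(tidyr)      # gather(), %>%
# Word-by-sentence incidence matrix: TRUE if the word occurs in the sentence
x <- t(mtabulate(with(dat, by(text, sentence_id, bag_o_words))) > 0)
# Entry [i, j] counts the sentences that contain both word i and word j
out <- x %*% t(x)
# Keep only the lower triangle: drops self-pairs and mirrored duplicates
out[upper.tri(out, diag=TRUE)] <- NA
out2 <- matrix2df(out, "word1") %>%
    gather(word2, freq, -word1) %>%
    na.omit()
rownames(out2) <- NULL
out2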
## word1 word2 freq
## 1 b a 2
## 2 c a 1
## 3 d a 1
## 4 e a 3
## 5 c b 2
## 6 d b 2
## 7 e b 2
## 8 d c 2
## 9 e c 1
## 10 e d 1
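The step that does the counting above is `x %*% t(x)`. With a 0/1 word-by-sentence matrix, entry [i, j] of the product is the number of sentences in which word i and word j both occur, and the diagonal holds each word's own sentence count. A small sketch of that idea, using a hand-built toy matrix for the words a, b and e in sentences 1 to 4 (the object names are illustrative only):

```r
# Toy incidence matrix: rows are words, columns are sentences 1 to 4.
# 1 means the word occurs at least once in that sentence.
inc <- rbind(
  a = c(1, 1, 0, 1),
  b = c(1, 1, 1, 0),
  e = c(1, 1, 0, 1)
)

# Row i times row j sums the sentences where both words are present.
co <- inc %*% t(inc)
co

# Off-diagonal entries are the edge frequencies.
co["a", "b"]  # sentences 1 and 2
co["a", "e"]  # sentences 1, 2 and 4
```

The product is symmetric, which is why the answer blanks the upper triangle and the diagonal before reshaping it into an edge list.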
Base R only solution
# Split each sentence into a vector of words
out <- lapply(with(dat, split(text, sentence_id)), function(x) {
    strsplit(gsub("^\\s+|\\s+$", "", as.character(x)), "\\s+")[[1]]
})
nms <- sort(unique(unlist(out)))
# Per-sentence word counts
out2 <- lapply(out, function(x) {
    as.data.frame(table(x), stringsAsFactors = FALSE)
})
# Merge into one table: one row per word, one column per sentence
dat2 <- data.frame(x = nms)
for(i in seq_along(out2)) {
    m <- merge(dat2, out2[[i]], all.x = TRUE)
    names(m)[i + 1] <- dat[["sentence_id"]][i]
    dat2 <- m
}
dat2[is.na(dat2)] <- 0
# Incidence matrix times its transpose gives the co-occurrence counts
x <- as.matrix(dat2[, -1]) > 0
out3 <- x %*% t(x)
out3[upper.tri(out3, diag=TRUE)] <- NA
dimnames(out3) <- list(dat2[[1]], dat2[[1]])
# Reshape the lower triangle into an edge list
out4 <- na.omit(data.frame(
    word1 = rep(rownames(out3), ncol(out3)),
    word2 = rep(colnames(out3), each = nrow(out3)),
    freq = c(unlist(out3)),
    stringsAsFactors = FALSE)
)
row.names(out4) <- NULL
out4