app*_*ree 24 r data-mining text-mining
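I'm trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package (or perhaps another package I don't know about). I've been trying to use the tokenizer, but with no luck.
If you have dealt with a similar situation in the past, could you post code that is tested and actually works? Thank you so much!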
Tim*_*rka 11
You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the package tau installed it's fairly straightforward.
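library(tm); library(tau)
# Tokenizer that returns the n-word phrases found in a document
tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(as.character(x),method="string",n=n)))))
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
# VCorpus is used here because recent tm versions ignore custom tokenizers on the SimpleCorpus that Corpus() can return
corpus <- VCorpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))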
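Here n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the package RTextTools, which simplifies things further.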
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This returns an object of class DocumentTermMatrix for use with the package tm.
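If you only need the ranked phrase counts and would rather not install anything, the same idea can be sketched in base R. This is a minimal sketch, not how tm or tau do it internally: `count_ngrams` is a hypothetical helper name, and the tokenization (lowercase, strip punctuation, split on whitespace) is an assumption that may be too crude for real text.

```r
# Hypothetical helper: count n-word phrases across a character vector of texts.
count_ngrams <- function(texts, n = 2) {
  grams <- unlist(lapply(texts, function(txt) {
    # Lowercase, drop punctuation, then split on runs of whitespace
    words <- strsplit(gsub("[[:punct:]]", "", tolower(txt)), "[[:space:]]+")[[1]]
    words <- words[nchar(words) > 0]
    if (length(words) < n) return(character(0))
    # Paste together each window of n consecutive words
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Frequency table, most common phrases first
  sort(table(grams), decreasing = TRUE)
}

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
bigrams  <- count_ngrams(texts, n = 2)
trigrams <- count_ngrams(texts, n = 3)
head(bigrams, 5)
head(trigrams, 5)
```

Phrases are counted within each text separately, so a phrase never spans two documents.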
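5. Can I use bigrams instead of single tokens in a term-document matrix?
Yes. RWeka provides a tokenizer for arbitrary n-grams which can be passed directly to the term-document matrix constructor. For example: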
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
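To get from that matrix to the most frequent phrases, one option is to sum each term's counts across documents and sort. The following is a sketch under the assumption that tm and RWeka (which needs a working Java installation) are available; the cutoff of 10 is an arbitrary value chosen for illustration.

```r
library("RWeka")
library("tm")
data("crude")

# Bigram tokenizer, as in the FAQ example above
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Total count of each bigram over all documents, most frequent first
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)

# tm can also list every term that occurs at least `lowfreq` times
findFreqTerms(tdm, lowfreq = 10)
```

Setting `min = 3, max = 3` in `Weka_control` gives three-word phrases instead. For a large corpus, `as.matrix(tdm)` can use a lot of memory, so this approach suits small to medium collections.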
This is something I wrote myself for a different purpose, but I think it may fit your needs as well:
#User Defined Functions
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE))
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
    strp <- function(x, digit.remove, apostrophe.remove){
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
        ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x =x, digit.remove = digit.remove,
        apostrophe.remove = apostrophe.remove)) ))
}
unblanker <- function(x)subset(x, nchar(x)>0)
#Fake Text Data
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"
#The code using Base R to Do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters%in%2:3)
Run Code Online (Sandbox Code Playgroud)