use*_*388 13

Tags: r, corpus, term-document-matrix, qdap, data.table
My file has more than 4M rows, and I need a more efficient way to convert my data into a corpus and document-term matrix so that I can pass it to a Bayesian classifier.

Consider the following code:
library(tm)

GetCorpus <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = F)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
Output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1
My question is: what can I use to create a corpus and DTM faster? It seems extremely slow when I use more than 300k rows.

I have heard that I could use data.table, but I am not sure how. I have also looked at the qdap package, but it throws an error when I try to load it, and I don't even know whether it would work.
Ken*_*oit 16
data.table is definitely the right way to go. Regular-expression operations are slow, although those in stringi are much faster (in addition to being much better).

I went through many iterations of solving this problem when creating quanteda::dfm() for my quanteda package (see the GitHub repo here). By far the fastest solution involves using the data.table and Matrix packages to index the documents and tokenised features, to count the features within documents, and to plug the result straight into a sparse matrix.
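To make the idea concrete before the full function further down, here is a minimal sketch of that approach on toy data (my own illustration with made-up tokens, not quanteda's actual code): build one row per (document, token), count by group with data.table, and feed the resulting triplets to Matrix::sparseMatrix().

library(data.table)
library(Matrix)

## toy tokenised documents (hypothetical data, purely for illustration)
toks <- list(doc1 = c("let", "big", "dogs", "hunt"),
             doc2 = c("no", "holds", "barred"))

## one row per (document, token), then count features within documents
dt <- data.table(doc     = rep(seq_along(toks), sapply(toks, length)),
                 feature = unlist(toks, use.names = FALSE))
counts <- dt[, .N, by = list(doc, feature)]

## map each feature to a column index and fill a sparse document-feature matrix
feats <- sort(unique(counts$feature))
sparseMatrix(i = counts$doc,
             j = match(counts$feature, feats),
             x = counts$N,
             dimnames = list(docs = names(toks), features = feats))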
In the code below, I use the example texts that come with the quanteda package, which you can (and should!) install from CRAN or as the development version:
devtools::install_github("kbenoit/quanteda")
I would be curious to hear how it works on your 4M documents. In my experience with corpora of that size, it works quite well (provided you have enough memory).

Note that in all of my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.
THE CORE OF THE dfm() FUNCTION

Here is the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing tokenised texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents or on corpus objects, and by default implements lowercasing, removal of numbers and removal of spacing (all of which can be changed if wished).
require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep="")) else
        names(docIndex) <- names(x)

    # one row per (document, token)
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n" := 1L]
    # count each feature within each document
    alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

    # index the unique features
    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)

    # join the counts to the feature index
    setkey(alltokens, features)
    setkey(featureTable, features)
    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    # plug the (doc, feature, count) triplets straight into a sparse matrix
    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
That's just a snippet, but the full source code is easy to find on the GitHub repo (dfm-main.R).

And to keep things simple, how about this?
require(quanteda)

mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.
# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
#        features
# docs    bar big child dog hold honor hunt let student
#   text1   0   1     0   1    0     0    1   1       0
#   text2   1   0     0   0    1     0    0   0       0
#   text3   0   0     1   0    0     1    0   0       1
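If you want a rough, hypothetical sense of how the same call scales towards your data size (this check is my addition, not part of the original answer), you could replicate the three sentences and time it:

## hypothetical scaling check: 300,000 short documents (timings will vary by machine)
bigtext <- rep(mytext, 100000)
system.time(dfm(bigtext, ignoredFeatures = stopwords("english"), stem = TRUE))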
Tyl*_*ker 12
You might want to consider a more regex-focused solution. These are some of the issues/thoughts I am wrestling with as a developer. I am currently looking heavily at the stringi package for development, because it has some consistently named functions that are wickedly fast for string manipulation.

In this response I attempt to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix and manipulating it with specific packages meant to handle that format. I took your example and multiplied it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
    text = c("Let the big dogs hunt",
             "No holds barred",
             "My child is an honor student"),
    stringsAsFactors = F)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]

library(stringi)
library(SnowballC)

## stem, lowercase and split each text into words
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english"))) # in old package versions it was named 'stri_extract_words'
names(out) <- paste0("doc", 1:length(out))

## tabulate each document against the full vocabulary
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

## drop stopwords
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

## convert to a sparse triplet matrix and then to tm's classes
library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting = weightTf)
dtm
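To eyeball the result without printing the whole matrix, one option (my addition, with arbitrary slice indices) is to inspect a small slice:

## peek at a few terms and documents rather than printing the full matrix
inspect(tdm[1:5, 1:3])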