A more efficient way to create a corpus and DTM with 4M rows

use*_*388 13 r corpus term-document-matrix qdap data.table

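My file has over 4M rows, and I need a more efficient way to convert my data into a corpus and a document-term matrix so that I can pass it to a Bayesian classifier.

Consider the following code: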

library(tm)

# Build a cleaned corpus from a character vector of documents
GetCorpus <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  # re-wrap the results as PlainTextDocument objects before building the DTM
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}

# three short example documents
data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = FALSE)

corp <- GetCorpus(data[,1])

inspect(corp)

dtm <- DocumentTermMatrix(corp)

inspect(dtm)

Output:

> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt

[[2]]
<<PlainTextDocument (metadata: 7)>>
 holds bar

[[3]]
<<PlainTextDocument (metadata: 7)>>
 child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1

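My question is: what can I use to create the corpus and DTM faster? It seems to be extremely slow once I use more than 300k rows.

I have heard that I could use data.table, but I am not sure how.

I have also looked at the qdap package, but it gives me an error when I try to load it, and I don't even know whether it would work.

Reference: http://cran.r-project.org/web/packages/qdap/qdap.pdf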

Ken*_*oit 16

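Which approach?

data.table is definitely the right way to go. Regex operations are too slow, although the ones in stringi are much faster (in addition to being much better).

I went through many iterations of solving this problem while creating quanteda::dfm() for my quanteda package (see the GitHub repo). The fastest solution, by far, is to use the data.table and Matrix packages to index the documents and tokenized features, count the features within documents, and insert the result straight into a sparse matrix.

In the code below, I've taken as an example the texts that ship with the quanteda package, which you can (and should!) install from CRAN or as the development version from GitHub: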

devtools::install_github("kbenoit/quanteda")

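I'd be curious to see how it works on your 4M documents. Based on my experience working with corpora of that size, it will work pretty well (if you have enough memory).

Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelization, because of the way they are written in C++.

Core of the quanteda dfm() function

Here are the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents, or on corpus objects, and by default implements lowercasing, removal of numbers, and removal of spacing (but these can all be modified if wished).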

require(data.table)
require(Matrix)

# Build a sparse document-feature matrix from a list of character vectors
# (one vector of tokens per document)
dfm_quanteda <- function(x) {
    docIndex <- seq_along(x)
    if (is.null(names(x))) 
        names(docIndex) <- paste("text", seq_along(x), sep="") else
            names(docIndex) <- names(x)

    # one row per token: document index and feature (token)
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n":=1L]
    # count each feature within each document (the sum lands in column V1)
    alltokens <- alltokens[, by=list(docIndex,features), sum(n)]

    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)

    # assign a column index to each unique feature
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)
    setkey(alltokens, features)
    setkey(featureTable, features)

    # keyed join: attach featureIndex to every (document, feature) count
    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1L, 0L)]

    # insert the counts directly into a sparse matrix
    sparseMatrix(i = alltokens$docIndex, 
                 j = alltokens$featureIndex, 
                 x = alltokens$V1, 
                 dimnames=list(docs=names(docIndex), features=uniqueFeatures))
}

require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
##  user  system elapsed 
## 0.060   0.005   0.064 

This is just a snippet, of course, but the full source code is easily found in the GitHub repo (dfm-main.R).
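For readers who want to see the index-count-insert idea without data.table, here is a minimal sketch that uses only the Matrix package. The helper name tokens_to_sparse() and its "text1", "text2", ... default document names are illustrative assumptions, not part of quanteda. It relies on the documented behavior of sparseMatrix() of summing the x values of repeated (i, j) pairs, so no explicit counting step is needed.

```r
library(Matrix)

# Hypothetical helper (not part of quanteda): turn a list of token vectors
# into a sparse document-feature matrix of counts.
tokens_to_sparse <- function(tokens) {
  # document index and feature for every single token
  doc_id  <- rep(seq_along(tokens), lengths(tokens))
  feature <- unlist(tokens, use.names = FALSE)

  # drop blank tokens
  keep    <- feature != ""
  doc_id  <- doc_id[keep]
  feature <- feature[keep]

  vocab <- sort(unique(feature))
  doc_names <- if (is.null(names(tokens))) {
    paste0("text", seq_along(tokens))
  } else {
    names(tokens)
  }

  # repeated (i, j) pairs are summed, which yields the per-document counts
  sparseMatrix(i = doc_id,
               j = match(feature, vocab),
               x = rep(1, length(doc_id)),
               dims = c(length(tokens), length(vocab)),
               dimnames = list(docs = doc_names, features = vocab))
}

# usage on the three example documents from the question
mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
m <- tokens_to_sparse(strsplit(tolower(mytext), " ", fixed = TRUE))
m
```

Passing dims explicitly keeps documents with no tokens as all-zero rows. This sketch does no stop word removal or stemming, and it has not been benchmarked against the data.table version above.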

quanteda on your example

How's this for simplicity?

require(quanteda)
mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.

# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
# features
# docs    bar big child dog hold honor hunt let student
# text1   0   1     0   1    0     0    1   1       0
# text2   1   0     0   0    1     0    0   0       0
# text3   0   0     1   0    0     1    0   0       1

  • This is a nice speedup. All else being equal, I'd encourage the OP to move the accepted check mark to this solution. (2 upvotes)

Tyl*_*ker 12

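I think you may want to consider a more regex-focused solution. These are some of the problems and ideas I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has consistently named functions that are very fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix and manipulating it with specific packages meant to handle that format. I take your example and multiply it 100000x. Even with stemming, this takes 17 seconds on my machine.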

data <- data.frame(
    text=c("Let the big dogs hunt",
        "No holds barred",
        "My child is an honor student"
    ), stringsAsFactors = FALSE)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]

library(stringi)
library(SnowballC)
## note: wordStem() treats each element of its input as one word, so stemming
## whole strings (as here) does not stem each token separately
## (in old stringi versions the function was named 'stri_extract_words')
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
names(out) <- paste0("doc", 1:length(out))

lev <- sort(unique(unlist(out)))
## one column of term counts per document
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- lev

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ] 

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
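SnowballC::wordStem() stems each element of its input as a single word, so per-token stems require tokenizing first. The following is a minimal sketch of that ordering, not the code timed above. It stems the unique vocabulary once and maps the stems back onto the tokens, which is one possible way to avoid repeated stemmer calls on a large corpus; the variable names are illustrative.

```r
library(stringi)
library(SnowballC)

texts <- c("Let the big dogs hunt",
           "No holds barred",
           "My child is an honor student")

# 1. lowercase and tokenize: one character vector of words per document
tokens <- stri_extract_all_words(stri_trans_tolower(texts))

# 2. stem every distinct word once and keep a word -> stem lookup table
vocab    <- unique(unlist(tokens))
stem_map <- setNames(wordStem(vocab, language = "english"), vocab)

# 3. map each document's tokens to their stems via the lookup table
stems <- lapply(tokens, function(x) unname(stem_map[x]))
names(stems) <- paste0("doc", seq_along(stems))

stems
```

The resulting list has the same shape as out above, so the tabulation and tm conversion steps could be applied to it unchanged. Whether this is faster on 4M rows would need to be measured.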