ant*_*ell 6 r document-classification sentiment-analysis tm
My code below works fine unless I use it to create a DocumentTermMatrix with more than 3000 terms. This line:
movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))
fails with:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
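The first warning comes from `mclapply`, which reports only that the worker processes failed and hides the underlying error. A minimal sketch for surfacing it, assuming the installed tm version calls `parallel::mclapply` with its default core count (read from the `mc.cores` option), is to force a serial run before rebuilding the matrix:

```r
# Assumption: tm dispatches through parallel::mclapply, whose default
# core count is getOption("mc.cores", 2L). With mc.cores = 1 the work runs
# in the current process, so the original error is raised directly
# instead of being wrapped in "all scheduled cores encountered errors".
old_opts <- options(mc.cores = 1)

# ... re-run the failing DocumentTermMatrix() call here ...

# Restore the previous setting afterwards.
# options(old_opts)
```

This does not by itself fix the failure; it is only meant to make the real error message visible.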
Is there any way to fix this? Is a 3000 x 60000 matrix too large for DocumentTermMatrix? That seems fairly small for document classification.
Full code snippet:
library(tm)  # provides Corpus, tm_map, DocumentTermMatrix, findFreqTerms
n1 <- 60000
n2 <- 70000
#******* loading the data ******************************************
#kaggle sentiment_analysis dataset
movie_all <- read.delim('train.tsv', stringsAsFactors=FALSE)
movie_raw <- movie_all[1:(n2),]
#******* cleaning the corpus ***************************************
movie_corpus <- Corpus(VectorSource(movie_raw$Phrase))
movie_corpus_clean <- tm_map(movie_corpus, content_transformer(tolower))
movie_corpus_clean <- tm_map(movie_corpus_clean, removeNumbers)
movie_corpus_clean <- tm_map(movie_corpus_clean, removeWords, stopwords())
movie_corpus_clean <- tm_map(movie_corpus_clean, removePunctuation)
movie_corpus_clean <- tm_map(movie_corpus_clean, stripWhitespace)
movie_dtm <- DocumentTermMatrix(movie_corpus_clean)
#*********** break out data into train/test sets *******************
movie_train <- movie_raw[1:(n1),]
movie_corpus_train <- movie_corpus_clean[1:(n1)]
movie_dtm_train <- movie_dtm[1:(n1),]
#*********** remove rare words from document term matrix ***********
movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))
Edit: this fails:
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train[1:60000], list(dictionary = movie_dict))
But this works:
d1 <- DocumentTermMatrix(movie_corpus_train[1:30000], list(dictionary = movie_dict))
d2 <- DocumentTermMatrix(movie_corpus_train[30001:60000], list(dictionary = movie_dict))
movie_dtm_hiFq_train <- c(d1, d2)
This leads me to believe it must be a size problem.
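The two-chunk workaround can be generalized to any number of chunks. The following is only a sketch: `chunk_indices` and `build_dtm_chunked` are hypothetical helper names, the chunk size of 30000 is simply the value that happened to work above, and combining the pieces relies on `c()` working on DocumentTermMatrix objects as it does in the snippet above.

```r
# Split 1..n into consecutive, non-overlapping index chunks of at most `size`.
chunk_indices <- function(n, size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / size)))
}

# Hypothetical helper: build one DocumentTermMatrix per chunk and
# combine them with c(). Requires the tm package to be installed.
build_dtm_chunked <- function(corpus, dictionary, size = 30000) {
  chunks <- chunk_indices(length(corpus), size)
  parts <- lapply(chunks, function(i) {
    tm::DocumentTermMatrix(corpus[i], control = list(dictionary = dictionary))
  })
  Reduce(c, parts)
}
```

With the objects from the question, this would be called as `movie_dtm_hiFq_train <- build_dtm_chunked(movie_corpus_train, movie_dict)`. It sidesteps the failure without explaining it, so the underlying cause is still an open question.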