I have the following code:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
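When I run the DocumentTermMatrix() call, it gives me this error:
Error: inherits(doc, "TextDocument") is not TRUE
Why am I getting this error? Are my rows not text documents?
Here is the output when inspecting corpus_clean: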
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]] …

I am trying to run this code (Ubuntu 12.04, R 3.1.1):
# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)
# Place Enron email snippets into a single vector.
text <- c(
"To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
"while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
"you sold $101 million worth …