use*_*137 32 r text-mining tm
I am trying to run this code (Ubuntu 12.04, R 3.1.1):
# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)
# Place Enron email snippets into a single vector.
text <- c(
"To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
"while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
"you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
"This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
"Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
"Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
"We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
"The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
"In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)
# Prepare mini-Enron corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus # check corpus
# Mini-Enron corpus with 9 text documents
# Compute a term-document matrix that contains the occurrence of terms in each email
# Compute distance between pairs of documents and scale the multidimensional semantic space (MDS) onto two dimensions
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
dist.mat # check distance matrix
# Compute distance between pairs of documents and scale the multidimensional semantic space onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) + geom_point(data = points, aes(x = x, y = y, color = df$view)) + geom_text(data = points, aes(x = x, y = y - 0.2, label = row.names(df)))
However, when I run it, I get this error (at the line td.mat <- as.matrix(TermDocumentMatrix(corpus))):
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
I don't know what to look at; all the packages loaded fine.
MrF*_*ick 90
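The latest version of tm (0.60) made it so you can no longer use functions with tm_map that operate on simple character values. So the problem is your tolower step, since that isn't a "canonical" transformation (see getTransformations()). Just replace it with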
corpus <- tm_map(corpus, content_transformer(tolower))
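The content_transformer function wrapper converts everything to the correct data type within the corpus. You can use content_transformer with any function that is intended to manipulate character vectors so that it will work in a tm_map pipeline.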
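As a minimal sketch of that last point, the snippet below wraps both tolower and a custom character-vector function with content_transformer. The helper replace_amounts and the two shortened sample sentences are made up for illustration and are not part of the original question; VCorpus is used here on the assumption that tm 0.6 or later is installed.

```r
library(tm)

# Hypothetical helper (not from the original post): replaces dollar
# amounts such as "$101" with the placeholder token "money".
replace_amounts <- function(x) gsub("\\$[0-9]+", "money", x)

docs <- c("You sold $101 million worth of Enron stock.",
          "Enron booked a $500 million gain.")
corpus <- VCorpus(VectorSource(docs))

# Wrap plain character functions so tm_map keeps each element a document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(replace_amounts))

# Check that every element is still a text document, not a bare string
doc_classes <- vapply(seq_len(length(corpus)),
                      function(i) inherits(corpus[[i]], "TextDocument"),
                      logical(1))
first_text <- content(corpus[[1]])

# The step that failed in the question now has documents to work with
td.mat <- as.matrix(TermDocumentMatrix(corpus))
```

Because the documents keep their class, the TermDocumentMatrix call has the metadata it needs instead of hitting a bare character vector.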
Pau*_*der 29
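This is a bit old, but just for the purposes of later Google searches: there is an alternative solution. After corpus <- tm_map(corpus, tolower) you can use corpus <- tm_map(corpus, PlainTextDocument), which takes it right back to the correct data type.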
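A rough sketch of this alternative, using two shortened sample sentences that are not from the original question and assuming tm 0.6 or later with a VCorpus:

```r
library(tm)

docs <- c("This is a reminder of Enron's Email retention policy.",
          "The Raptor accounting treatment looks questionable.")
corpus <- VCorpus(VectorSource(docs))

# tolower returns bare character vectors, so the documents lose their class
corpus <- tm_map(corpus, tolower)

# Re-wrap each character vector as a PlainTextDocument
corpus <- tm_map(corpus, PlainTextDocument)

rewrapped <- inherits(corpus[[1]], "PlainTextDocument")
first_text <- content(corpus[[1]])
```

Note that PlainTextDocument builds new documents from the text alone, so this route is probably best suited to cases where per-document metadata does not matter.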