I'm trying to create a wordcloud from Twitter data, but I get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '????????????????????????? "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
The error appears after running the line "mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)". The full code is:
mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
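I read on other pages that this may be caused by R having difficulty processing symbols, emoticons, and letters from non-English languages, but that does not appear to be the problem with the "error tweets" R is choking on. I did try running these lines: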
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
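Neither helped. I also get an error saying the function content_transformer cannot be found, even though the tm package is checked and loaded.
I'm running this on OS X 10.6.8 with the latest RStudio.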
RUs*_*ser 10
I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
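To illustrate what that iconv call does, here is a minimal base-R sketch. The vector tweet_text is a hypothetical stand-in for tweets$text, and the stray byte in its second element is an assumption chosen to mimic a badly encoded tweet. The sketch is not a drop-in fix for every encoding problem.

```r
# Hypothetical sample data standing in for tweets$text. The second element
# contains a raw byte (0xE9) that is not valid UTF-8 on its own.
tweet_text <- c("Happy New Year!", "caf\xe9 time http://t.co/56Fb78aTSC")

# Base R (>= 3.3.0) can flag which strings are not valid UTF-8.
print(validUTF8(tweet_text))

# Same arguments as in the answer: read the input as latin1, convert to
# ASCII, and drop (sub = "") every character that ASCII cannot represent.
clean_text <- iconv(tweet_text, from = "latin1", to = "ASCII", sub = "")

# After cleaning, every string is plain ASCII, so tolower() has nothing
# left to reject.
print(tolower(clean_text))
```

Note that sub = "" discards every non-ASCII character, including legitimate accented letters and emoji, so this trades information for robustness.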
小智 2
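There is a good example of creating a wordcloud from Twitter data here. Using that example and the code below, and passing the tolower option when creating the TermDocumentMatrix, I was able to create a Twitter wordcloud.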
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c("new", "year", stopwords("english")),
                                        removeNumbers = TRUE,
                                        tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
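For readers without Twitter API access, the counting step above can be approximated in base R. This is only a rough sketch of the idea, not how tm builds a TermDocumentMatrix. The sample text in tweet_txt and the short stop_words list are hypothetical stand-ins for the tweets and for stopwords("english").

```r
# Hypothetical stand-in for the text returned by x$getText().
tweet_txt <- c("Happy New Year, everyone!",
               "New year, new goals: read more books.",
               "Books and more books this year")

# Hypothetical stop-word list; the answer uses stopwords("english") from tm.
stop_words <- c("new", "year", "and", "this", "more")

# Lowercase, replace punctuation and digits with spaces, split on whitespace.
clean <- gsub("[[:punct:][:digit:]]+", " ", tolower(tweet_txt))
tokens <- unlist(strsplit(clean, "[[:space:]]+"))

# Drop empty strings and stop words.
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Count occurrences and sort in decreasing order, like sort(rowSums(m)).
word_freqs <- sort(table(tokens), decreasing = TRUE)

# Same shape as the data frame passed to wordcloud() above.
dm <- data.frame(word = names(word_freqs),
                 freq = as.integer(word_freqs),
                 stringsAsFactors = FALSE)
print(dm)
```

The resulting dm has the same word and freq columns as the data frame in the answer, so it could be passed to wordcloud() in the same way.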
