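I use this function in my script, with R's text mining package (tm), to remove URLs from tweets. To my surprise, after cleaning there were leftover "http" words as well as fragments of the URLs themselves (e.g. t.co). It looks like some URLs were removed completely, while others were merely broken into their components. What could be the reason? Note: I took the "." out of the t.co URLs below, because Stack Overflow does not allow posting URLs that point to t.co addresses.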
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
Text before cleaning
VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ
Text after cleaning
vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq
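You are removing the symbols that the removeURL function looks for (the "/" characters) before it runs. In addition, you need to make sure you create a proper transformer function with content_transformer(). Here is a working example with a different regular expression for removing URLs (it stops at whitespace):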
library(tm)
test<-"VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"
trumpcorpus1020to1109 <- VCorpus(VectorSource(test))
# Remove URLs first, while "://" and "/" are still intact
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl = TRUE))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
content(trumpcorpus1020to1109[[1]])
# [1] "VOTE TODAY! Go to to find your polling location. We are going to Make America Great Again!… "
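To see the two effects in isolation, here is a minimal base R sketch that does not need tm. The sample string and the variable names (`tweet`, `greedy_pattern`, `bounded_pattern`, and so on) are made up for illustration and only modeled on the test string above. It compares the question's greedy pattern with the whitespace-bounded one, and then shows what happens when "/" is replaced before the URL pattern is applied.

```r
# Illustrative sample only: two URLs with ordinary text between them
tweet <- "Go to https://t.com/KPQ5EY9VwQ to find your polling location. Vote! https://t.com/KPQ5EY9VwQ"

greedy_pattern  <- "(f|ht)tp(s?)://(.*)[.][a-z]+"  # pattern from the question
bounded_pattern <- "(f|ht)tp(s?)://\\S+"           # pattern from the answer

# Greedy ".*" runs from the first "://" to the last ".<letters>" in the string,
# so it can swallow the text between two URLs and still leave the final path behind
greedy_result <- gsub(greedy_pattern, "", tweet, perl = TRUE)

# "\\S+" stops at the first whitespace, so each URL is removed on its own
bounded_result <- gsub(bounded_pattern, "", tweet, perl = TRUE)

# Replacing "/" first destroys "://", so even the bounded pattern finds nothing
slashes_first <- gsub(bounded_pattern, "", gsub("/", " ", tweet), perl = TRUE)

print(greedy_result)
print(bounded_result)
print(slashes_first)
```

Printing the three results side by side should make it clear why the URL removal has to happen before any step that strips "/" or other punctuation.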