gsub function in the tm package does not remove the entire URL string

ido*_*eus 2 r tm

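I use this function in a script with R's text mining package (tm) to strip URLs from tweets. To my surprise, after cleaning there are leftover "http" words as well as fragments of the URLs themselves (e.g. t.co). It looks like some URLs are removed completely while others are only broken into their components. What could be the reason? Note: I took the . out of the t.co URLs below, because StackOverflow does not allow posting URLs that point to t.co addresses.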

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)

Text before cleaning

VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ

Text after cleaning

vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq

MrF*_*ick 7

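You are removing the symbols that the removeURL function looks for: the toSpace call that replaces "/" runs first, so by the time removeURL runs there is no "://" left for its pattern to match. Also, you need to make sure you create a proper transformer function with content_transformer(). Here is a working example that removes the URLs before the other symbols and uses a different regular expression for the URLs (it stops at whitespace):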

library(tm)
test<-"VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"

trumpcorpus1020to1109 <- VCorpus(VectorSource(test))
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl = TRUE))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
content(trumpcorpus1020to1109[[1]])
# [1] "VOTE TODAY! Go to  to find your polling location. We are going to Make America Great Again!… "
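To see why the order of the two steps matters, here is a minimal base R sketch that leaves tm out entirely and applies the same gsub calls to a plain string. The sample tweet and the example.com address are made up for illustration, and the names slash_first and url_first are hypothetical. It is only meant to show the ordering effect, not to replace the tm pipeline.

```r
# Sample text; the URL is a placeholder, not one from the original tweets.
tweet <- "Go to https://example.com/KPQ5EY9VwQ to find your polling location."

# URL pattern from the answer: scheme, "://", then everything up to whitespace.
url_pattern <- "(f|ht)tp(s?)://\\S+"

# Order used in the question: replace "/" first, then try to remove URLs.
# After the first step there is no "://" left, so the pattern cannot match.
slash_first <- gsub("/", " ", tweet, fixed = TRUE)
slash_first <- gsub(url_pattern, "", slash_first, perl = TRUE)

# Order used in the answer: remove URLs first, then replace "/".
url_first <- gsub(url_pattern, "", tweet, perl = TRUE)
url_first <- gsub("/", " ", url_first, fixed = TRUE)

print(slash_first)  # still contains "https" and the URL fragments
print(url_first)    # URL is gone
```

With the slash replaced first, the URL survives as separate tokens, which later punctuation and case cleaning would turn into the stray "https" and "tco" words seen in the question.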