使用regex将URL解压缩到新的数据框列中

lmc*_*ane 4 regex r grepl

我想使用正则表达式从数据框中的文本中提取所有URL到新列.我有一些旧的代码,我用来提取关键字,所以我想调整代码为正则表达式.我想将正则表达式保存为字符串变量并在此处应用:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))
Run Code Online (Sandbox Code Playgroud)

似乎fixed=FALSE应该告诉grepl它是一个正则表达式,但R不喜欢我试图将正则表达式保存为:

regex <- "http.*?1-\\d+,\\d+"
Run Code Online (Sandbox Code Playgroud)

我的数据组织在这样的数据框中:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)
Run Code Online (Sandbox Code Playgroud)

希望看起来像:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013                        
Run Code Online (Sandbox Code Playgroud)

hrb*_*str 19

Hadleyverse解决方案(stringr包)具有不错的URL模式:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>
Run Code Online (Sandbox Code Playgroud)

你可以使用,str_extract_all如果有倍数Content,但这将涉及到你之后的一些额外处理.