One variable in my dataset contains the URLs of Google search result pages. I want to extract the search keywords from those URLs.
Example dataset:
keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")),
.Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))
So far, I have been able to extract the search keyword part from the URLs:
library(stringr)  # provides str_extract_all()
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),paste, collapse=",")
However, this still does not give me the result I want. The code above gives the following result:
> keyw$words
[1] "q=high+five"
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"
[4] "q=five+fingers"
[5] "q=five+fingers,q=five+short+fingers+"
[6] ""
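There are three problems with this output:
1. I need the words as a comma-separated string: instead of q=high+five, I need high,five.
2. As rows 2, 3 and 5 show, a URL sometimes contains two q= parts. In those cases I only need the last one.
3. When a URL contains no search keywords, the result should be NA.
The desired result should be: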
> keyw$words
[1] "high,five"
[2] "high,five,with,handshake"
[3] "high,five,with,a,chair"
[4] "five,fingers"
[5] "five,short,fingers"
[6] NA
How can I solve this?
Ten*_*bai 11
Another update after the comments (it looks too complicated, but it is the best I can manage at this point :)):
keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair" "five,fingers"
[5] "five,short,fingers" NA
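The change is in the input to str_extract_all: instead of the full vector, it now receives the vector filtered by a regex match (str_extract returns NA for entries that do not match). Any regex can go there, to match more or less precisely what you want.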
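The regex here is:
http : literally http
s? : 0 or 1 s
: : a literal colon
[/]{2} : exactly two slashes (using a character class avoids the ugly \\/ construct and makes things more readable)
[^/]* : any number of non-slash characters
google.*[/] : literally google, followed by anything up to the last /
.* : finally, anything or nothing after the last slash. Replace * with + here to make sure there is a parameter (+ requires the preceding character to appear at least once).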
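To see what the filter step does on its own, here is a minimal sketch in base R. The URLs in `urls` are made up for illustration (only the YouTube one comes from the question), and `grepl(..., perl = TRUE)` stands in for stringr's `str_extract`, which is an assumption of this sketch rather than what the answer uses.

```r
# Same host filter as in the answer, applied with base R instead of stringr.
google_pat <- "https?:[/]{2}[^/]*google.*[/].*"

# Illustrative URLs (hypothetical, except the YouTube one).
urls <- c(
  "https://www.google.nl/search?q=high+five",
  "https://www.youtube.com/watch?v=6HOallAdtDI",
  "https://example.com/search?q=google+maps"
)

# [^/]* cannot cross a slash, so "google" must appear in the host part.
is_google <- grepl(google_pat, urls, perl = TRUE)

# Mimic str_extract(): keep matching URLs, turn the rest into NA.
filtered <- ifelse(is_google, urls, NA_character_)
print(filtered)
```

The third URL has a q= parameter that even mentions google, but it is dropped because the host itself does not contain google.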
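Update inspired by @BrodieG: this returns NA when there is no match, but it will still match any site that has a q= parameter.
Still using the same approach: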
> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers" NA
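The lookbehind (?<=) makes sure there is a q= or a + immediately before the word, and the negative lookahead (?!) makes sure no further q= appears before the end of the line.
The character class excludes the + sign, so each match stops at the end of a word.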
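The lookaround behaviour can be checked without stringr. The following is a small sketch using base R's `gregexpr` with `perl = TRUE`, which is my substitution for `str_extract_all`; the first URL is a shortened, made-up variant of the ones in the question.

```r
# Lookbehind: word must follow "q=" or "+".
# Negative lookahead: no later "q=" may follow, so only the last query matches.
word_pat <- "(?<=q=|\\+)([^$+#&]+)(?!.*q=)"

urls <- c(
  "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
)

# regmatches() yields character(0) for URLs without any match.
hits <- regmatches(urls, gregexpr(word_pat, urls, perl = TRUE))

# Collapse each set of words with commas, or return NA when nothing matched.
words <- sapply(hits, function(x) {
  if (!length(x)) NA_character_ else paste(x, collapse = ",")
})
print(words)
```

The words of the first q= part are rejected because the lookahead still sees a later q=, so only the words of the last query survive.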
Maybe this:
gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", keyw$url))
# [1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
# [4] "five,fingers" "five,short,fingers"
Update (borrowing part of the regex from David):
dat <- as.character(keyw$url)
pat <- "^https://www\\.google\\.nl/.*\\bq=([^&]*[^&+]).*"
sapply(
regmatches(dat, regexec(pat, dat)),
function(x) if(!length(x)) NA else gsub("\\+", ",", x[[2]])
)
Produces:
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers" NA
Use:
pat <- "^https://www\\.google.(?:com?.)?[a-z]{2,3}/.*\\b?q=([^&]*[^&+]).*"
to account for all country-specific Google domains (source).
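As a quick way to check which hosts such a pattern accepts, here is a sketch that tests only the domain part. Note that `host_pat` is my own variant and not the answer's pattern: it escapes the literal dots, allows http as well as https, and is run with `perl = TRUE`. The URLs are illustrative.

```r
# Hypothetical host-only pattern (not the answer's): literal dots are escaped.
# (?:com?\\.)? covers prefixes such as "co." and "com." before the country code.
host_pat <- "^https?://www\\.google\\.(?:com?\\.)?[a-z]{2,3}/"

# Illustrative URLs.
urls <- c(
  "https://www.google.nl/search?q=x",
  "https://www.google.co.uk/search?q=x",
  "https://www.google.com.br/search?q=x",
  "https://www.google.com/search?q=x",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
)

is_google <- grepl(host_pat, urls, perl = TRUE)
print(is_google)
```

This only checks the shape of the domain. It does not verify that the suffix is a real Google country domain.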
Or:
gsub("\\+", ",", sub("^.*\\bq=([^&]*).*", "\\1", keyw$url))
Produces:
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers,"
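Here we rely on greediness to make sure we skip everything up to the last q=... part, then use the standard sub / \\1 trick to capture what we want. Finally, we replace + with ,.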
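The greedy approach can be wrapped into a small function that also deals with the trailing comma and with URLs that have no q= at all. This is only a sketch: `extract_last_query` is a hypothetical helper name, the trailing-plus trimming and the `grepl` check are my additions, and `perl = TRUE` is used throughout. Like the one-liner above, it accepts any site with a q= parameter, not just Google.

```r
# Hypothetical helper: keep only the last q=... part of each URL.
extract_last_query <- function(url) {
  url <- as.character(url)                # factors -> character
  has_q <- grepl("\\bq=", url, perl = TRUE)
  # Greedy ^.* runs to the last "q=", so \\1 is the last query.
  words <- sub("^.*\\bq=([^&]*).*", "\\1", url, perl = TRUE)
  words <- sub("[+]+$", "", words)        # drop trailing "+" (added step)
  words <- gsub("[+]", ",", words)        # "+" separates the words
  words[!has_q] <- NA_character_          # no q= at all -> NA (added step)
  words
}

urls <- c(
  "https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
)
res <- extract_last_query(urls)
print(res)
```

Without the `has_q` check, sub() would return the YouTube URL unchanged, because it leaves non-matching strings as they are.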