I have built a function that takes a URL, scrapes the page, and returns the desired result. The function is as follows:
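library(httr)
library(curl)
library(rvest)
library(dplyr)

sd_cat <- function(url){
  # Fetch the page with a custom user agent and extract the breadcrumb text.
  # Note: the name `cat` shadows base::cat inside this function.
  cat <- curl(url, handle = new_handle("useragent" = "myua")) %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    html_text()
  # Keep only the first match (NA if the node was not found).
  x <- cat[1]
  #y <- gsub(pattern = "\n", x=x, replacement = " ")
  # Replace tabs with spaces, then drop digits and commas.
  y <- gsub(pattern = "\t", x = x, replacement = " ")
  y <- gsub("\\d|,|\t", x = y, replacement = "")
  # Strip leading/trailing spaces and collapse consecutive spaces.
  y <- gsub("^ *|(?<= ) | *$", "", y, perl = TRUE)
  # Newline cleanup (pattern kept as in the original post).
  z <- gsub("\n*{2,}", "", y)
  # Turn the remaining runs of spaces into ">" separators.
  z <- gsub(" {2,}", ">", z)
  # Drop the first and the last character.
  final <- substring(z, 2)
  final <- substring(final, 1, nchar(final) - 1)
  final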
#sample discontinued url: "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261" …
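
The string cleanup in `sd_cat` can be tried offline, without hitting the site. Below is a minimal sketch that swaps the chain of `gsub` calls for a split-and-join approach. `clean_breadcrumb` is a hypothetical helper name (it is not in the post above), and the sample text is made up to resemble breadcrumb output rather than copied from a real page. It also returns `NA` when there is nothing to clean, which is what `cat[1]` yields when `html_nodes()` matches no node.

# Hypothetical helper (not part of the original post): turns raw breadcrumb
# text into an "A>B>C" string by splitting on newline/tab runs.
clean_breadcrumb <- function(x) {
  # Use the first element only; character(0)[1] is NA, just like cat[1].
  x <- x[1]
  if (is.na(x)) return(NA_character_)
  # Drop digits and commas, mirroring gsub("\\d|,|\t", ...) in sd_cat.
  x <- gsub("[0-9,]", "", x)
  # Split wherever one or more newlines or tabs occur.
  parts <- strsplit(x, "[\n\t]+")[[1]]
  # Trim each piece and collapse runs of spaces inside it.
  parts <- gsub(" {2,}", " ", trimws(parts))
  # Keep non-empty pieces and join them with ">".
  paste(parts[nzchar(parts)], collapse = ">")
}

# Made-up input; real scraped text may be laid out differently.
sample_text <- "\n\t\tHome\n\t\t\n\t\tPerfumes & Beauty\n\t\t\n\t\tFragrances\n\t"
print(clean_breadcrumb(sample_text))

Because empty pieces are filtered out before joining, this version does not need the two `substring()` calls that remove the first and last character.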