Posts by shi*_*uel

Discontinued URLs when web scraping with rvest (R)

I have built a function that takes a URL, scrapes the page, and returns the result I need. The function is as follows:

library(httr) 
library(curl) 
library(rvest) 
library(dplyr)

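sd_cat <- function(url) {
  # Fetch the page with a custom user agent and extract the breadcrumb text
  cat <- curl(url, handle = new_handle("useragent" = "myua")) %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    html_text()

  x <- cat[1]

  #y <- gsub(pattern = "\n", x=x, replacement = " ")

  # Replace tabs with spaces, then drop digits and commas
  y <- gsub(pattern = "\t", x = x, replacement = " ")
  y <- gsub("\\d|,|\t", x = y, replacement = "")

  # Trim leading and trailing spaces and collapse runs of spaces
  y <- gsub("^ *|(?<= ) | *$", "", y, perl = TRUE)

  # Strip newlines, then turn the remaining multi-space gaps into ">" separators
  z <- gsub("\n*{2,}", "", y)
  z <- gsub(" {2,}", ">", z)

  # Drop the first and last characters
  final <- substring(z, 2)
  final <- substring(final, 1, nchar(final) - 1)

  final
}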

#sample discontinued url: "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261" …
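The post is cut off before the actual question, so this is only a guess at what is being asked. If the goal is to keep a run over many URLs from stopping at a discontinued page, one option is to wrap the scraping call in `tryCatch`. This is a minimal sketch, assuming the failure surfaces as an R error. `safe_scrape` and `fake_scrape` are hypothetical names, not from the post, and `fake_scrape` stands in for `sd_cat` so the example runs offline.

```r
# Hypothetical helper (not from the original post): run a scraping function
# on one URL and return NA instead of stopping when it raises an error.
safe_scrape <- function(url, scrape_fun) {
  tryCatch(
    scrape_fun(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Stand-in for sd_cat so the sketch runs without network access. It fails for
# any URL containing "discontinued", mimicking a page that cannot be fetched.
fake_scrape <- function(url) {
  if (grepl("discontinued", url, fixed = TRUE)) stop("HTTP error 404")
  "Category > Subcategory"
}

# Made-up URLs for illustration only
urls <- c("http://example.com/live", "http://example.com/discontinued")

# One result per URL; failed URLs come back as NA
results <- vapply(urls, safe_scrape, character(1),
                  scrape_fun = fake_scrape, USE.NAMES = FALSE)
```

With the real function, the call would be `safe_scrape(url, sd_cat)`. If the page loads but has no `#breadCrumbWrapper` node, `html_nodes()` returns an empty set and `cat[1]` is already `NA`, so no error is raised in that case.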

r web-scraping rvest

