Tag: rvest

Scraping data with rvest

I am trying to scrape the name of each search result from this page using the following code:

url2 <- "http://www.truckandtrailer.ca/search.cfm?intIndustryID=2&searchtype=advanced&pageaction=showresults&bitNew=0&intCategoryID=30&intMakeID=0&intSelectProvinceID=&x=26&y=6"

library(rvest)

results <- url2 %>%
  read_html() %>%   # html() is deprecated in current rvest; read_html() replaces it
  html_nodes(".desc_title") %>%
  html_text()
results

However, it only returns:

character(0)

Any ideas on how to fix this? Thanks for the help!

r web-scraping rvest

Author

2015 06-14

0
Score

1
Answers

1427
Views

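Web scraping with rvest

I am trying to use rvest to get all 471 cases on this site, but I only get 25 cases each time (whether or not the list is expanded). Any help would be appreciated.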

library("rvest")
url <- "http://investmentpolicyhub.unctad.org/ISDS?status=100"
cases <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cases-list"]') %>%
  html_table()
View(cases)

Thanks.

r web-scraping rvest

Author

lucky-day

0
Score

1
Answers

425
Views

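Web scraping a table across multiple days

I have been web scraping with XML::readHTMLTable, and now I am trying to learn how to scrape at a more granular level. My motivation comes from trying to scrape a table on a website across multiple days when its position changes (for example, yesterday it was the 4th table on the page, today it is the 2nd table, and so on). As an example I will use a website that posts Vegas odds for various sporting events, and I will specifically try to extract the NBA data.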

URL1 = "http://www.scoresandodds.com/grid_20161123.html"
URL2 = "http://www.scoresandodds.com/grid_20161125.html"

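You will notice that the NBA table is the first table at URL1 and the second table at URL2. Knowing that the NBA table comes first, here is how I scrape it for the first URL: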

library(XML)

URL1 = "http://www.scoresandodds.com/grid_20161123.html"
# NBA is the first table on this page
nbaTable = readHTMLTable(URL1)[[1]]
# Keep only the rows above the first blank entry in column 1
exTable = head(nbaTable, which(nbaTable[, 1] == "")[1] - 1)

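Then I would clean it up from there. I know this is not the best approach, especially since I want to loop over multiple days, because of all the cleaning it requires. It would be better to learn how to scrape specific objects within the page's tables.

I have played with rvest a bit, and I know I can get nodes that look like "td.line" for the Vegas lines, but I am trying to select the nodes for one specific table (css = "#nba > div.sport" or something?). I do not necessarily want an answer for this specific example, but learning how to do this example will let me apply the skill to many other situations. Thanks in advance for your help.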
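
One way to scope a selector to a single table is to prefix it with the id of the container, so that the position of the table on the page no longer matters. The sketch below runs on a small inline HTML snippet; the ids `nfl` and `nba` and the cell layout are assumptions made for illustration and are not taken from the actual scoresandodds.com markup.

```r
library(rvest)

# Hypothetical markup: two sport sections whose order may change from day to day.
html <- '
<div id="nfl"><table>
  <tr><td class="team">Lions</td><td class="line">-2.5</td></tr>
</table></div>
<div id="nba"><table>
  <tr><td class="team">Celtics</td><td class="line">-4</td></tr>
  <tr><td class="team">Nets</td><td class="line">210.5</td></tr>
</table></div>'

page <- read_html(html)

# "#nba td.line" matches only the td.line cells nested inside the element with id "nba".
nba_lines <- page %>%
  html_nodes("#nba td.line") %>%
  html_text(trim = TRUE)

# The whole NBA table can be picked the same way, whatever its position on the page.
nba_table <- page %>%
  html_node("#nba table") %>%
  html_table()
```

On a real page the id or class to use would have to be read from the page source or the browser inspector first.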

r css-selectors html-parsing web-scraping rvest

Coo*_*Day

2017 11-15

0
Score

1
Answers

105
Views

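How to extract a download link and download the file in R?

I want to extract the link and automatically download the file for the first record with Type = 'AA'.

I managed to extract the table, but how do I extract the link in the last column for the 'AA' type?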


library(rvest)
library(stringr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
wahis.session <- html_session(url)
r <- wahis.session %>%
  html_nodes(xpath = '//*[@id="fhTable"]') %>%
  html_table(fill = TRUE)
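
html_table() keeps only the text of each cell, so the href values are lost once the table becomes a data frame. A possible approach is to work on the row nodes instead: read the Type cell of each row, keep the matching rows, and pull the href from the link in the last cell. The sketch below uses a small inline table; the column order, the file paths, and the assumption that Type is the second column are hypothetical and would need checking against the real page.

```r
library(rvest)

# Hypothetical stand-in for the filing-history table.
html <- '<table id="fhTable">
  <tr><th>Date</th><th>Type</th><th>Description</th><th>View / Download</th></tr>
  <tr><td>01 Jan 2020</td><td>CS01</td><td>Confirmation statement</td>
      <td><a href="/docs/1.pdf">View PDF</a></td></tr>
  <tr><td>02 Feb 2020</td><td>AA</td><td>Accounts</td>
      <td><a href="/docs/2.pdf">View PDF</a></td></tr>
  <tr><td>03 Mar 2019</td><td>AA</td><td>Accounts</td>
      <td><a href="/docs/3.pdf">View PDF</a></td></tr>
</table>'

page <- read_html(html)

# Data rows only (rows that contain td cells, which skips the header row).
rows <- html_nodes(page, xpath = '//*[@id="fhTable"]//tr[td]')

# Text of the Type cell for every row (assumed to be the second column).
types <- rows %>% html_node("td:nth-child(2)") %>% html_text(trim = TRUE)

# First row whose Type is "AA", then the href of the link in its last cell.
aa_rows <- rows[types == "AA"]
first_link <- aa_rows[1] %>% html_node("td:last-child a") %>% html_attr("href")

# A relative href would still need the site's base URL before calling
# download.file(full_url, destfile, mode = "wb").
```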

r rvest

Jan*_*ane

lucky-day

0
Score

1
Answers

1857
Views

How to scrape a large table from a PHP website using R

I am trying to scrape the table from "https://www.metabolomicsworkbench.org/data/mb_struct_ajax.php".

The code I found online (rvest) does not work:

library(rvest)
url <- "https://www.metabolomicsworkbench.org/data/mb_structure_ajax.php"
A <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="containerx"]/div[1]/table') %>%
  html_table()

A is a "list of 0".

How should I fix this code, or is there a better way?

Thanks in advance.

r web-scraping scrape rvest

cod*_*rer

lucky-day

0
Score

1
Answers

534
Views

Maximizing the browser window with RSelenium

Is there a way to maximize the browser window using RSelenium?

My current code is:

scrape_url <- "https://[...]"

eCaps <- list(firefoxOptions = list(
    args = list('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"')
))
    
rD <- RSelenium::rsDriver(browser="firefox", port=4546L, verbose=F, chromever="87.0.4280.20",
                              extraCapabilities = eCaps)
    
remDr <- rD[["client"]]
remDr$navigate(scrape_url)

selenium r web-scraping rselenium rvest

anp*_*ami

lucky-day

0
Score

1
Answers

417
Views

How to split merged/glued words that have no delimiter using R

I used rvest in R to scrape the text keywords from this article page with the following code:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management

#' start with get the link for the web to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")

It gives me:

> keyW    
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

After removing the word "Keywords" and everything before it from the string with the following line of code:

keyW <- gsub(".*Keywords","", keyW)

The new keyW is:

[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

However, my desired output is this list:

[1] "Physics curriculum" "Turkish education system" "finnish education …
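
The keywords are glued together because html_text() is called on the container, which concatenates the text of every child element. Splitting afterwards on case changes would not separate pairs such as "systemfinnish", so a more robust route may be to select each keyword element separately and let html_text() return a vector. The sketch below uses inline HTML; the inner `div.keyword` elements are an assumption for illustration, and the real class names on the article page would need to be checked in the page source.

```r
library(rvest)

# Hypothetical markup: one child element per keyword inside the container.
html <- '<div class="Keywords u-font-serif">
  <h2>Keywords</h2>
  <div class="keyword"><span>Physics curriculum</span></div>
  <div class="keyword"><span>Turkish education system</span></div>
  <div class="keyword"><span>finnish education system</span></div>
  <div class="keyword"><span>PISA</span></div>
  <div class="keyword"><span>physics achievement</span></div>
</div>'

page <- read_html(html)

# Selecting the children (not the container) yields one string per keyword.
keyW <- page %>%
  html_nodes("div.Keywords div.keyword") %>%
  html_text(trim = TRUE)
```

Because the heading is not matched by the selector, the gsub() step that strips "Keywords" is no longer needed with this approach.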

r text-mining gsub strsplit rvest

Zaw*_*min

2021 01-29

0
Score

1
Answers

93
Views

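Printing/displaying a JPG file in R

Using the rvest package, I am trying to print/display the lego_movie poster in R. I have not managed to do it. Here are my attempts: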

library(rvest)
poster <- lego_movie %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")

## 1st attempt
library(jpeg)
jpeg(poster)
dev.off()

## 2nd attempt
readJPEG(poster)
dev.off()

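I think EBImage has a display function. I could not install that package in R-3.1.2. It shows the warning message: package ‘EBImage’ is not available (for R version 3.1.2).

The bottom line of my question: how can I display a JPEG file in R without using the EBImage package?

A couple of related questions:

Plotting a JPG image using base graphics in R

How do I save a plot as an image on disk?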
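
For context on the two attempts: jpeg() opens a graphics device that writes a file, and readJPEG() only returns the pixel array, so neither draws anything on screen. A minimal sketch of one way to display an image with base graphics follows. It assumes the jpeg package is installed and uses the sample image bundled with that package in place of the poster; a remote image would first have to be saved locally, as noted in the comment.

```r
library(jpeg)

# For a remote poster, download it first, for example:
#   tmp <- tempfile(fileext = ".jpg"); download.file(poster, tmp, mode = "wb")
# Here the sample image shipped with the jpeg package stands in for that file.
img_path <- system.file("img", "Rlogo.jpg", package = "jpeg")
img <- readJPEG(img_path)  # numeric array: height x width (x channels)

h <- dim(img)[1]
w <- dim(img)[2]

# Empty plot in pixel coordinates; asp = 1 keeps the image proportions.
plot(c(0, w), c(0, h), type = "n", axes = FALSE, xlab = "", ylab = "", asp = 1)
rasterImage(img, 0, 0, w, h)
```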

jpeg r rvest

S D*_*Das

2017 05-23

-1
Score

1
Answers

3234
Views

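rvest html() does not recognize the URL

So, I am writing a web scraper in R to search Zillow and find the median home value for each county in Washington State. I am using the rvest package, and here is the code in question: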

URL <- "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
wiki <- html(URL)

#Getting the list of counties in WA
counties <- wiki %>%
  html_nodes(".wikitable td:nth-child(1) a") %>%
  html_text()

#Putting together a list to pull my search terms from
searchTerms <- list()

for(i in 1:length(counties)) {
  searchTerms[[i]] <- paste0(counties[i], ", WA", sep="")
}
searchTerms <- gsub(",", "", searchTerms)
searchTerms <- gsub(" ", "-", searchTerms)

homeValues <- list()

#Getting the HTML for each county using the search terms in the URL,
#eventually it will pull the homeValues …
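
As a side note on the search-term construction above, paste0() and gsub() are vectorized, so the loop and the list can be replaced by a single expression that returns a character vector. The sketch below is a minimal base R illustration with three made-up county names standing in for the scraped `counties` vector.

```r
# Hypothetical sample of the scraped county names.
counties <- c("Adams County", "King County", "Walla Walla County")

# Append the state, drop the comma, then replace spaces with hyphens,
# mirroring the three steps of the original loop.
searchTerms <- paste0(counties, ", WA")
searchTerms <- gsub(",", "", searchTerms, fixed = TRUE)
searchTerms <- gsub(" ", "-", searchTerms, fixed = TRUE)
```

The result is a plain character vector, which can be indexed with `searchTerms[i]` when building each URL.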

r web-scraping rvest

ToT*_*est

2015 09-05

-1
Score

1
Answers

550
Views