Dou*_*hey 5 html r web-scraping
我一直在尝试使用 R 抓取我当地政府的 Power BI 仪表板,但似乎不可能。我从 Microsoft 网站上了解到,无法对 Power BI 仪表板进行 scrable,但我正在通过几个论坛表明这是可能的,但是我正在经历一个循环
我正在尝试Zip Code从此仪表板中抓取选项卡数据:
我从下面给定的代码中尝试了几种“技术”
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
Run Code Online (Sandbox Code Playgroud)
但是我总是character(0)在使用 xpath 并获取div类和 id时得到一个输出,或者甚至{xml_nodeset (0)}在尝试通过html_nodes. 奇怪的是,当我这样做时,它不会显示画面数据的整个 html:
scc_webpage %>%
html_nodes("div")
Run Code Online (Sandbox Code Playgroud)
这将是输出,将我需要的块留空:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
Run Code Online (Sandbox Code Playgroud)
我想问题可能是因为数字在一系列嵌套div属性中?
我试图获得的主要数据是表中显示Zip code, confirmed cases, % total cases, deaths, 的数字% total deaths。
如果这可以在 R 中或可能在 Python 中使用 Selenium 完成,将不胜感激!
问题是您要分析的站点依赖 JavaScript 来运行并为您获取内容。在这种情况下,httr::GET对您没有帮助。
然而,由于手工工作也不是一种选择,我们有 Selenium。
以下是您正在寻找的内容:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
library(selectr)
# using wdman to start a selenium server
selServ <- selenium(
port = 4444L,
version = 'latest',
chromever = '84.0.4147.30', # set this to a chrome version that's available on your machine
)
# using RSelenium to start chrome on the selenium server
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4444L,
browserName = 'chrome'
)
# open a new Tab on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
Run Code Online (Sandbox Code Playgroud)
现在我们已将页面源代码读入 R,这可能是您开始抓取任务时的想法。
从这里开始它是一帆风顺的,仅仅是将该 xml 转换为可用的表:
col_headers <- zipcode_data_table %>%
querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
rownames <- zipcode_data_table %>%
querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
zipcode_data <- zipcode_data_table %>%
querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
map(xml_parent) %>%
unique() %>%
map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
setNames(col_headers) %>%
bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
type_convert(trim_ws = T, na = c(""))
Run Code Online (Sandbox Code Playgroud)
生成的 df 如下所示:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%
Run Code Online (Sandbox Code Playgroud)