Extracting an HTML table from a website in R

Jd *_*aba 1 r html-table rvest

Hi, I want to extract a table from the Premier League website.

I'm using the rvest package, and the code I started with is as follows:

library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")

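I can't find an HTML tag or selector to pass to rvest's html_nodes that extracts the table.

I used a similar approach on "http://admissions.calpoly.edu/prospective/profile.html" and was able to extract the data. The code I used for Cal Poly is as follows: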

library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")

CPadmissions %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()

I got the code above from this YouTube video: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien

Any help with getting the data from fantasy.premierleague.com would be much appreciated. Do I need to use some kind of API?

ali*_*ire 7

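Because the data is loaded with JavaScript, fetching the HTML with rvest alone won't get you what you want. If you use PhantomJS as a headless browser through RSelenium, though, it isn't all that complicated (by RSelenium standards):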

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
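A quick way to check the JavaScript diagnosis is to count matching tables in the unrendered source versus the rendered source. The sketch below uses two made-up inline HTML strings as stand-ins for those two states; the markup is an assumption for illustration, not the real page source. It also shows that the class selector needs a leading dot.

```r
library(rvest)

# Hypothetical stand-in for the page as served, before any JavaScript runs:
# the container exists but the table has not been injected yet.
static_html <- '<html><body>
  <div id="ismr-event-history"></div>
  <script>/* the table is built here, in the browser */</script>
</body></html>'

# Hypothetical stand-in for the same page after a browser has rendered it.
rendered_html <- '<html><body>
  <div id="ismr-event-history">
    <table class="ism-table">
      <tr><th>Gameweek</th><th>Points</th></tr>
      <tr><td>GW1</td><td>34</td></tr>
    </table>
  </div>
</body></html>'

# Note the leading dot: ".ism-table" selects by class, whereas a bare
# "ism-table" would look for an <ism-table> element.
n_static   <- length(html_nodes(read_html(static_html), "table.ism-table"))
n_rendered <- length(html_nodes(read_html(rendered_html), "table.ism-table"))

n_static    # no matching table in the static source
n_rendered  # one matching table once the page has been rendered

# The same selector as in the answer works on the rendered source
tbl <- html_table(html_node(read_html(rendered_html),
                            "#ismr-event-history table.ism-table"))
```

If the count on the real page's static source is zero while the table is visible in a browser, the content is being added client-side and a headless browser (or the site's underlying data endpoint, if one exists) is needed.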

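As always, a little more cleaning is needed, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can still calculate them.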
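The cleaning step could look roughly like the following base-R sketch, which avoids any extra packages. It rebuilds the first four rows of a few columns from the str() output above. The helper name to_number is made up for illustration, and it is an assumption that the arrow reflects movement in Overall_Rank relative to the previous gameweek.

```r
# First four rows of selected columns, copied from the str() output above.
# "\u00a3" is the pound sign, escaped so the script is encoding-independent.
df <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Overall_Points = c("34", "81", "134", "185"),
  Overall_Rank   = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  Value          = c("\u00a3100.0", "\u00a3100.0", "\u00a399.9", "\u00a3100.0"),
  stringsAsFactors = FALSE
)

# Hypothetical base-R stand-in for readr::parse_number():
# drop everything except digits and the decimal point, then convert.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

# Convert every character column in place
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], to_number)

# Assumption: the arrow shows whether Overall_Rank improved (a smaller number)
# or worsened compared with the previous gameweek. The first week has no
# previous value, so it stays NA.
rank_diff <- c(NA, diff(df$Overall_Rank))
df$Change_Previous_Gameweek <- ifelse(rank_diff < 0, "up",
                               ifelse(rank_diff > 0, "down", "same"))

str(df)
```

Stripping non-numeric characters this way also turns "GW1" into 1, which matches what parse_number would do to the Gameweek column; exclude that column first if the labels should be kept.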