Dav*_*ung 1 r html-parsing web-scraping scrape rvest
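I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found that the selector for the table I want is #advanced. However, rvest does not pick it up. Looking at the page source, I noticed that the tables are inside HTML comment tags: <!--
What is the best way to get the tables out of the comment tags? Thanks!
Edit: I am trying to pull the "Advanced" table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none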
You can use the XPath comment() function to select the comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns
df[, 1:15]
##    Rk           Player Age  G   MP  PER   TS%  3PAr   FTr ORB% DRB% TRB% AST% STL% BLK%
##  1  1        Pau Gasol  34 78 2681 22.7 0.550 0.023 0.317  9.2 27.6 18.6 14.4  0.5  4.0
##  2  2     Jimmy Butler  25 65 2513 21.3 0.583 0.212 0.508  5.1 11.2  8.2 14.4  2.3  1.0
##  3  3      Joakim Noah  29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0  1.2  2.6
##  4  4     Aaron Brooks  30 82 1885 14.4 0.534 0.383 0.213  1.9  7.5  4.8 24.2  1.5  0.6
##  5  5    Mike Dunleavy  34 63 1838 11.6 0.573 0.547 0.181  1.7 12.7  7.3  9.7  1.1  0.8
##  6  6       Taj Gibson  29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7  6.9  1.1  3.2
##  7  7   Nikola Mirotic  23 82 1654 17.9 0.556 0.502 0.455  4.3 21.8 13.3  9.7  1.7  2.4
##  8  8     Kirk Hinrich  34 66 1610  6.8 0.468 0.441 0.131  1.4  6.6  4.1 13.8  1.5  0.6
##  9  9     Derrick Rose  26 51 1530 15.9 0.493 0.325 0.224  2.6  8.7  5.7 30.7  1.2  0.8
## 10 10       Tony Snell  23 72 1412 10.2 0.550 0.531 0.148  2.5 10.9  6.8  6.8  1.2  0.6
## 11 11    E'Twaun Moore  25 56  504 10.3 0.504 0.273 0.144  2.7  7.1  5.0 10.4  2.1  0.9
## 12 12   Doug McDermott  23 36  321  6.1 0.480 0.383 0.140  2.1 12.2  7.3  3.0  0.6  0.2
## 13 13    Nazr Mohammed  37 23  128  8.7 0.431 0.000 0.100  9.6 22.3 16.1  3.6  1.6  2.8
## 14 14 Cameron Bairstow  24 18   64  2.1 0.309 0.000 0.357 10.5  3.3  6.8  2.2  1.6  1.1
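
For anyone who wants to try the comment trick without hitting the site, here is a minimal offline sketch. The `page` string is a made-up fragment (the table ids and the two stat rows are only illustrative, not the real page source); it assumes rvest with xml2-based `read_html()` is installed. It first shows that a plain CSS selector finds nothing for the commented-out table, then applies the same comment-reparse pipeline.

```r
library(rvest)

# Hypothetical page fragment (assumption, not the real site markup):
# one visible table and one table wrapped in an HTML comment
page <- '<html><body>
<table id="roster"><tr><th>Player</th><th>Pos</th></tr>
<tr><td>Pau Gasol</td><td>PF</td></tr></table>
<!--
<table id="advanced"><tr><th>Player</th><th>PER</th></tr>
<tr><td>Pau Gasol</td><td>22.7</td></tr>
<tr><td>Jimmy Butler</td><td>21.3</td></tr></table>
-->
</body></html>'

h <- read_html(page)

# The commented-out table is not part of the element tree,
# so selecting it directly yields an empty node set
direct <- h %>% html_nodes('table#advanced')

# Pull the comment text, reparse it as HTML, then select the table
advanced <- h %>%
    html_nodes(xpath = '//comment()') %>%
    html_text() %>%
    paste(collapse = '') %>%
    read_html() %>%
    html_node('table#advanced') %>%
    html_table()
```

Here `length(direct)` is 0, while `advanced` holds the two rows from the comment. Collapsing all comments into one string before reparsing keeps the pipeline short, at the cost of reparsing comments you do not need.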
OK, I got it.

library(stringi)
library(knitr)
library(rvest)


any_version_html <- function(x){
  XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = TRUE, simplify = TRUE)))

e <- html_table(any_version_html(d))


> kable(summary(e),'rst')
====== ========== ====
Length Class      Mode
====== ========== ====
9      data.frame list
2      data.frame list
24     data.frame list
21     data.frame list
28     data.frame list
28     data.frame list
27     data.frame list
30     data.frame list
27     data.frame list
27     data.frame list
28     data.frame list
28     data.frame list
27     data.frame list
30     data.frame list
27     data.frame list
27     data.frame list
3      data.frame list
====== ========== ====


kable(e[[1]],'rst')


=== ================ === ==== === ================== === === =================================
No. Player           Pos Ht    Wt Birth Date         Â   Exp College
=== ================ === ==== === ================== === === =================================
 41 Cameron Bairstow PF  6-9  250 December 7, 1990   au  R   University of New Mexico
  0 Aaron Brooks     PG  6-0  161 January 14, 1985   us  6   University of Oregon
 21 Jimmy Butler     SG  6-7  220 September 14, 1989 us  3   Marquette University
 34 Mike Dunleavy    SF  6-9  230 September 15, 1980 us  12  Duke University
 16 Pau Gasol        PF  7-0  250 July 6, 1980       es  13
 22 Taj Gibson       PF  6-9  225 June 24, 1985      us  5   University of Southern California
 12 Kirk Hinrich     SG  6-4  190 January 2, 1981    us  11  University of Kansas
  3 Doug McDermott   SF  6-8  225 January 3, 1992    us  R   Creighton University


## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.

# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = TRUE)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')

=== ============== ===========
 Rk Player         Salary
=== ============== ===========
  1 Derrick Rose   $18,862,875
  2 Carlos Boozer  $13,550,000
  3 Joakim Noah    $12,200,000
  4 Taj Gibson     $8,000,000
  5 Pau Gasol      $7,128,000
  6 Nikola Mirotic $5,305,000
=== ============== ===========
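
The same regex idea can be tried offline on a small string. This is only a sketch under assumptions: `src` is an invented stand-in for the collapsed page source, and base R `regmatches()`/`gregexpr()` are swapped in for the stringi calls so that rvest is the only package needed. Each matched fragment is parsed separately with `read_html()` rather than through the XML package.

```r
library(rvest)

# Hypothetical raw page source (assumption); the second table sits inside a comment
src <- paste0(
  '<h2>Roster</h2>',
  '<table><tr><th>Player</th><th>Pos</th></tr>',
  '<tr><td>Pau Gasol</td><td>PF</td></tr></table>',
  '<h2>Salaries</h2>',
  '<!-- <table><tr><th>Player</th><th>Salary</th></tr>',
  '<tr><td>Derrick Rose</td><td>$18,862,875</td></tr></table> -->'
)

# Non-greedy match of every <table>...</table> span, commented out or not
tables_html <- regmatches(src, gregexpr('<table.*?/table>', src, perl = TRUE))[[1]]

# Parse each fragment on its own and keep the single table it contains
tables <- lapply(tables_html, function(x) html_table(read_html(x))[[1]])

# Titles come from the <h2> tags, with the markup stripped
titles <- regmatches(src, gregexpr('<h2.*?/h2>', src, perl = TRUE))[[1]]
names(tables) <- gsub('<.*?>', '', titles, perl = TRUE)
```

After this, `tables$Salaries` is the table that was hidden in the comment. The lines have to be collapsed into one string first, as the `paste0()` step does above, because `.` does not match a newline by default. Matching names to tables by position only works when every heading has exactly one table, which is why the original slices the headings from 'Roster' to 'Salaries'.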