How can I use R to scrape tables inside HTML comment tags?

Dav*_*ung 1 r html-parsing web-scraping scrape rvest

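I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found that the selector for the table I want is #advanced. However, I noticed that rvest was not picking it up. Looking at the page source, I saw that these tables sit inside HTML comment tags (<!-- ... -->).

What is the best way to get the tables out of the comment tags? Thanks!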

Edit: I am trying to pull the "Advanced" table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

ali*_*ire 6

You can use the XPath comment() function to select the comment nodes, then reparse their contents as HTML:

library(rvest)

# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns

df[, 1:15]
##    Rk           Player Age  G   MP  PER   TS%  3PAr   FTr ORB% DRB% TRB% AST% STL% BLK%
## 1   1        Pau Gasol  34 78 2681 22.7 0.550 0.023 0.317  9.2 27.6 18.6 14.4  0.5  4.0
## 2   2     Jimmy Butler  25 65 2513 21.3 0.583 0.212 0.508  5.1 11.2  8.2 14.4  2.3  1.0
## 3   3      Joakim Noah  29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0  1.2  2.6
## 4   4     Aaron Brooks  30 82 1885 14.4 0.534 0.383 0.213  1.9  7.5  4.8 24.2  1.5  0.6
## 5   5    Mike Dunleavy  34 63 1838 11.6 0.573 0.547 0.181  1.7 12.7  7.3  9.7  1.1  0.8
## 6   6       Taj Gibson  29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7  6.9  1.1  3.2
## 7   7   Nikola Mirotic  23 82 1654 17.9 0.556 0.502 0.455  4.3 21.8 13.3  9.7  1.7  2.4
## 8   8     Kirk Hinrich  34 66 1610  6.8 0.468 0.441 0.131  1.4  6.6  4.1 13.8  1.5  0.6
## 9   9     Derrick Rose  26 51 1530 15.9 0.493 0.325 0.224  2.6  8.7  5.7 30.7  1.2  0.8
## 10 10       Tony Snell  23 72 1412 10.2 0.550 0.531 0.148  2.5 10.9  6.8  6.8  1.2  0.6
## 11 11    E'Twaun Moore  25 56  504 10.3 0.504 0.273 0.144  2.7  7.1  5.0 10.4  2.1  0.9
## 12 12   Doug McDermott  23 36  321  6.1 0.480 0.383 0.140  2.1 12.2  7.3  3.0  0.6  0.2
## 13 13    Nazr Mohammed  37 23  128  8.7 0.431 0.000 0.100  9.6 22.3 16.1  3.6  1.6  2.8
## 14 14 Cameron Bairstow  24 18   64  2.1 0.309 0.000 0.357 10.5  3.3  6.8  2.2  1.6  1.1
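
For anyone who wants to try the idea without hitting the site, here is a minimal offline sketch of the same technique. The miniature page, the `all_advanced` wrapper id, and the variable names `page`, `direct`, and `advanced` are made up for illustration; only the `table#advanced` selector comes from the question.

```r
library(rvest)

# Hypothetical miniature page: the table is wrapped in an HTML comment,
# mimicking the structure described in the question.
page <- read_html('
<html><body>
  <div id="all_advanced">
    <!--
    <table id="advanced">
      <tr><th>Player</th><th>PER</th></tr>
      <tr><td>Pau Gasol</td><td>22.7</td></tr>
      <tr><td>Jimmy Butler</td><td>21.3</td></tr>
    </table>
    -->
  </div>
</body></html>')

# A plain CSS lookup comes back empty: commented-out markup is a single
# comment node, not a set of element nodes.
direct <- html_nodes(page, 'table#advanced')
length(direct)

# Extract the comment text and parse it again as HTML,
# after which the table can be selected as usual.
advanced <- page %>%
    html_nodes(xpath = '//comment()') %>%
    html_text() %>%
    paste(collapse = '') %>%
    read_html() %>%
    html_node('table#advanced') %>%
    html_table()

advanced
```

The empty `direct` result is the behaviour the question ran into, and the second pipeline is the same sequence of steps as above applied to the toy page.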


Car*_*eri 3

OK, I got it.

library(stringi)
library(knitr)
library(rvest)

any_version_html <- function(x){
    XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = TRUE, simplify = TRUE)))

e <- html_table(any_version_html(d))

kable(summary(e),'rst')
## ======  ==========  ====
## Length  Class       Mode
## ======  ==========  ====
## 9       data.frame  list
## 2       data.frame  list
## 24      data.frame  list
## 21      data.frame  list
## 28      data.frame  list
## 28      data.frame  list
## 27      data.frame  list
## 30      data.frame  list
## 27      data.frame  list
## 27      data.frame  list
## 28      data.frame  list
## 28      data.frame  list
## 27      data.frame  list
## 30      data.frame  list
## 27      data.frame  list
## 27      data.frame  list
## 3       data.frame  list
## ======  ==========  ====

kable(e[[1]],'rst')
## ===  ================  ===  ====  ===  ==================  ===  ===  =================================
## No.  Player            Pos  Ht     Wt  Birth Date          Â    Exp  College
## ===  ================  ===  ====  ===  ==================  ===  ===  =================================
##  41  Cameron Bairstow  PF   6-9   250  December 7, 1990    au   R    University of New Mexico
##   0  Aaron Brooks      PG   6-0   161  January 14, 1985    us   6    University of Oregon
##  21  Jimmy Butler      SG   6-7   220  September 14, 1989  us   3    Marquette University
##  34  Mike Dunleavy     SF   6-9   230  September 15, 1980  us   12   Duke University
##  16  Pau Gasol         PF   7-0   250  July 6, 1980        es   13
##  22  Taj Gibson        PF   6-9   225  June 24, 1985       us   5    University of Southern California
##  12  Kirk Hinrich      SG   6-4   190  January 2, 1981     us   11   University of Kansas
##   3  Doug McDermott    SF   6-8   225  January 3, 1992     us   R    Creighton University

# Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.

# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = TRUE)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
## ===  ==============  ===========
##  Rk  Player          Salary
## ===  ==============  ===========
##   1  Derrick Rose    $18,862,875
##   2  Carlos Boozer   $13,550,000
##   3  Joakim Noah     $12,200,000
##   4  Taj Gibson      $8,000,000
##   5  Pau Gasol       $7,128,000
##   6  Nikola Mirotic  $5,305,000
## ===  ==============  ===========
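
A self-contained sketch of the same regex idea on a toy string, in case the live page has changed. The `src` string and the names `chunks` and `tables` are hypothetical. It also swaps in two things that are not part of the code above: each chunk is parsed with `rvest::read_html()` instead of `XML::htmlParse()`, and the list is named by each table's `id` attribute rather than by the `<h2>` headings, which assumes every table carries an id.

```r
library(stringi)
library(rvest)

# Hypothetical page source: one visible table and one commented-out table.
# It is built as a single line because '.' in the regex below does not
# match newlines by default (the same reason the lines are collapsed above).
src <- paste0(
    '<html><body>',
    '<h2>Roster</h2>',
    '<table id="roster"><tr><th>No.</th><th>Player</th></tr>',
    '<tr><td>16</td><td>Pau Gasol</td></tr></table>',
    '<h2>Salaries</h2>',
    '<!-- <table id="salaries"><tr><th>Rk</th><th>Player</th></tr>',
    '<tr><td>1</td><td>Derrick Rose</td></tr></table> -->',
    '</body></html>'
)

# Non-greedy match: one element per <table>...</table> block,
# whether or not the block sits inside a comment.
chunks <- stri_extract_all_regex(src, '<table.*?</table>')[[1]]

# Parse each chunk separately and convert its table to a data frame.
tables <- lapply(chunks, function(x) {
    html_table(html_node(read_html(x), 'table'))
})

# Assumption: every table has an id attribute, so it can serve as the name.
names(tables) <- vapply(chunks, function(x) {
    html_attr(html_node(read_html(x), 'table'), 'id')
}, character(1), USE.NAMES = FALSE)

tables$salaries
```

Because the regex works on the raw text, it does not care whether a table is commented out, which is why this approach picks up every table on the page at once.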