Using rvest to scrape an HTML table that contains spans

use*_*905 5 r html-table web-scraping rvest

I am using rvest to extract the table from the following page:

https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin

The following code works:

library(rvest)  # also provides the %>% pipe

URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
  read_html %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(trim=TRUE)

However, the margin and president-name columns contain some strange values. The reason is that the page source has cells like this:

<td><span style="display:none">00.001</span>−10.44%</td>

So instead of getting -10.44%, I get 00.001'10.44%

How can I fix this?

Jot*_*ota 3

One option is to target the problematic columns individually and replace them.

The margin columns can be targeted with xpath:

# get the html
html <- URL %>%
  read_html()

# Example using the first margin column (column # 6)
html %>%
  html_nodes(xpath = '//table[2]') %>%       # get table 2
  html_nodes(xpath = '//td[6]/text()') %>%   # get column 6 using text()
  iconv("UTF-8", "UTF-8")                    # to convert "âˆ’" to "−"
# [1] "−10.44%" "−3.00%"  "−0.83%"  "−0.51%"  "0.09%"   "0.17%"   "0.57%"
# [8] "0.70%"   "1.45%"   "2.06%"   "2.46%"   "3.01%"   "3.12%"   "3.86%"
#[15] "4.31%"   "4.48%"   "4.79%"   "5.32%"   "5.56%"   "6.05%"   "6.12%"
#[22] "6.95%"   "7.27%"   "7.50%"   "7.72%"   "8.51%"   "8.53%"   "9.74%"
#[29] "9.96%"   "10.08%"  "10.13%"  "10.85%"  "11.80%"  "12.20%"  "12.25%"
#[36] "14.20%"  "14.44%"  "15.40%"  "17.41%"  "17.76%"  "17.81%"  "18.21%"
#[43] "18.83%"  "22.58%"  "23.15%"  "24.26%"  "25.22%"  "26.17%"

Do the same for the other margin column. I used iconv to convert "âˆ’" to "−" because this is an encoding issue, but you could use a substitution-based solution instead (e.g. with sub).
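
The substitution-based alternative could look roughly like the sketch below. It assumes the scraped strings already contain a proper Unicode minus sign (U+2212); the `margins` vector holds hypothetical sample values rather than the real scrape, and `cleaned` and `margin_num` are names made up for illustration. If the text arrives mis-encoded instead, the pattern passed to `sub` would have to match that garbled sequence.

```r
# Hypothetical sample values standing in for the scraped margin column.
margins <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus sign (U+2212) with an ASCII hyphen-minus.
# fixed = TRUE treats the pattern as a literal string, not a regex.
cleaned <- sub("\u2212", "-", margins, fixed = TRUE)

# Drop the percent sign so the values can be parsed as numbers.
margin_num <- as.numeric(sub("%", "", cleaned, fixed = TRUE))

print(margin_num)
```

Converting to numeric is optional; it is included here only because a numeric column is usually easier to sort and compare than the percent strings.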

To target the column with the presidents' names, you can use xpath again:

html %>%
  html_nodes(xpath = '//table[2]') %>%
  html_nodes(xpath = '//td[3]/a/text()') %>%
  html_text()
# [1] "John Quincy Adams"      "Rutherford Hayes"       "Benjamin Harrison"
# [4] "George W. Bush"         "James Garfield"         "John Kennedy"
# [7] "Grover Cleveland"       "Richard Nixon"          "James Polk"
#[10] "Jimmy Carter"           "George W. Bush"         "Grover Cleveland"
#[13] "Woodrow Wilson"         "Barack Obama"           "William McKinley"
#[16] "Harry Truman"           "Zachary Taylor"         "Ulysses Grant"
#[19] "Bill Clinton"           "William Henry Harrison" "William McKinley"
#[22] "Franklin Pierce"        "Barack Obama"           "Franklin Roosevelt"
#[25] "George H. W. Bush"      "Bill Clinton"           "William Taft"
#[28] "Ronald Reagan"          "Franklin Roosevelt"     "Abraham Lincoln"
#[31] "Abraham Lincoln"        "Dwight Eisenhower"      "Ulysses Grant"
#[34] "James Buchanan"         "Andrew Jackson"         "Martin Van Buren"
#[37] "Woodrow Wilson"         "Dwight Eisenhower"      "Herbert Hoover"
#[40] "Franklin Roosevelt"     "Andrew Jackson"         "Ronald Reagan"
#[43] "Theodore Roosevelt"     "Lyndon Johnson"         "Richard Nixon"
#[46] "Franklin Roosevelt"     "Calvin Coolidge"        "Warren Harding"
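
To show the "replace the problematic columns" step end to end, here is a minimal sketch on a small inline table. The HTML is a hypothetical two-row imitation of the structure described in the question, not the real Wikipedia markup, and the column positions (`td[2]`, `td[3]`) and the names `page`, `tbl_node` and `tbl` apply only to this toy example. The XPath expressions start with `.` so that the search stays inside the selected table; that is a choice made for this sketch.

```r
library(rvest)

# Hypothetical table with hidden sort-key spans, as in the question.
page <- read_html('<table>
  <tr><th>Year</th><th>Winner</th><th>Margin</th></tr>
  <tr><td>1824</td>
      <td><span style="display:none">Adams, John Quincy</span><a href="#">John Quincy Adams</a></td>
      <td><span style="display:none">00.001</span>&#8722;10.44%</td></tr>
  <tr><td>1960</td>
      <td><span style="display:none">Kennedy, John</span><a href="#">John Kennedy</a></td>
      <td><span style="display:none">00.006</span>0.17%</td></tr>
</table>')

tbl_node <- html_node(page, "table")

# Parse the table as usual; the hidden span text is still mixed into the cells.
tbl <- html_table(tbl_node)

# Overwrite the affected columns with text pulled from the visible nodes only.
# text() selects the text directly inside <td>, skipping the hidden <span>.
tbl$Margin <- tbl_node %>%
  html_nodes(xpath = ".//td[3]/text()") %>%
  html_text()

# The visible name lives inside the <a> element.
tbl$Winner <- tbl_node %>%
  html_nodes(xpath = ".//td[2]/a") %>%
  html_text()

print(tbl)
```

This only works as a direct column assignment when every row yields exactly one matching node; a row with an empty cell or a missing link would make the vectors shorter than the table and need separate handling.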