I am using rvest to extract the table from the following page:
https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin
The following code works:
library(rvest)  # also provides the %>% pipe

URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[2]] %>%               # keep the second table on the page
  html_table(trim = TRUE)
However, the margin and president-name columns contain some odd values. The reason is that the page source contains cells like this:
<td><span style="display:none">00.001</span>−10.44%</td>
So instead of getting −10.44%, I get 00.001âˆ’10.44%: the hidden sort key followed by a mis-encoded minus sign.
How can I fix this?
One option is to target and replace the problematic columns individually.

The margin columns can be targeted with XPath:
# get the html
html <- URL %>%
  read_html()

# Example using the first margin column (column # 6)
html %>%
  html_nodes(xpath = '//table[2]') %>%      # get table 2
  html_nodes(xpath = '//td[6]/text()') %>%  # get column 6 using text()
  iconv("UTF-8", "UTF-8")                   # to convert "âˆ’" to "−"
# [1] "−10.44%" "−3.00%"  "−0.83%"  "−0.51%"  "0.09%"   "0.17%"   "0.57%"
# [8] "0.70%"   "1.45%"   "2.06%"   "2.46%"   "3.01%"   "3.12%"   "3.86%"
#[15] "4.31%"   "4.48%"   "4.79%"   "5.32%"   "5.56%"   "6.05%"   "6.12%"
#[22] "6.95%"   "7.27%"   "7.50%"   "7.72%"   "8.51%"   "8.53%"   "9.74%"
#[29] "9.96%"   "10.08%"  "10.13%"  "10.85%"  "11.80%"  "12.20%"  "12.25%"
#[36] "14.20%"  "14.44%"  "15.40%"  "17.41%"  "17.76%"  "17.81%"  "18.21%"
#[43] "18.83%"  "22.58%"  "23.15%"  "24.26%"  "25.22%"  "26.17%"

Do the same for the other margin column. I used iconv to convert âˆ’ to −, since this is an encoding problem, but you could use a substitution-based solution instead (for example, with sub).
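As a rough illustration of the substitution-based route, here is a minimal sketch that turns the extracted margin strings into numbers. The vector `margins_raw` is a hand-typed sample of the values shown above, not something scraped live, and the names `margins_raw`, `margins_chr` and `margins_num` are made up for this example.

```r
# Sample values as text() returns them; "\u2212" is the Unicode minus sign.
margins_raw <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus with an ASCII hyphen-minus, then drop the "%".
# fixed = TRUE treats both patterns as literal strings, not regexes.
margins_chr <- sub("\u2212", "-", margins_raw, fixed = TRUE)
margins_chr <- sub("%", "", margins_chr, fixed = TRUE)

# Numeric margins, in percentage points.
margins_num <- as.numeric(margins_chr)
```

Converting to numeric is optional; if the column should stay as text, the first `sub()` call alone is enough to normalise the sign.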
To target the president-name column, you can again use XPath:

html %>%
  html_nodes(xpath = '//table[2]') %>%
  html_nodes(xpath = '//td[3]/a/text()') %>%
  html_text()
# [1] "John Quincy Adams"      "Rutherford Hayes"       "Benjamin Harrison"
# [4] "George W. Bush"         "James Garfield"         "John Kennedy"
# [7] "Grover Cleveland"       "Richard Nixon"          "James Polk"
#[10] "Jimmy Carter"           "George W. Bush"         "Grover Cleveland"
#[13] "Woodrow Wilson"         "Barack Obama"           "William McKinley"
#[16] "Harry Truman"           "Zachary Taylor"         "Ulysses Grant"
#[19] "Bill Clinton"           "William Henry Harrison" "William McKinley"
#[22] "Franklin Pierce"        "Barack Obama"           "Franklin Roosevelt"
#[25] "George H. W. Bush"      "Bill Clinton"           "William Taft"
#[28] "Ronald Reagan"          "Franklin Roosevelt"     "Abraham Lincoln"
#[31] "Abraham Lincoln"        "Dwight Eisenhower"      "Ulysses Grant"
#[34] "James Buchanan"         "Andrew Jackson"         "Martin Van Buren"
#[37] "Woodrow Wilson"         "Dwight Eisenhower"      "Herbert Hoover"
#[40] "Franklin Roosevelt"     "Andrew Jackson"         "Ronald Reagan"
#[43] "Theodore Roosevelt"     "Lyndon Johnson"         "Richard Nixon"
#[46] "Franklin Roosevelt"     "Calvin Coolidge"        "Warren Harding"
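To show how the extracted vectors can be written back into the parsed table, here is a small offline sketch. The inline HTML in `snippet` is a hypothetical two-row stand-in that only imitates the hidden-span pattern of the Wikipedia table, so its column positions (2 and 3) and the names `snippet`, `tbl_node` and `elections` are assumptions for illustration, not the real page layout.

```r
library(rvest)

# Hypothetical stand-in for the real table: each problem cell carries a
# hidden sort key in a <span style="display:none"> before the visible text.
snippet <- '<table>
  <tr><th>Year</th><th>Winner</th><th>Margin</th></tr>
  <tr><td>1824</td>
      <td><span style="display:none">Adams, John Quincy</span><a href="#">John Quincy Adams</a></td>
      <td><span style="display:none">00.001</span>\u221210.44%</td></tr>
  <tr><td>1960</td>
      <td><span style="display:none">Kennedy, John</span><a href="#">John Kennedy</a></td>
      <td><span style="display:none">00.100</span>0.17%</td></tr>
</table>'

page     <- read_html(snippet)
tbl_node <- html_nodes(page, "table")[[1]]

# html_table() keeps the hidden span text, so these two columns start out dirty.
elections <- html_table(tbl_node, trim = TRUE)

# A leading "." makes the XPath relative, so only this table is searched.
winner <- html_text(html_nodes(tbl_node, xpath = ".//td[2]/a/text()"))
margin <- html_text(html_nodes(tbl_node, xpath = ".//td[3]/text()"))

# Overwrite the dirty columns with the cleanly extracted values.
elections$Winner <- winner
elections$Margin <- margin
```

This only works when each extracted vector has exactly one value per table row, so it is worth comparing `length(winner)` with `nrow(elections)` before assigning on the real page.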