Importing Wikipedia tables in R

kar*_*los 15 r dataframe

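I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, because it treats the whole page as a table. In Google Spreadsheets, I can enter: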

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

This function downloads the 3rd table from that page, which lists all the counties of Michigan's Upper Peninsula (the UP).

Is there something similar in R? Or can it be created with a user-defined function?

And*_*rie 13

The function readHTMLTable in the XML package is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names to get information about each element:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL" 
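As the printed table shows, the header ends up in the first data row and the numbers arrive as text with thousands separators. The following is a minimal sketch of tidying such a table. The small data frame `raw` is a stand-in for `doc[[6]]`, and the helper name `clean_table` is hypothetical, not part of the XML package.

```r
# Stand-in for doc[[6]]: header in row 1, numbers stored as text with commas.
raw <- data.frame(
  V1 = c("County", "Alger", "Baraga"),
  V2 = c("Population", "9,862", "8,735"),
  V3 = c("Land Area (sq mi)", "918", "904"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: promote the first row to column names and
# convert every column except the first to numeric.
clean_table <- function(df) {
  df[] <- lapply(df, as.character)                 # factors -> character
  names(df) <- unlist(df[1, ], use.names = FALSE)  # first row becomes header
  df <- df[-1, , drop = FALSE]                     # drop the old header row
  rownames(df) <- NULL
  # Strip thousands separators, then convert to numbers.
  df[-1] <- lapply(df[-1], function(x) as.numeric(gsub(",", "", x)))
  df
}

counties <- clean_table(raw)
str(counties)
```

Because `names(doc)` carries the table captions, a table can also be picked by name, for example `doc[["Upper Peninsula Land Area and Population Density by County"]]`, which may be more robust than a numeric index if the page layout changes.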

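  • This solution no longer works since Wikipedia moved to secure connections. Any clue how to get it working? (5 upvotes)
  • I tried the code `readHTMLTable(doc = "https://en.wikipedia.org/wiki/Gross_domestic_product")` but got the error "XML content does not seem to be XML". I am guessing the `https` could be the problem; how can I fix it? (2 upvotes)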

Sha*_*bho 7

Here is a solution that works with the secure (https) link:

install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)


sch*_*nee 7

Building on Andrie's answer and addressing SSL, if you can accept one additional library dependency:

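library(httr)  # performs the HTTPS request
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

# Download the page over HTTPS first, then pass the HTML text to readHTMLTable
r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

# doc[6] is a one-element list; use doc[[6]] to get the data frame itself
doc[6]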
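To come back to the original question about a user-defined function, the approach above could be wrapped roughly as follows. This is only a sketch under assumptions: the function names `tables_from_html` and `import_html_table` are hypothetical, and the table index that matches a given page may differ from the one Google Spreadsheets reports and can change as the page is edited.

```r
library(httr)
library(XML)

# Hypothetical helper: parse HTML text and return all tables as data frames.
tables_from_html <- function(html) {
  parsed <- htmlParse(html, asText = TRUE)  # treat the input as HTML content
  readHTMLTable(parsed, stringsAsFactors = FALSE)
}

# Hypothetical counterpart to ImportHtml(url, "table", index).
import_html_table <- function(url, index) {
  r <- GET(url)
  stop_for_status(r)                        # fail early on HTTP errors
  tables <- tables_from_html(content(r, as = "text", encoding = "UTF-8"))
  if (index < 1 || index > length(tables)) {
    stop("index must be between 1 and ", length(tables))
  }
  tables[[index]]
}

# Example call (needs network access, so it is left commented out):
# counties <- import_html_table(
#   "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", 6)
```

Splitting the parsing step from the download keeps the parsing testable on a local HTML string without touching the network.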