Scrape the Wikipedia "Periodic table" page and all of its links

Tal*_*ili 8 xml r web-scraping

I would like to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table

so that the output of my R code is a table with the following columns:

  • Chemical element short name (symbol)
  • Chemical element full name
  • URL of the chemical element's Wikipedia page

(obviously, one row per chemical element)

I am trying to use the XML package to get the values inside the page, but I seem to be stuck at the very beginning, so I would appreciate an example of how to do this (and/or links to relevant examples).

library(XML)
library(RCurl)  # getURLContent() comes from RCurl, not XML
base_url <- "http://en.wikipedia.org/wiki/Periodic_table"
base_html <- getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"

G. *_*eck 13

Try this:

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))

# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])

m3 <- m2[ix, 1:3]
colnames(m3) <- c("URL", "Name", "Symbol")
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
m3[,2] <- sub(" .*", "", m3[,2])

A bit of the output:

> dim(m3)
[1] 118   3
> head(m3)
     URL                                      Name        Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen"  "Hydrogen"  "H"   
[2,] "http://en.wikipedia.org/wiki/Helium"    "Helium"    "He"  
[3,] "http://en.wikipedia.org/wiki/Lithium"   "Lithium"   "Li"  
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"  
[5,] "http://en.wikipedia.org/wiki/Boron"     "Boron"     "B"   
[6,] "http://en.wikipedia.org/wiki/Carbon"    "Carbon"    "C"   
Run Code Online (Sandbox Code Playgroud)
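If a plain data frame is preferred over the character matrix, the result converts directly (a minimal sketch; the `m3` below is a small two-row stand-in for the 118-row matrix produced above):

```r
# stand-in for the first rows of the scraped matrix m3 (see output above)
m3 <- rbind(
  c("http://en.wikipedia.org/wiki/Hydrogen", "Hydrogen", "H"),
  c("http://en.wikipedia.org/wiki/Helium",   "Helium",   "He")
)
colnames(m3) <- c("URL", "Name", "Symbol")

# as.data.frame() with stringsAsFactors = FALSE keeps the columns as character
df <- as.data.frame(m3, stringsAsFactors = FALSE)
df
```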

We can make this more compact by starting from Jeffrey's xpath expression (since it almost picks out exactly the elements already) and adding a qualification to it that makes it exact. In that case xpathSApply can be used, eliminating the need for do.call or the plyr package. The last bit of fixing up odds and ends is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all a tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]

# nicer column names, fix up URLs, fix up Mercury.
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])

View(M)

  • Added a second solution based on Jeffrey's xpath expression and an enhancement of my earlier code. (3 upvotes)

Jef*_*een 4

Tal -- I thought this would be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!


But alas, that's not what you want:

library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)   # (was readHTMLTable(html), but 'html' was never defined)

# ... look through the list to find the one you want...

table = tables[3]
table
$`NULL`
         Group #    1    2    3     4     5     6     7     8     9    10    11    12     13     14     15     16     17     18
1         Period      <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
2              1   1H       2He  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
3              2  3Li  4Be         5B    6C    7N    8O    9F  10Ne  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
4              3 11Na 12Mg       13Al  14Si   15P   16S  17Cl  18Ar  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
5              4  19K 20Ca 21Sc  22Ti   23V  24Cr  25Mn  26Fe  27Co  28Ni  29Cu  30Zn   31Ga   32Ge   33As   34Se   35Br   36Kr
6              5 37Rb 38Sr  39Y  40Zr  41Nb  42Mo  43Tc  44Ru  45Rh  46Pd  47Ag  48Cd   49In   50Sn   51Sb   52Te    53I   54Xe
7              6 55Cs 56Ba    *  72Hf  73Ta   74W  75Re  76Os  77Ir  78Pt  79Au  80Hg   81Tl   82Pb   83Bi   84Po   85At   86Rn
8              7 87Fr 88Ra   ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9                <NA> <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
10 * Lanthanoids 57La 58Ce 59Pr  60Nd  61Pm  62Sm  63Eu  64Gd  65Tb  66Dy  67Ho  68Er   69Tm   70Yb   71Lu          <NA>   <NA>
11  ** Actinoids 89Ac 90Th 91Pa   92U  93Np  94Pu  95Am  96Cm  97Bk  98Cf  99Es 100Fm  101Md  102No  103Lr          <NA>   <NA>

The names are gone, and the atomic numbers have been mashed into the symbols.
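If one did want to salvage the readHTMLTable() result anyway, the fused cells could in principle be split back apart with a regular expression (a hypothetical cleanup sketch; the sample cell values are copied from the output above):

```r
# cells like "11Na" fuse the atomic number and the symbol together
cells  <- c("1H", "2He", "11Na", "104Rf")

# peel off the leading digits as the atomic number ...
number <- as.integer(sub("^([0-9]+).*$", "\\1", cells))

# ... and strip them to recover the bare symbol
symbol <- sub("^[0-9]+", "", cells)

data.frame(number, symbol)
```

This still would not recover the full element names or URLs, which is why the link-based approaches below are a better fit for the question.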


So, back to the drawing board...


My DOM-walking skills aren't that strong, so this isn't very pretty. It grabs every link inside a table cell, keeps only those with a 'title' attribute (that's where the symbols live), and sticks what you want into a data.frame. It also grabs every other such link on the page, but we're lucky that the elements are the first 118 such links:

library(XML)
library(plyr)

url = 'http://en.wikipedia.org/wiki/Periodic_table'

# don't forget to parse the HTML, doh!

doc = htmlParse(url)

# get every link in a table cell:

links = getNodeSet(doc, '//table/tr/td/a')

# make a data.frame for each node with non-blank text, link, and 'title' attribute:

df = ldply(links, function(x) {
            text = xmlValue(x)
            if (text=='') text=NULL

            symbol = xmlGetAttr(x, 'title')
            link = xmlGetAttr(x, 'href')
            if (!is.null(text) & !is.null(symbol) & !is.null(link))
                data.frame(symbol, text, link)
        } )

# only keep the actual elements -- we're lucky they're first!

df = head(df, 118)

head(df)
     symbol text            link
1  Hydrogen    H  /wiki/Hydrogen
2    Helium   He    /wiki/Helium
3   Lithium   Li   /wiki/Lithium
4 Beryllium   Be /wiki/Beryllium
5     Boron    B     /wiki/Boron
6    Carbon    C    /wiki/Carbon
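Note that the links in this result are relative (e.g. /wiki/Hydrogen), while the question asks for full URLs; they can be completed by prepending the host, as the other answer does (a small sketch on sample values from the output above):

```r
# relative links as returned by xmlGetAttr(x, 'href') above
link <- c("/wiki/Hydrogen", "/wiki/Helium")

# prepend the site root to get absolute URLs
full <- paste0("http://en.wikipedia.org", link)
full
```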