rvest: reading a table with cells that span multiple rows

cor*_*ory 8 r web-scraping rvest

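I'm trying to scrape an irregular table from Wikipedia with rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering whether there is a workaround.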

It looks like this: [image]

My code:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
  html_table(fill=TRUE) %>%  # fill=FALSE yields the same results
  .[[1]]

This returns: [image]

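There are several errors. For example, row 4 under "City" should be "Mesa", not "Chicago Cubs". I'm fine with blank cells, since I can fill them in as needed, but wrong data is a problem. Any help is much appreciated.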

den*_*nis 8

I have a way to code it. It's not perfect and it's a bit long, but it works:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"

# get the rows of the table
lines <- url %>%
  read_html() %>%
  html_nodes(xpath = "//table[starts-with(@class, 'wikitable')]") %>%
  html_nodes(xpath = 'tbody/tr')

# define the empty table: the header row gives the number of columns
ncol <- lines %>%
  .[[1]] %>%
  html_children() %>%
  length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow, ncol = ncol))

# fill the table
for (i in seq_len(nrow)) {
  # get the text of each cell in the row
  linecontent <- lines[[i]] %>%
    html_children() %>%
    html_text() %>%
    gsub("\n", "", .)

  # assign the content to the columns that are still free
  colselect <- is.na(table[i, ])
  table[i, colselect] <- linecontent

  # get the rowspan of each cell in the row
  repetition <- lines[[i]] %>%
    html_children() %>%
    html_attr("rowspan") %>%
    ifelse(is.na(.), 1, .) %>% # no rowspan attribute: the cell covers a single row
    as.numeric()

  # repeat the cells that span several rows down the table
  for (j in seq_along(repetition)) {
    span <- repetition[j]
    if (span > 1) {
      table[(i+1):(i+span-1), colselect][, j] <- rep(linecontent[j], span-1)
    }
  }
}

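The idea is to store the table's HTML rows in the variable lines by selecting the tr nodes. I then create an empty table: the number of columns is the number of children of the first row (since it holds the headers), and the number of rows is the length of lines. I fill it manually in a for loop (I found no better way here).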
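As a rough illustration of that setup step, here is a minimal sketch on a small inline table instead of the Wikipedia page. The HTML string and the names rows, n_col, n_row and grid are made up for the example; only the rvest calls come from the code above.

```r
library(rvest)

# Hypothetical three-column table; "East" spans two rows.
html <- '
<table class="wikitable">
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</table>'

# One node per <tr>, header row included.
rows <- read_html(html) %>% html_nodes(xpath = "//tr")

# The header row has one cell per column, so it gives the column count.
n_col <- length(html_children(rows[[1]]))
n_row <- length(rows)

# Every cell starts as NA; NA later marks a cell that is still free.
grid <- as.data.frame(matrix(nrow = n_row, ncol = n_col))
dim(grid)
```

The inline table has no tbody element, so the sketch selects //tr directly rather than tbody/tr as in the Wikipedia version.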

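The difficulty is that the number of cells a row provides changes when a multi-row cell from an earlier row already occupies one of its columns. For example: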

  lines[[3]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)

gives only 5 values:

[1] "Arizona League Athletics Gold" "Oakland Athletics"             "Mesa"                          "Fitch Park"                   
[5] "10,000"  

rather than 6, because the first column, East, spans 8 rows. The East value appears only in the first row it spans.
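The same effect can be reproduced on a toy table. This is only a sketch: the HTML string and the names rows and cells_per_row are invented for the example. It counts the cells physically present in each tr and reads the rowspan attribute of one row.

```r
library(rvest)

# Hypothetical three-column table; "East" spans two rows.
html <- '
<table class="wikitable">
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</table>'

rows <- read_html(html) %>% html_nodes(xpath = "//tr")

# Number of cells physically present in each <tr>.
# The third row is one cell short because "East" is declared in the row above.
cells_per_row <- vapply(rows, function(r) length(html_children(r)), integer(1))
cells_per_row

# rowspan of each cell in the second row; NA means the attribute is absent.
html_attr(html_children(rows[[2]]), "rowspan")
```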

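The trick is to repeat cells downward in the table when they have a rowspan attribute (meaning they span several rows). That way, only the NA columns are selected in the following rows, so the number of text values an HTML row provides matches the number of free columns in the table being filled.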

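This is done with the colselect variable, a logical vector that marks the free cells of the current row before that row's cells are repeated downward.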
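For illustration, the same trick can be wrapped in a small function. This is a sketch under assumptions, not the function from the answer: fill_rowspans is a hypothetical name, the table is assumed to have no colspan cells and no nested tables, and the header row is assumed to have one cell per column.

```r
library(rvest)

# Hypothetical helper: expand the rowspan cells of one <table> node into a full grid.
# Assumes no colspan, no nested tables, and one header cell per column.
fill_rowspans <- function(table_node) {
  rows  <- html_nodes(table_node, xpath = ".//tr")
  n_col <- length(html_children(rows[[1]]))
  grid  <- matrix(NA_character_, nrow = length(rows), ncol = n_col)

  for (i in seq_along(rows)) {
    cells <- html_children(rows[[i]])
    text  <- trimws(html_text(cells))
    span  <- as.integer(html_attr(cells, "rowspan"))
    span[is.na(span)] <- 1L  # no rowspan attribute: the cell covers one row

    # Columns not yet claimed by a rowspan cell from an earlier row.
    free <- which(is.na(grid[i, ]))
    for (j in seq_along(cells)) {
      # Write the cell into its own row and every row it spans.
      last <- min(i + span[j] - 1L, nrow(grid))
      grid[i:last, free[j]] <- text[j]
    }
  }
  as.data.frame(grid, stringsAsFactors = FALSE)
}

# Made-up example table; "East" spans two rows.
html <- '
<table class="wikitable">
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</table>'

out <- fill_rowspans(html_node(read_html(html), xpath = "//table"))
out
```

Because the spanning cell is written into every row it covers straight away, the free columns of each later row are simply the cells that are still NA. As far as I know, rvest 1.0.0 and later also expand rowspan and colspan inside html_table() itself, so a plain html_table() call is worth trying first on a recent version.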

Result:

         V1                             V2                   V3         V4                                 V5       V6
1  Division                           Team      MLB Affiliation       City                            Stadium Capacity
2      East          Arizona League Angels   Los Angeles Angels      Tempe               Tempe Diablo Stadium    9,785
3      East  Arizona League Athletics Gold    Oakland Athletics       Mesa                         Fitch Park   10,000
4      East Arizona League Athletics Green    Oakland Athletics       Mesa                         Fitch Park   10,000
5      East          Arizona League Cubs 1         Chicago Cubs       Mesa                         Sloan Park   15,000
6      East          Arizona League Cubs 2         Chicago Cubs       Mesa                         Sloan Park   15,000
7      East    Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick   11,000
8      East    Arizona League Giants Black San Francisco Giants Scottsdale                 Scottsdale Stadium   12,000
9      East   Arizona League Giants Orange San Francisco Giants Scottsdale                 Scottsdale Stadium   12,000
10  Central    Arizona League Brewers Gold    Milwaukee Brewers    Phoenix  American Family Fields of Phoenix    8,000
11  Central Arizona League Dodgers Lasorda  Los Angeles Dodgers    Phoenix                    Camelback Ranch   12,000
12  Central    Arizona League Indians Blue    Cleveland Indians   Goodyear                  Goodyear Ballpark   10,000
13  Central        Arizona League Padres 2     San Diego Padres     Peoria              Peoria Sports Complex   12,882
14  Central            Arizona League Reds      Cincinnati Reds   Goodyear                  Goodyear Ballpark   10,000
15  Central       Arizona League White Sox    Chicago White Sox    Phoenix                    Camelback Ranch   12,000
16     West    Arizona League Brewers Blue    Milwaukee Brewers    Phoenix  American Family Fields of Phoenix    8,000
17     West    Arizona League Dodgers Mota  Los Angeles Dodgers    Phoenix                    Camelback Ranch   12,000
18     West     Arizona League Indians Red    Cleveland Indians   Goodyear                  Goodyear Ballpark   10,000
19     West        Arizona League Mariners     Seattle Mariners     Peoria              Peoria Sports Complex   12,882
20     West        Arizona League Padres 1     San Diego Padres     Peoria              Peoria Sports Complex   12,882
21     West         Arizona League Rangers        Texas Rangers   Surprise                   Surprise Stadium   10,500
22     West          Arizona League Royals   Kansas City Royals   Surprise                   Surprise Stadium   10,500

Edit

I made a shorter version as a function, with more explanation here