R: Web scraping yahoo.finance after the 2019 change

Question

L.C*_*.C. 5 r web-scraping yahoo-finance rvest

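For a long time I have been happily scraping yahoo.finance pages with code largely borrowed from other Stack Overflow answers, and it worked well. In the last few weeks, however, Yahoo changed its tables to collapsible/expandable tables. This broke the code, and despite several days of effort I have not been able to fix it.

Here is an example of the code that people have used for years (different people then parse and process the result in different ways).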

library(rvest)
library(tidyverse)

# Create a URL string
myURL <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Create a data frame called df to hold the income statement
df <- myURL %>% 
  read_html() %>% 
  html_table(header = TRUE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

Can anyone help?

EDIT for more clarity:

If you run the above and then look at df, you get

# A tibble: 0 x 0

For an example of the expected result, we can try another page that Yahoo has not changed, such as the following:

# Create a URL string
myURL2 <-  "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

df2 <- myURL2 %>% 
  read_html() %>% 
  html_table(header = FALSE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

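If you look at df2, you get 59 observations of two variables, which is the main table on that page, starting with

Market Cap (intraday) 5 [value here] Enterprise Value 3 [value here] and so on...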

Answer 1

QHa*_*arr 6

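This may seem a bit round the houses, but I wanted to avoid much of what I suspect is dynamic on the page (for example, many of the class names) and provide something that may have a slightly longer shelf life.

Your code fails, in part, because there is no table element housing that data. Instead, you can gather the "rows" of the desired output table using the more stable-looking class attribute fi-row. Within each row, you can then gather the columns by matching elements that have either a title attribute or data-test='fin-col', relative to the parent row node.

I use a regex to match the dates (as these change over time) and combine them with the two static headers to provide the final data frame headers for output. I limit the regex to the text of a single node, which I know should contain pattern matches for only the required dates.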
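
To make that selection logic concrete without hitting the live site, here is a minimal offline sketch in Python using only the standard library. The markup in SAMPLE is hypothetical (made-up values imitating a div-based layout, not copied from Yahoo), and RowCollector is an illustrative helper, not part of any library. It shows that such markup contains no table elements, and that rows can still be rebuilt from the fi-row containers and their title / data-test='fin-col' descendants.

```python
from html.parser import HTMLParser

# Hypothetical markup with made-up values; the real page differs.
SAMPLE = """
<div class="fi-row">
  <div title="Total Revenue"><span>Total Revenue</span></div>
  <div data-test="fin-col"><span>100</span></div>
  <div data-test="fin-col"><span>200</span></div>
</div>
<div class="fi-row">
  <div title="Cost of Revenue"><span>Cost of Revenue</span></div>
  <div data-test="fin-col"><span>60</span></div>
  <div data-test="fin-col"><span>70</span></div>
</div>
"""


class RowCollector(HTMLParser):
    """Counts <table> tags and collects cell text per .fi-row element.

    Assumes well-nested markup without unclosed void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.rows = []
        self.depth = 0
        self.cell_depth = None  # nesting depth of the cell being captured
        self.cell_text = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        attrs = dict(attrs)
        if tag == "table":
            self.table_count += 1
        if "fi-row" in (attrs.get("class") or "").split():
            self.rows.append([])  # a new output row starts here
        elif self.rows and self.cell_depth is None and (
            "title" in attrs or attrs.get("data-test") == "fin-col"
        ):
            self.cell_depth = self.depth
            self.cell_text = []

    def handle_endtag(self, tag):
        if self.cell_depth == self.depth:
            # The cell element just closed: store its accumulated text.
            self.rows[-1].append("".join(self.cell_text).strip())
            self.cell_depth = None
        self.depth -= 1

    def handle_data(self, data):
        if self.cell_depth is not None:
            self.cell_text.append(data)


collector = RowCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.table_count)
print(collector.rows)
```

With no table elements to find, a table-based extractor has nothing to return, which is consistent with the empty tibble in the question.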
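
The header step can be tried in isolation. The sketch below assumes the header node's text arrives as one concatenated string; header_text is a hypothetical stand-in with made-up dates, not text taken from the page.

```python
import re

# Hypothetical concatenated text of the single header node.
header_text = "BreakdownTTM9/30/20199/30/20189/30/2017"

# One or two digit month and day, four digit year.
date_pattern = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

# The two static headers come first, then whichever dates were found.
headers = ["Breakdown", "TTM"] + date_pattern.findall(header_text)
print(headers)
```

Restricting the search to one node's text matters here: run against the whole page, the same pattern could pick up unrelated dates and produce more headers than there are columns.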

R:

library(rvest)
library(stringr)
library(magrittr)

page <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
nodes <- page %>%html_nodes(".fi-row")
df = NULL

for(i in nodes){
  r <- list(i %>%html_nodes("[title],[data-test='fin-col']")%>%html_text())
  df <- rbind(df,as.data.frame(matrix(r[[1]], ncol = length(r[[1]]), byrow = TRUE), stringsAsFactors = FALSE))
}

matches <- str_match_all(page%>%html_node('#Col1-3-Financials-Proxy')%>%html_text(),'\\d{1,2}/\\d{1,2}/\\d{4}')  
headers <- c('Breakdown','TTM', matches[[1]][,1]) 
names(df) <- headers
View(df)

py:

import requests, re
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
soup = bs(r.content, 'lxml')
results = []

for row in soup.select('.fi-row'):
    results.append([i.text for i in row.select('[title],[data-test="fin-col"]')])

p = re.compile(r'\d{1,2}/\d{1,2}/\d{4}')
headers = ['Breakdown','TTM']
headers.extend(p.findall(soup.select_one('#Col1-3-Financials-Proxy').text))
df = pd.DataFrame(results, columns = headers)
print(df)

Answer 2

L.C*_*.C. 1

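As mentioned in the comments above, here is an alternative that attempts to handle the different table sizes that are published. I worked on this with help from a friend.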

library(rvest)
library(tidyverse)

url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Download the data
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")

# The number of spans in the first (header) row gives the column count
number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()

if (number_of_columns > 1) {
  # Create an empty data frame with the required dimensions
  df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                   stringsAsFactors = FALSE)

  # Fill the table by looping through the rows
  for (i in seq_along(raw_table)) {
    # Find the row name and set it.
    df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
    # Now grab the values
    row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
    for (j in 1:(number_of_columns - 1)) {
      df[i, j+1] <- row_values[j] %>% html_text()
    }
  }
  view(df)
}
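
For comparison with the Python version in the other answer, the sizing idea can be sketched in Python as well: take the column count from the first row, preallocate a grid, then fill it row by row. The rows list and the fill_grid helper are hypothetical and use made-up values; leaving short rows padded with None is an assumption of this sketch, not something the R code above guarantees.

```python
def fill_grid(rows):
    """Preallocate a grid sized from the first row, then fill it in."""
    n_cols = len(rows[0]) if rows else 0
    grid = [[None] * n_cols for _ in rows]
    for i, row in enumerate(rows):
        # Ignore anything beyond the header width; short rows keep None.
        for j, value in enumerate(row[:n_cols]):
            grid[i][j] = value
    return grid


# Hypothetical scraped rows; the last one is missing a value.
rows = [
    ["Breakdown", "TTM", "9/30/2019"],
    ["Total Revenue", "100", "200"],
    ["Net Income", "50"],
]
grid = fill_grid(rows)
print(grid)
```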

Archived:	6 years, 5 months ago
Views:	2866
Last recorded:	6 years, 3 months ago