Fra*_*268 2 finance r web-scraping
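I am interested in using R to analyze the balance sheet, income statement, and cash flow statement from Yahoo Finance for multiple tickers.
I have seen that there are R packages that pull information from Yahoo Finance, but all the examples I have seen deal with historical stock prices. Is there a way to use R to extract the historical information in these statements?
For example, for Apple (AAPL), the statements can be retrieved at the following links:
Essentially, the goal is to create three data frames (AAPL_cashflow, AAPL_income & AAPL_balance) that mirror the layout on the website: each row is identified by the financial line item, and the columns are dates.
Does anyone have experience parsing and scraping tables? I think rvest would help with this, right?
Thanks in advance!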
With a few packages from the tidyverse, this should get you started:
library(tidyverse)
library(rvest)
# Read the page, parse every HTML table, and bind the results into one tibble
"https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>%
  read_html() %>%
  html_table() %>%
  map_df(bind_cols) %>%
  as_tibble()
# A tibble: 28 x 5
   X1                                 X2                 X3                 X4                 X5
   <chr>                              <chr>              <chr>              <chr>              <chr>
 1 Revenue                            9/30/2017          9/24/2016          9/26/2015          9/27/20…
 2 Total Revenue                      229,234,000        215,639,000        233,715,000        182,795…
 3 Cost of Revenue                    141,048,000        131,376,000        140,089,000        112,258…
 4 Gross Profit                       88,186,000         84,263,000         93,626,000         70,537,…
 5 Operating Expenses                 Operating Expenses Operating Expenses Operating Expenses Operati…
 6 Research Development               11,581,000         10,045,000         8,067,000          6,041,0…
 7 Selling General and Administrative 15,261,000         14,194,000         14,329,000         11,993,…
 8 Non Recurring                      -                  -                  -                  -
 9 Others                             -                  -                  -                  -
10 Total Operating Expenses           167,890,000        155,615,000        162,485,000        130,292…
# ... with 18 more rows
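Note that if you want to take the first row and treat it as the column names, add header = TRUE to the html_table call. For example, this gives the finances data frame the dates as column names.
Also, this data frame contains several tables stacked together, so you will need to reshape it before you can work with the data. For example, the variables X2 through X5 are currently character when they should be numeric.
One example might be: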
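# header = TRUE promotes the first row (the dates) to column names
finances <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>%
  read_html() %>%
  html_table(header = TRUE) %>%
  map_df(bind_cols) %>%
  as_tibble()

finances %>%
  # drop thousands separators
  mutate_all(funs(str_replace_all(., ",", ""))) %>%
  # any cell containing "-" becomes NA
  mutate_all(funs(str_replace(., "-", NA_character_))) %>%
  # strip letters from the value columns (section-header rows)
  mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>%
  # convert to numeric; non-numeric leftovers become NA
  mutate_at(vars(-Revenue), funs(as.numeric)) %>%
  # drop every row that contains an NA
  drop_na()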
# A tibble: 14 x 5
   Revenue                                `9/30/2017` `9/24/2016` `9/26/2015` `9/27/2014`
   <chr>                                        <dbl>       <dbl>       <dbl>       <dbl>
 1 Total Revenue                           229234000.  215639000.  233715000.  182795000.
 2 Cost of Revenue                         141048000.  131376000.  140089000.  112258000.
 3 Gross Profit                             88186000.   84263000.   93626000.   70537000.
 4 Research Development                     11581000.   10045000.    8067000.    6041000.
 5 Selling General and Administrative       15261000.   14194000.   14329000.   11993000.
 6 Total Operating Expenses                167890000.  155615000.  162485000.  130292000.
 7 Operating Income or Loss                 61344000.   60024000.   71230000.   52503000.
 8 Total Other Income/Expenses Net           2745000.    1348000.    1285000.     980000.
 9 Earnings Before Interest and Taxes       61344000.   60024000.   71230000.   52503000.
10 Income Before Tax                        64089000.   61372000.   72515000.   53483000.
11 Income Tax Expense                       15738000.   15685000.   19121000.   13973000.
12 Net Income From Continuing Ops           48351000.   45687000.   53394000.   39510000.
13 Net Income                               48351000.   45687000.   53394000.   39510000.
14 Net Income Applicable To Common Shares   48351000.   45687000.   53394000.   39510000.
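Since funs() has been deprecated as of dplyr 0.8.0, on a recent tidyverse the same cleanup can probably be written with across() instead. What follows is only a sketch under a few assumptions, not the code that produced the output above: it assumes dplyr 1.0.0 or later, `raw` is a small hand-made tibble standing in for the scraped table (values copied from the output above), and `clean_financials()` is a hypothetical helper name. It also differs in one respect: only a cell that is exactly "-" is treated as missing, so a value with a leading minus sign would not be turned into NA.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hand-made stand-in for the scraped table (assumption: same layout as above)
raw <- tibble(
  Revenue     = c("Total Revenue", "Operating Expenses", "Non Recurring", "Gross Profit"),
  `9/30/2017` = c("229,234,000", "Operating Expenses", "-", "88,186,000"),
  `9/24/2016` = c("215,639,000", "Operating Expenses", "-", "84,263,000")
)

# Hypothetical helper: clean every column except the label column
clean_financials <- function(df, label_col = "Revenue") {
  df %>%
    # drop thousands separators
    mutate(across(-all_of(label_col), ~ str_remove_all(.x, ","))) %>%
    # a lone "-" marks a missing value
    mutate(across(-all_of(label_col), ~ na_if(.x, "-"))) %>%
    # section-header text cannot be parsed and becomes NA
    mutate(across(-all_of(label_col), ~ suppressWarnings(as.numeric(.x)))) %>%
    # keep only rows where every value column parsed
    drop_na()
}

cleaned <- clean_financials(raw)
print(cleaned)
```

Here the "Operating Expenses" header row and the all-dash "Non Recurring" row are dropped, leaving the two numeric rows.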
We can go a step further and make the data frame "tidier" using gather:
finances %>%
  mutate_all(funs(str_replace_all(., ",", ""))) %>%
  mutate_all(funs(str_replace(., "-", NA_character_))) %>%
  mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>%
  mutate_at(vars(-Revenue), funs(as.numeric)) %>%
  drop_na() %>%
  # wide to long: one row per line item and date
  gather(key = "date", value, -Revenue) %>%
  # parse the month/day/year column names into Date values
  mutate(date = lubridate::mdy(date)) %>%
  rename("var" = Revenue) %>%
  as_tibble()
# A tibble: 56 x 3
   var                                date            value
   <chr>                              <date>          <dbl>
 1 Total Revenue                      2017-09-30 229234000.
 2 Cost of Revenue                    2017-09-30 141048000.
 3 Gross Profit                       2017-09-30  88186000.
 4 Research Development               2017-09-30  11581000.
 5 Selling General and Administrative 2017-09-30  15261000.
 6 Total Operating Expenses           2017-09-30 167890000.
 7 Operating Income or Loss           2017-09-30  61344000.
 8 Total Other Income/Expenses Net    2017-09-30   2745000.
 9 Earnings Before Interest and Taxes 2017-09-30  61344000.
10 Income Before Tax                  2017-09-30  64089000.
# ... with 46 more rows
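In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), so the reshaping step could also be sketched as follows. This is an illustration under assumptions rather than the code that produced the output above: `wide` is a small hand-made tibble (values copied from the cleaned output) standing in for the cleaned scrape.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hand-made stand-in for the cleaned, wide data frame (assumption)
wide <- tibble(
  Revenue     = c("Total Revenue", "Gross Profit"),
  `9/30/2017` = c(229234000, 88186000),
  `9/24/2016` = c(215639000, 84263000)
)

long <- wide %>%
  # wide to long: one row per line item and date
  pivot_longer(-Revenue, names_to = "date", values_to = "value") %>%
  # parse the month/day/year column names into Date values
  mutate(date = mdy(date)) %>%
  rename(var = Revenue)

print(long)
```

One difference to be aware of: pivot_longer() keeps the rows of each line item together, whereas gather() stacks the output one date column at a time, so the row order differs from the output above. An arrange(date) at the end would restore the date-first ordering if that matters.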