Scraping a Yahoo Finance income statement with Python

Question

Joh*_*alt 5 html python beautifulsoup yahoo-finance

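I am trying to scrape data from the income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent Net Income figure for Apple.

The data is structured as a bunch of nested HTML tables. I am using the requests module to access the page and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis in Firefox.

My code so far: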

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

and then taking the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this:

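all_strong[16].parent.next_sibling ...

Of course, the goal is to use BeautifulSoup to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.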

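Solution / extension:

The solution by @wilbur below worked, and I expanded on it to get the values of any figure available on any of the financials pages (i.e. Income Statement, Balance Sheet and Cash Flow Statement) for any listed company. My function is as follows: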

import re
import sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)  # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)  # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]  # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:  # needed because some figures are indented
            # strip thousands separators; "(123)" means a negative number
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":  # a lone dash means "no value"
                str_value = "0"
            value = int(str_value) * 1000  # figures are reported in thousands
            values.append(value)

    return values

The yahoo_figure variable is a string. Obviously, it has to be exactly the same figure name as the one used on Yahoo Finance. To pass the soup variable, I first use the following function:

import sys
import requests
from bs4 import BeautifulSoup

def financials_soup(ticker_symbol, statement="is", quarterly=False):
    # "is" = income statement, "bs" = balance sheet, "cf" = cash flow statement
    if statement in ("is", "bs", "cf"):
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")
    sys.exit("Invalid financial statement code '" + statement + "' passed.")

Example usage: I want to get the Income Tax Expense of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))
Output: [19121000000, 13973000000, 13118000000]

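You could also get the period end dates from the soup and build a dictionary with the dates as keys and the figures as values, but that would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.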
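The date-keyed dictionary mentioned above could look roughly like this. It is only a sketch: it assumes the period end dates have already been extracted from the soup into a list (that step is not shown in this post), and both `figures_by_period` and the placeholder period labels are made-up names for illustration. The values are the Income Tax Expense output shown earlier.

```python
def figures_by_period(period_labels, values):
    """Pair each reporting period with its figure, keeping column order."""
    if len(period_labels) != len(values):
        raise ValueError("expected exactly one value per period")
    return dict(zip(period_labels, values))


# Placeholder labels (hypothetical); real code would use the dates from the page.
periods = ["period 1", "period 2", "period 3"]
income_tax = figures_by_period(periods, [19121000000, 13973000000, 13118000000])
print(income_tax["period 1"])  # 19121000000
```

Raising on a length mismatch is a deliberate choice here: a silent `zip` would drop columns if the header row and the figure row ever got out of step.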

Answer 1

wpe*_*rcy 4

This is made a little trickier by the fact that "Net Income" is wrapped in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

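values, in this case, will contain the three table cells in the "Net Income" row (which, I might add, can easily be converted to ints; I just liked keeping them as strings with the "," in place).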

In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
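If integers are needed, the conversion could be sketched as below. `to_int` is a hypothetical helper, not part of the answer's code; it assumes the cell conventions handled in the question's extension (commas as thousands separators, parentheses for negative numbers, a lone dash for "no value"), and the optional `scale` argument mirrors the question's multiplication by 1000.

```python
def to_int(cell_text, scale=1):
    """Convert a cell such as '53,394,000', '(1,234)' or '-' to an int."""
    cleaned = cell_text.strip().replace(",", "")
    if cleaned == "-":  # a lone dash is treated as zero
        return 0
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:  # accounting notation for a negative number
        cleaned = cleaned[1:-1]
    number = int(cleaned) * scale
    return -number if negative else number


values = ['53,394,000', '39,510,000', '37,037,000']
print([to_int(v) for v in values])  # [53394000, 39510000, 37037000]
```

Anything that is not a plain number after cleaning (for example an empty cell) still raises `ValueError`, which seems safer than guessing a value.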

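When I tested it on Alphabet (GOOG), it didn't work, I believe because no income statement is displayed for them ( https://finance.yahoo.com/q/is?s=GOOG&annual ), but when I checked Facebook (FB), the values were returned correctly ( https://finance.yahoo.com/q/is?s=FB&annual ).

If you want to make the script more dynamic, you can use string formatting to build the URL for whatever ticker symbol you want, like this: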

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
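The same idea can be pushed a little further to cover the other statement codes that appear in the question ("is", "bs", "cf"). The sketch below only builds the URL and does no network access, so it can be checked offline; `statement_url` and `STATEMENT_CODES` are illustrative names, and raising `ValueError` for an unknown code is one possible design choice, not something taken from the original code.

```python
STATEMENT_CODES = ("is", "bs", "cf")  # income statement, balance sheet, cash flow


def statement_url(ticker_symbol, statement="is", quarterly=False):
    """Build the financials URL in the format used throughout this thread."""
    if statement not in STATEMENT_CODES:
        raise ValueError("Invalid financial statement code {!r}".format(statement))
    url = "https://finance.yahoo.com/q/{}?s={}".format(statement, ticker_symbol)
    if not quarterly:
        url += "&annual"  # annual figures are requested with a bare flag
    return url


print(statement_url("FB"))  # https://finance.yahoo.com/q/is?s=FB&annual
```

Keeping URL construction separate from the `requests.get` call makes the string logic easy to test without hitting the site.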

Archived:	9 years, 11 months ago
Views:	6761
Last recorded:	9 years, 10 months ago