byt*_*776 1 python beautifulsoup web-scraping http-status-code-404
I am trying to scrape data from Yahoo Finance at this URL: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL. After running the Python code below, I get the following HTML response:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockStatDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
<!DOCTYPE html>
<html lang="en-us"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<title>Yahoo</title>
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<style>
html {
height: 100%;
}
body {
background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;
background-size: cover;
height: 100%;
text-align: center;
font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;
}
table {
height: 100%;
width: 100%;
table-layout: fixed;
border-collapse: collapse;
border-spacing: 0;
border: none;
}
h1 {
font-size: 42px;
font-weight: 400;
color: #400090;
}
p {
color: #1A1A1A;
}
#message-1 {
font-weight: bold;
margin: 0;
}
#message-2 {
display: inline-block;
*display: inline;
zoom: 1;
max-width: 17em;
_width: 17em;
}
</style>
<script>
document.write('<img src="//geo.yahoo.com/b?s=1197757129&t='+new Date().getTime()+'&src=aws&err_url='+encodeURIComponent(document.URL)+'&err=%<pssc>&test='+encodeURIComponent('%<{Bucket}cqh[:200]>')+'" width="0px" height="0px"/>');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent('%<{Bucket}cqh[:200]>');
</script>
</head>
<body>
<!-- status code : 404 -->
<!-- Not Found on Server -->
<table>
<tbody><tr>
<td>
<img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">
<h1 style="margin-top:20px;">Will be right back...</h1>
<p id="message-1">Thank you for your patience.</p>
<p id="message-2">Our engineers are working quickly to resolve the issue.</p>
</td>
</tr>
</tbody></table>
</body></html>
I'm confused, because I have no problem scraping the data on the Summary tab at this URL, https://finance.yahoo.com/quote/AAPL?p=AAPL, using the following code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
soup = BeautifulSoup(page.content, 'html.parser')
stock_data = soup.find_all('table')
stock_data
for table in stock_data:
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        if len(tds) > 0:
            stockDict[tds[0].get_text()] = [tds[1].get_text()]
stock_sum_df = pd.DataFrame(data=stockDict)
print(stock_sum_df.head())
print(stock_sum_df.info())
Does anyone know what I am doing wrong? In case it makes a difference, I am also using the free version of Yahoo Finance.
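So I figured out your problem.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating the User-Agent header on the server side is a common operation, so be sure to use a valid browser's User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is set a legitimate user agent. So add headers to emulate a browser: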
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
stockSymbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
resp = requests.get(url, headers=headers, timeout=5).text
print(resp)
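Because the blocked response is still an HTML page, printing it does not make the failure obvious. The sketch below is one way to flag it before parsing. The helper name `looks_like_error_page` is hypothetical, and the two marker strings are simply copied from the error page shown in the question, so treat the text check as a heuristic rather than anything Yahoo documents.

```python
def looks_like_error_page(status_code, html_text):
    """Heuristically decide whether a response is the placeholder error page.

    Hypothetical helper: the markers below are taken from the HTML posted in
    the question and may change whenever Yahoo changes that page.
    """
    # Any non-200 status is treated as a failure outright.
    if status_code != 200:
        return True
    # Otherwise look for text that only appeared on the error page.
    markers = ("Will be right back", "<!-- status code : 404 -->")
    return any(marker in html_text for marker in markers)
```

To use it, keep the response object instead of calling `.text` right away, for example `page = requests.get(url, headers=headers, timeout=5)`, and then test `looks_like_error_page(page.status_code, page.text)` before handing `page.text` to BeautifulSoup. Calling `page.raise_for_status()` is a stricter alternative when only the status code matters.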
Additionally, you can send a fuller set of headers to pass as a legitimate browser. Add a few more headers like this:
headers = {
    'User-Agent'      : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept'          : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'DNT'             : '1',  # Do Not Track request header
    'Connection'      : 'close'
}
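These things are usually caused by two main reasons:
So, when designing an automated system, it is always a good idea to supply a User-Agent in the request headers.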
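If the scraper makes more than one request, one way to avoid repeating the headers on every call is to attach them to a `requests.Session` once. This is a minimal sketch, not the only way to do it: `BROWSER_HEADERS`, `make_session` and `stats_url` are names made up for illustration, and the header values are the ones from the answer above.

```python
import requests

# Header values copied from the answer above; the constant name is illustrative.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}


def make_session(headers=BROWSER_HEADERS):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)  # merged into the session's default headers
    return session


def stats_url(symbol):
    """Build the key-statistics URL used in the question for a ticker symbol."""
    return 'https://finance.yahoo.com/quote/' + symbol + '/key-statistics?p=' + symbol
```

A call such as `make_session().get(stats_url('AAPL'), timeout=5)` then carries the browser-like headers automatically. Whether the site accepts the request still depends on its server-side checks at the time.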