Downloading .xls files from a webpage using Python and BeautifulSoup

Question


Anu*_*hit 5 python beautifulsoup web-scraping

I want to download all the .xls, .xlsx, or .csv files from this website into a specified folder.

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009


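I have looked into mechanize, Beautiful Soup, urllib2, and others. Mechanize does not work in Python 3, and urllib2 also has problems with Python 3; I looked for workarounds but could not find any. So I am currently trying to get this working with Beautiful Soup.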

I found some example code and tried to modify it to fit my problem, as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')


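However, when run, this code does not fetch the files from the target page, nor does it print any failure message (such as "failed to download").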

How can I use BeautifulSoup to select the Excel files from the page?
How can I download these files to local files using Python?

Answer 1

mfi*_*tzp 5

The problems with your script as it stands are:

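The url has a trailing /, which yields an invalid page when requested, one that does not list the files you want to download.
The CSS selector in soup.select(...) selects div elements with a webpartid attribute, which does not exist anywhere in the linked document.
You are joining the URL and quoting it, even though the links are given as absolute URLs in the page and do not need quoting.
The try:...except: block stops you from seeing the errors generated when trying to download a file. Using an except block without a specific exception is bad practice and should be avoided.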
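To illustrate the last point, here is a minimal sketch of a download wrapper that catches only the exceptions urllib is documented to raise and reports what went wrong. The names try_download and retrieve are hypothetical and not part of the original script; the retrieve parameter exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve


def try_download(url, filename, retrieve=urlretrieve):
    """Download url to filename and report specific failures.

    `retrieve` is a hypothetical hook (defaulting to urlretrieve) that
    allows the function to be tested offline.
    """
    try:
        retrieve(url, filename)
    except HTTPError as exc:
        # The server responded, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error %d (%s) for %s" % (exc.code, exc.reason, url))
        return False
    except URLError as exc:
        # No usable response at all (DNS failure, refused connection, ...).
        print("Could not reach %s: %s" % (url, exc.reason))
        return False
    return True
```

Any other exception, such as a programming error inside the loop, still propagates with a full traceback instead of being swallowed.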

A modified version of your code that gets the correct files and attempts to download them is as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

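The selector and extension filter can be checked offline against a small HTML snippet. The sketch below assumes made-up markup (SAMPLE_HTML and the example.com links are hypothetical, not taken from the real page) and compares the selector from the question with the one used in this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div id="content">
  <a href="http://example.com/files/report.xls">Report</a>
  <a href="http://example.com/files/data.csv">Data</a>
  <a href="http://example.com/about.html">About</a>
  <a href="javascript:void(0)">Print</a>
  <a href="/Scripts/relative.xlsx">Relative link</a>
</div>
"""

EXTENSIONS = ('.csv', '.xls', '.xlsx')

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No div carries a webpartid attribute, so this selects nothing and a
# loop over the result never runs (hence no "failed to download" output).
old_matches = soup.select('div[webpartid] a')
print(len(old_matches))

# a[href^="http://"] keeps only anchors whose href starts with http://
wanted = [
    a['href']
    for a in soup.select('a[href^="http://"]')
    if a['href'].endswith(EXTENSIONS)  # str.endswith accepts a tuple
]
print(wanted)
```

Note that this selector also skips relative links and links that already use https://, which is acceptable only if the page lists its files as absolute http:// URLs.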

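However, if you run this, you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file can be downloaded in a browser. At first I thought this was a referrer check (to prevent hotlinking), but if you look at the request in a browser (e.g. Chrome Developer Tools) you will notice that the initial http:// request is blocked there too, and Chrome then attempts an https:// request for the same file.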

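In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To work around this, you need to rewrite http: to https: before using the URL for the request. The following code correctly modifies the URLs and downloads the files. I have also added a variable to specify the output folder, which is added to the filename using os.path.join: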

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

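The URL handling in the script can also be factored into small helpers that are easy to test without touching the network. This is only a sketch under stated assumptions: force_https, local_path, and is_spreadsheet are hypothetical names, and the helpers use urllib.parse instead of the str.replace and str.endswith calls in the script, so that only the scheme is rewritten and a query string does not defeat the extension check.

```python
import os
from urllib.parse import urlsplit, urlunsplit

EXTENSIONS = ('.csv', '.xls', '.xlsx')


def force_https(url):
    """Return url with an http scheme replaced by https; others unchanged."""
    parts = urlsplit(url)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)


def local_path(url, output_dir):
    """Build the destination path from the last segment of the URL path."""
    name = urlsplit(url).path.rsplit('/', 1)[-1]
    return os.path.join(output_dir, name)


def is_spreadsheet(url):
    """True if the URL path ends with a wanted extension (case-insensitive)."""
    return urlsplit(url).path.lower().endswith(EXTENSIONS)


# Example with a made-up link:
link = 'http://example.com/docs/table1.xls'
if is_spreadsheet(link):
    print(force_https(link), '->', local_path(link, 'downloads'))
```

If the output folder might not exist yet, it has to be created before downloading, for example with os.makedirs(output_dir, exist_ok=True) when output_dir is a non-empty path.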

Archived: 9 years, 10 months ago
Views: 8,049
Last activity: 7 years, 9 months ago