How to retrieve a web page, including any images, in Python

Question

I'm trying to retrieve the source of a web page, including any images. At the moment I have this:

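import urllib  # Python 2; in Python 3 use urllib.request.urlretrieve

# urlretrieve saves the page to 'urlgot.php' and returns (filename, headers)
filename, headers = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print(filename)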

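It retrieves the source fine, but I also need to download any linked images.

I thought I could write a regular expression that searches the downloaded source for img src or similar; however, I was wondering whether there is a urllib function that would retrieve the images as well, similar to the wget command: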

wget -r --no-parent http://127.0.0.1/myurl.php

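I don't want to use the os module and run wget, because I want the script to run on all systems. For the same reason, I can't use any third-party modules either.

Any help is much appreciated! Thanks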

Answer 1

Gri*_*ave 7

Don't use regular expressions when Python has a perfectly good parser built in:

from urllib.request import urlretrieve  # Py2: from urllib
from html.parser import HTMLParser      # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.downloads.append(value)

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the text of the fetched page directly
    parser.feed(f.read())

# Assumes every src is a simple path relative to base_url; each image is
# saved under that same relative path, so any directories must already exist.
for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)
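
The comment in the code above mentions feeding the fetched page to the parser directly instead of reading a saved file. A minimal sketch of that idea follows, using only the standard library (Python 3). The names `ImgSrcParser`, `image_urls`, `local_name` and `download_page_with_images` are hypothetical helpers, not part of the original answer. The sketch also swaps the plain `base_url + path` concatenation for `urllib.parse.urljoin`, so that relative, root-relative and absolute `src` values all resolve against the page URL. It assumes the page is UTF-8 and saves every image under a flat filename in the current directory.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


def local_name(url):
    """Pick a flat local filename from the URL path (directories are dropped)."""
    return os.path.basename(urlparse(url).path) or 'index'


def download_page_with_images(page_url, page_file):
    """Save the page source to page_file and each linked image beside it."""
    with urlopen(page_url) as response:
        raw = response.read()
    with open(page_file, 'wb') as f:
        f.write(raw)
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    html = raw.decode('utf-8', errors='replace')
    for url in image_urls(html, page_url):
        urlretrieve(url, local_name(url))
```

Calling `download_page_with_images('http://127.0.0.1/myurl.php', 'urlgot.php')` would then save the page and its images. Unlike `wget -r`, this sketch only handles `<img src=...>` tags on a single page: it does not follow links or pick up images referenced from CSS, and two images with the same filename in different directories would overwrite each other.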

Asked:	14 years, 3 months ago
Viewed:	2987 times
Last active:	7 years, 9 months ago