Moh*_*nka 40 python screen-scraping
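I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All of the images are part of the HTML page.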
Rya*_*rom 81
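Here is some code that downloads all the images from the supplied URL and saves them in the specified output folder. You can modify it to suit your own needs.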
"""
dumpimages.py
    Downloads all the images on the supplied URL and saves them to the
    specified output folder ("/test/" by default).

Usage:
    python dumpimages.py http://example.com/ [output]
"""
import os
import sys
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup as bs


def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to out_folder."""
    soup = bs(urlopen(url), "html.parser")
    parsed = list(urlparse(url))

    for image in soup.find_all("img"):
        print("Image: %s" % image["src"])
        filename = image["src"].split("/")[-1]
        # Swap the page's path component for the image's src.
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            # Absolute URL: fetch it directly.
            urlretrieve(image["src"], outpath)
        else:
            # Otherwise rebuild a URL from the page's scheme and host.
            urlretrieve(urlunparse(parsed), outpath)


def _usage():
    print("usage: python dumpimages.py http://example.com [outpath]")


if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        # The last argument is not a URL, so treat it as the output folder.
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)
Edit: You can now specify the output folder.
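Two things are easy to trip over once the output folder is configurable: the folder may not exist yet, and a src such as logo.png?v=2 yields a filename that still carries the query string. The following is a minimal sketch of one way to handle both, not part of the answer above. The helper name local_path_for and the fallback name "image" are assumptions made for illustration.

```python
import os
import posixpath
import tempfile
from urllib.parse import unquote, urlparse


def local_path_for(image_url, out_folder, default_name="image"):
    """Return the local path an image URL should be saved to (hypothetical helper)."""
    # Create the output folder if it is missing; do nothing if it already exists.
    os.makedirs(out_folder, exist_ok=True)
    # Use only the path component, so query strings and fragments are dropped.
    name = posixpath.basename(unquote(urlparse(image_url).path))
    # A URL ending in "/" has no basename, so fall back to a default name.
    return os.path.join(out_folder, name or default_name)


# Usage: a throwaway directory stands in for the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "images")
    print(local_path_for("http://example.com/a/logo.png?v=2", out))
    print(local_path_for("http://example.com/", out))
```

The returned path can be passed to urlretrieve as its second argument in place of outpath.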
Cat*_*lin 12
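Ryan's solution is good, but it fails if the image source URLs are absolute URLs, or anything else that does not give a good result when simply concatenated to the main page URL. urljoin recognizes absolute and relative URLs, so replace the loop in the middle with: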
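# Requires: from urllib.parse import urljoin
for image in soup.find_all("img"):
    print("Image: %s" % image["src"])
    # urljoin leaves absolute URLs alone and resolves relative ones against url.
    image_url = urljoin(url, image["src"])
    filename = image["src"].split("/")[-1]
    outpath = os.path.join(out_folder, filename)
    urlretrieve(image_url, outpath)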
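To see how urljoin treats the different kinds of src values, here is a small sketch that needs no network access. The wrapper name resolve_image_url and the example.com and example.org URLs are placeholders chosen for illustration.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


page = "http://example.com/gallery/index.html"
samples = [
    "http://cdn.example.org/a.png",  # absolute URL: returned unchanged
    "/static/b.png",                 # root-relative: keeps scheme and host only
    "thumbs/c.png",                  # relative: resolved against the page's directory
    "//cdn.example.org/d.png",       # scheme-relative: inherits the page's scheme
]
for src in samples:
    print(src, "->", resolve_image_url(page, src))
```

The third case is the one that a plain replacement of the path component gets wrong, because the directory of the page (/gallery/ here) has to be kept.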
Here is a function that downloads a single image:
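import urllib.request

# DOWNLOADED_IMAGE_PATH is expected to be defined elsewhere as the target
# directory prefix; this function is a method of a larger class (hence self).
def download_photo(self, img_url, filename):
    file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)
    # Both handles are closed automatically, even if a read or write fails.
    with urllib.request.urlopen(img_url) as image_on_web, \
            open(file_path, "wb") as downloaded_image:
        # Read in 64 KiB chunks so a large image is never held fully in memory.
        while True:
            buf = image_on_web.read(65536)
            if len(buf) == 0:
                break
            downloaded_image.write(buf)
    return file_path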
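The chunked read loop in that function does not depend on where the bytes come from, so it can be tried out without a network connection. The sketch below isolates the loop in a hypothetical helper, copy_in_chunks, and feeds it in-memory streams. The helper name and the fake payload are assumptions for illustration only.

```python
import io


def copy_in_chunks(source, destination, chunk_size=65536):
    """Copy a readable binary stream to a writable one; return the bytes copied."""
    total = 0
    while True:
        buf = source.read(chunk_size)
        if not buf:  # an empty bytes object signals end of stream
            break
        destination.write(buf)
        total += len(buf)
    return total


# Usage: in-memory streams stand in for the HTTP response and the output file.
payload = b"\x89PNG" + bytes(200_000)  # fake image data, larger than one chunk
src = io.BytesIO(payload)
dst = io.BytesIO()
copied = copy_in_chunks(src, dst)
print(copied, dst.getvalue() == payload)
```

The standard library's shutil.copyfileobj performs the same kind of buffered copy, so it is a reasonable replacement when the byte count is not needed.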