Downloading images with gevent

Fra*_*ter 6 python concurrency gevent greenlets

My task is to download 1M+ images from a given list of URLs. What is the recommended way to do that?

After reading Greenlet Vs. Thread I looked into gevent, but I cannot get it to run reliably. I played around with a test set of 100 URLs: sometimes it finishes in under 1.5 s, but sometimes it takes more than 30 s, which is odd because the timeout* per request is 0.1 s, so it should never take more than 10 s.

*see the code below

I also looked into grequests, but it seems to have issues with exception handling.
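For reference, the grequests pattern I mean is roughly the following (just a sketch; it assumes a grequests version whose grequests.map accepts an exception_handler callback):

import grequests

def on_error(request, exception):
    # grequests calls this for every request that raises (timeouts included)
    print 'failed:', request.url, exception

urls = [...]  # the list of image urls
reqs = (grequests.get(u, timeout=0.1) for u in urls)
# size bounds the number of concurrent requests;
# failed requests show up as None in the result list (on_error returns nothing)
responses = grequests.map(reqs, size=100, exception_handler=on_error)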

我的"要求"就是我能做到的

  • 检查下载时出现的错误(超时,损坏的图像......),
  • 监控已处理图像的数量和进度
  • 尽可能快.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
import cStringIO
import gevent.hub
POOL_SIZE = 300


def download_image_wrapper(task):
    return download_image(task[0], task[1])

def download_image(image_url, download_path):
    raw_binary_request = requests.get(image_url, timeout=0.1).content
    image = Image.open(cStringIO.StringIO(raw_binary_request))
    image.save(download_path)

def download_images_gevent_spawn(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    parameters = [[image_url, download_path] for image_url, download_path in
             zip(list_of_image_urls, download_paths)]
    tasks = [gevent.spawn(download_image_wrapper, parameter_tuple) for parameter_tuple in parameters]
    for task in tasks:
        try:
            task.get()
        except Exception:
            print 'x',
            continue
        print '.',

test_urls = [...]  # list of 100 urls

t1 = time()
download_images_gevent_spawn(test_urls, 'download_temp')
print time() - t1
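For completeness, here is a variant of the above (an untested sketch) that routes the work through gevent.pool.Pool, so that POOL_SIZE actually bounds the number of concurrent greenlets (in the code above it is defined but never used), and that counts successes and failures as the results come in:

from gevent.pool import Pool

def safe_download(task):
    # wrap download_image so a single bad url never kills the whole run
    try:
        download_image(task[0], task[1])
        return True
    except Exception:
        return False

def download_images_gevent_pool(list_of_image_urls, base_folder):
    pool = Pool(POOL_SIZE)  # at most POOL_SIZE greenlets run at once
    parameters = [(url, '/'.join([base_folder, url.split('/')[-1]]))
                  for url in list_of_image_urls]
    ok = failed = 0
    # imap_unordered yields each result as soon as its greenlet finishes
    for success in pool.imap_unordered(safe_download, parameters):
        if success:
            ok += 1
        else:
            failed += 1
        print '\r%d ok, %d failed, %d total' % (ok, failed, len(parameters)),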

小智 -1

I would suggest taking a look at Grablib: http://grablib.org/

It is an asynchronous parser based on pycurl and multicurl. It also tries to handle network errors automatically (e.g. retrying on timeouts, etc.).

I believe the Grab:Spider module will solve 99% of your problems. http://docs.grablib.org/en/latest/index.html#spider-toc
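A rough sketch of what a Grab:Spider based downloader could look like (names such as task_generator, thread_number and grab.response.body are written from memory of the Grab API and may differ between versions):

from grab.spider import Spider, Task

image_urls = [...]  # your list of 1M+ urls
download_folder = 'download_temp'

class ImageSpider(Spider):
    def task_generator(self):
        # one task per url; the name 'image' selects the handler below
        for url in image_urls:
            yield Task('image', url=url)

    def task_image(self, grab, task):
        # network errors and retries are handled by the spider itself
        path = '/'.join([download_folder, task.url.split('/')[-1]])
        with open(path, 'wb') as f:
            f.write(grab.response.body)

bot = ImageSpider(thread_number=300)
bot.run()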