Gevent pool with nested web requests

Dom*_*ane 4 python pool gevent web

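I am trying to set up a pool that allows at most 10 concurrent downloads. The function should download a base URL, then parse all the URLs on that page and download each of them, but the total number of concurrent downloads must never exceed 10.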

from gevent import monkey
monkey.patch_all()  # patch before importing requests so its sockets are cooperative

import gevent
from gevent import pool
from lxml import etree
import requests
urls = [
    'http://www.google.com', 
    'http://www.yandex.ru', 
    'http://www.python.org', 
    'http://stackoverflow.com',
    # ... another 100 urls
    ]

LINKS_ON_PAGE=[]
POOL = pool.Pool(10)

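def parse_urls(page):
    html = etree.HTML(page)
    links = []  # stays empty if the page could not be parsed
    # An lxml element's truth value reflects whether it has children, so compare with None
    if html is not None:
        links = [link for link in html.xpath("//a/@href") if 'http' in link]
    # Download each URL that appears on the main page
    for link in links:
        data = requests.get(link)
        LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))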

def get_base_urls(url):
    # Download the main URL
    data = requests.get(url)
    parse_urls(data.content)

How can I organize this so that it runs concurrently while keeping a single global pool limit across all web requests?

Bry*_*yes 5

I think the following will get you what you want. I use BeautifulSoup in my example instead of your link-stripping code.

from gevent import monkey
monkey.patch_all()  # patch before importing requests

from bs4 import BeautifulSoup
import requests
import gevent
from gevent import pool

jobs = []
links = []
p = pool.Pool(10)

urls = [
    'http://www.google.com', 
    # ... another 100 urls
]

def get_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        # extend in place; `links + ...` would build a new list and discard it
        links.extend(soup.find_all('a'))

for url in urls:
    jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
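The snippet above collects the links from each base page but does not go on to download them. Below is a minimal sketch of one way to cover the nested case as well. It deliberately swaps the pool for a `gevent.lock.BoundedSemaphore` that guards only the request itself: `Pool.spawn` blocks while the pool is full, so a greenlet that holds a pool slot and tries to spawn its children into the same pool can end up waiting forever. The names `crawl`, `fetch`, `extract_links` and `limit` are hypothetical helpers made up for this illustration, and the demo uses a fake fetch function instead of real HTTP.

```python
import gevent
from gevent.lock import BoundedSemaphore


def crawl(urls, fetch, extract_links, limit=10):
    """Fetch every base URL, then every link found on it.

    `fetch(url)` returns a page body and `extract_links(body)` returns a list
    of URLs; both are hypothetical callables supplied by the caller.
    At most `limit` calls to `fetch` are in flight at any moment.
    """
    slots = BoundedSemaphore(limit)  # one global limit shared by all requests
    results = []

    def limited_fetch(url):
        # The slot is held only for the duration of the download itself.
        with slots:
            return fetch(url)

    def fetch_link(link):
        body = limited_fetch(link)
        results.append((link, len(body)))

    def fetch_base(url):
        body = limited_fetch(url)
        # The slot is already released here, so spawning children cannot deadlock.
        children = [gevent.spawn(fetch_link, link) for link in extract_links(body)]
        gevent.joinall(children, raise_error=True)

    gevent.joinall([gevent.spawn(fetch_base, u) for u in urls], raise_error=True)
    return results


if __name__ == "__main__":
    # Stand-in for a real download: yields to other greenlets like network I/O would.
    def fake_fetch(url):
        gevent.sleep(0.01)
        # A base page "contains" two links; a linked page contains none.
        return "" if "/" in url else "%s/a %s/b" % (url, url)

    for link, size in crawl(["site1", "site2"], fake_fetch, str.split, limit=2):
        print(link, size)
```

With real pages, `fetch` could be something like `lambda u: requests.get(u).content`, provided `monkey.patch_all()` has run before `requests` is imported; otherwise the requests block and no concurrency happens at all. The number of greenlets is unbounded in this sketch, only the number of simultaneous downloads is capped.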