How to write a web crawler with callback functions using python-requests and event hooks?

pem*_*ahl 4 python callback web-scraping python-requests

I recently took a look at the python-requests module and I would like to write a simple web crawler with it. Given a collection of start URLs, I want to write a Python function that searches the page content of the start URLs for other URLs and then calls the same function again as a callback, with the new URLs as input, and so on. At first I thought event hooks would be the right tool for this purpose, but their documentation is quite sparse. On another page I read that functions used for event hooks have to return the same object that was passed to them. So event hooks are apparently not feasible for this kind of task. Or maybe I simply didn't get it right...
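For reference, in current versions of requests the only event hook is 'response': the hook function receives the finished Response object (plus keyword arguments) and may return a replacement response, or None to leave it unchanged. That is why hooks lend themselves to inspecting or patching responses rather than to spawning follow-up requests. A minimal sketch of the hook API (example.com is a placeholder URL):

import requests

def log_response(response, *args, **kwargs):
    # Called after the response arrives; returning None keeps the
    # response as-is, returning a Response object would replace it.
    print(response.status_code, response.url)

# Placeholder URL, purely for illustration.
requests.get("http://example.com", hooks={"response": log_response})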

Here is some pseudo-code of what I want to do (borrowed from a pseudo Scrapy spider):

import lxml.html

def parse(response):
    # Pseudo-code: for every link on the page, issue a new request with
    # this same function as its callback. (yield, not return: returning
    # inside the loop would stop after the first URL.)
    for url in lxml.html.parse(response.url).xpath('//@href'):
        yield Request(url=url, callback=parse)

Can anyone tell me how to do this with python-requests? Are event hooks the right tool, or do I need something different? (Note: Scrapy is not an option for me for various reasons.) Thanks a lot!

K Z*_*K Z 7

Here is how I would do it:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Parse the page and collect every href; skip anchors without one.
    soup = BeautifulSoup(r.text, 'html.parser')
    return [link.get('href') for link in soup.find_all('a') if link.get('href')]


def print_url(r, *args, **kwargs):
    # 'response' is the only hook current requests supports; it receives
    # the finished Response object.
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, recursively finds all descendant urls.
    """
    if len(urls) == 0:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = grequests.map(rs)  # failed requests come back as None
    url_lists = [get_urls_from_response(r) for r in responses if r is not None]
    urls = sum(url_lists, [])  # flatten list of lists into a list
    recursive_urls(urls)

I haven't tested the code, but the general idea is there.
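One caveat worth flagging: as written, the crawl never terminates on pages that link back to each other, and relative hrefs are passed to grequests.get unresolved. A minimal sketch of a guard, reusing the print_url and get_urls_from_response helpers from the block above and urllib.parse.urljoin to absolutize links (recursive_urls_deduped is a hypothetical name, not from the original answer):

import grequests
from urllib.parse import urljoin

seen = set()  # module-level for brevity; passing it as a parameter is cleaner

def recursive_urls_deduped(urls):
    new = [u for u in urls if u not in seen]  # drop already-visited urls
    seen.update(new)
    if not new:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in new]
    responses = grequests.map(rs)
    urls = [urljoin(r.url, href)  # resolve relative hrefs against the page url
            for r in responses if r is not None
            for href in get_urls_from_response(r)]
    recursive_urls_deduped(urls)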

Note that I use grequests instead of requests for the performance gain. grequests is basically gevent plus requests, and in my experience it is much faster for this kind of task because you retrieve the links asynchronously with gevent.
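As a side note on tuning (not part of the original answer): grequests.map accepts a size argument that caps the number of concurrent gevent greenlets, and an exception_handler callback for requests that fail outright. A minimal sketch with placeholder URLs:

import grequests

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs

def on_error(request, exception):
    # Called for requests that raise instead of returning a response.
    print("failed:", request.url, exception)

# size=10 caps the number of requests in flight at once.
for r in grequests.map((grequests.get(u) for u in urls),
                       size=10, exception_handler=on_error):
    if r is not None:
        print(r.status_code, r.url)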


Edit: here is the same algorithm without using recursion:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Parse the page and collect every href; skip anchors without one.
    soup = BeautifulSoup(r.text, 'html.parser')
    return [link.get('href') for link in soup.find_all('a') if link.get('href')]


def print_url(r, *args, **kwargs):
    # Standard 'response' hook: receives the finished Response object.
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, iteratively finds all descendant urls.
    """
    while urls:
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        responses = grequests.map(rs)  # failed requests come back as None
        url_lists = [get_urls_from_response(r) for r in responses if r is not None]
        urls = sum(url_lists, [])  # flatten list of lists into a list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])

  • @PeterStahl No problem, I added the same code without using recursion. I haven't dug into how to use grequests "workers"; I guess that's a different question. (2 upvotes)
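For what it's worth, the "workers" mentioned in the comment presumably correspond to the size of grequests' underlying gevent pool; grequests.imap yields responses as they complete instead of waiting for the whole batch. A minimal sketch with placeholder URLs:

import grequests

urls = ["http://example.com/page%d" % i for i in range(20)]  # placeholders
# imap streams responses in completion order; size caps the pool of
# concurrent greenlets, i.e. the number of "workers".
for r in grequests.imap((grequests.get(u) for u in urls), size=5):
    print(r.status_code, r.url)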