pem*_*ahl 4 python callback web-scraping python-requests
我最近看了一下python-requests模块,我想用它编写一个简单的Web爬虫.给定一个开始URL的集合,我想编写一个Python函数,搜索其他URL的起始URL的网页内容,然后再次调用相同的函数作为回调,新的url作为输入,依此类推.起初,我认为事件挂钩将是用于此目的的正确工具,但其文档部分非常稀疏.在另一个页面上,我读到用于事件挂钩的函数必须返回传递给它们的同一个对象.因此事件挂钩显然不适用于此类任务.或者我只是没有把它弄好......
这是我想要做的一些伪代码(借用伪Scrapy蜘蛛):
import lxml.html
def parse(response):
for url in lxml.html.parse(response.url).xpath('//@href'):
return Request(url=url, callback=parse)
Run Code Online (Sandbox Code Playgroud)
有人能告诉我如何使用python请求进行操作吗?事件挂钩是正确的工具还是我需要不同的东西?(注意:由于各种原因,Scrapy不适合我.)非常感谢!
我将如何做到这一点:
import grequests
from bs4 import BeautifulSoup
def get_urls_from_response(r):
soup = BeautifulSoup(r.text)
urls = [link.get('href') for link in soup.find_all('a')]
return urls
def print_url(args):
print args['url']
def recursive_urls(urls):
"""
Given a list of starting urls, recursively finds all descendant urls
recursively
"""
if len(urls) == 0:
return
rs = [grequests.get(url, hooks=dict(args=print_url)) for url in urls]
responses = grequests.map(rs)
url_lists = [get_urls_from_response(response) for response in responses]
urls = sum(url_lists, []) # flatten list of lists into a list
recursive_urls(urls)
Run Code Online (Sandbox Code Playgroud)
我没有测试过代码,但总体思路就在那里.
请注意,我使用的是grequests代替requests性能提升.grequest基本上gevent+request,根据我的经验,这种任务要快得多,因为你检索异步的链接gevent.
编辑:这里是不使用递归的相同算法:
import grequests
from bs4 import BeautifulSoup
def get_urls_from_response(r):
soup = BeautifulSoup(r.text)
urls = [link.get('href') for link in soup.find_all('a')]
return urls
def print_url(args):
print args['url']
def recursive_urls(urls):
"""
Given a list of starting urls, recursively finds all descendant urls
recursively
"""
while True:
if len(urls) == 0:
break
rs = [grequests.get(url, hooks=dict(args=print_url)) for url in urls]
responses = grequests.map(rs)
url_lists = [get_urls_from_response(response) for response in responses]
urls = sum(url_lists, []) # flatten list of lists into a list
if __name__ == "__main__":
recursive_urls(["INITIAL_URLS"])
Run Code Online (Sandbox Code Playgroud)