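I need to collect a lot (really a lot) of data for statistics. All the necessary information is inside a <script type="application/ld+json"></script> element,
and I wrote a Scrapy parser for it (the script in the HTML), but the parsing is very slow (about 3 pages per second). Is there any way to speed up the process? Ideally I would like to see 10+ pages per second.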
#spider.py:
import json

import scrapy


class Spider(scrapy.Spider):
    name = 'scrape'
    start_urls = [
        # about 10000 urls
    ]

    def parse(self, response):
        # All required fields live in the page's JSON-LD <script> block.
        data = json.loads(response.css('script[type="application/ld+json"]::text').extract_first())
        name = data['name']
        image = data['image']
        # Breadcrumb-style names taken from the itemprop="name" spans.
        path = response.css('span[itemprop="name"]::text').extract()
        yield {
            'name': name,
            'image': image,
            'path': path,
        }
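One fragile spot in `parse` is that `extract_first()` returns `None` when a page has no JSON-LD script, and `json.loads(None)` then raises a `TypeError`. Below is a minimal sketch of a defensive decoder. The helper name `load_ld_json` is hypothetical (it is not part of Scrapy), and it assumes the JSON-LD payload is a single JSON object rather than a list.

```python
import json


def load_ld_json(raw):
    """Decode a JSON-LD string; return None if it is missing or invalid.

    `raw` is expected to be the value returned by
    response.css('script[type="application/ld+json"]::text').extract_first().
    """
    if not raw:
        # No matching <script> element on the page (or it was empty).
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON: skip the page instead of crashing the spider.
        return None
    # Assumption: the payload is a single object, not a list of objects.
    return data if isinstance(data, dict) else None
```

Inside `parse`, this could be used as `data = load_ld_json(...)` followed by an early `return` when the result is `None`.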
#settings.py:
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.33
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': …
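A note on the numbers in the settings above: `DOWNLOAD_DELAY` is the wait between consecutive requests to the same site, so if all the URLs are on one domain (an assumption, since the URL list is not shown), `DOWNLOAD_DELAY = 0.33` alone would cap throughput near 3 pages per second regardless of `CONCURRENT_REQUESTS`. A rough back-of-the-envelope sketch follows. The function name and the 0.5 s average response time are hypothetical, and 8 is Scrapy's default `CONCURRENT_REQUESTS_PER_DOMAIN`.

```python
def estimated_pages_per_second(download_delay, avg_response_time, concurrency):
    """Rough upper bound on pages/second when crawling a single domain."""
    # With no delay, throughput is bounded by how many requests are in
    # flight at once divided by how long each one takes.
    latency_bound = concurrency / avg_response_time
    if download_delay <= 0:
        return latency_bound
    # With a delay, requests to one domain are spaced about
    # `download_delay` seconds apart on average.
    return min(1.0 / download_delay, latency_bound)


# Values from the settings above; the 0.5 s response time is assumed.
with_delay = estimated_pages_per_second(0.33, 0.5, 8)
without_delay = estimated_pages_per_second(0.0, 0.5, 8)
print(round(with_delay, 2), round(without_delay, 2))
```

Under these assumptions the delay, not the parsing, is the limiting factor. Lowering or removing `DOWNLOAD_DELAY` is the first thing to try, keeping in mind that the target site may throttle or block faster crawls.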