I have several different spiders and want to run them all at once. Based on this and this, I can run multiple spiders in the same process. However, I don't know how to design a signal system that stops the reactor only once all of the spiders are finished.
I have tried:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
and
crawler.signals.connect(reactor.stop, signal=signals.spider_idle)
In both cases the reactor stops as soon as the first crawler closes. Of course, I want the reactor to stop only after all the spiders have finished.
Can someone show me how to do this?
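For reference, on a recent Scrapy version the simplest way to get this behaviour is to let CrawlerRunner tell you when everything is done: its join() method returns a Deferred that fires only after every scheduled crawl has finished. A minimal sketch, where MySpider1 and MySpider2 are stand-ins for your own spider classes:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    # stand-in for one of your real spiders
    name = "spider1"
    start_urls = ["http://example.com"]

    def parse(self, response):
        pass


class MySpider2(MySpider1):
    # stand-in for another spider
    name = "spider2"


configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)

# join() fires only after all scheduled crawls have finished,
# so reactor.stop() runs exactly once, at the very end.
runner.join().addBoth(lambda _: reactor.stop())
reactor.run()

If you want to stay with the signal-based approach instead, the same effect can be had by connecting spider_closed to a small counter and only calling reactor.stop() once the counter reaches the number of spiders you started.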
I am trying to use scrapy to scrape a website whose information is spread over multiple pages.
My code is:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item
I am trying to scrape all the pages until it reaches the last one... sometimes there are more pages than others, so it is hard to say exactly where the page numbers end.
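A common pattern for this kind of open-ended paging is to keep requesting the next PageNumber for as long as the current page still contains cards, so the crawl stops naturally on the first empty page. A sketch under that assumption (the class name is made up and the item is reduced to one field; the real extraction would stay as in the spider above):

import re

import scrapy


class TcgPagingSpider(scrapy.Spider):
    # sketch: follow ?PageNumber=N until a page comes back empty
    name = "tcg_paging"
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        titles = response.xpath("//div[@class='magicCard']")
        for title in titles:
            yield {
                "cardname": title.xpath(".//li[@class='cardName']/a/text()").extract_first(),
            }

        # If this page still had cards, request the next PageNumber.
        if titles:
            page = int(re.search(r"PageNumber=(\d+)", response.url).group(1))
            next_url = re.sub(r"PageNumber=\d+", "PageNumber=%d" % (page + 1), response.url)
            yield scrapy.Request(next_url, callback=self.parse)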
Spider for reference:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem


class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            # print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            # print url
            yield item
When I run it, the terminal outputs the following:
2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = …
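That output also hints at why url comes back empty: the links inside the shopthepost-widget are created by the JavaScript in that script tag, so they are not present in the HTML that Scrapy downloads; only static attributes such as data-widget-id are. Rendering the page (for example with scrapy-splash or Selenium) would be needed to see the anchors themselves. A minimal sketch of what is recoverable without a JavaScript engine:

import scrapy


class WidgetIdSpider(scrapy.Spider):
    # sketch: only the widget's static attributes exist in the raw HTML
    name = "widget_id"
    start_urls = ["http://www.stopitrightnow.com/"]

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            # the <a> tags are injected client-side, so @href is never there
            yield {"widget_id": widget.xpath("@data-widget-id").extract_first()}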
I am trying to use Scrapy to collect all the links from websites whose DNS lookup fails.
The problem is that every website that returns without an error is printed by the parse_obj method, but when a URL comes back with a DNS lookup failure, the callback parse_obj is never called.
I want to collect all the domains that fail with the "DNS lookup failed" error; how can I do that?
Log:
2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET http://domain.com> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET http://expired-domain.com/> (failed 1 times): DNS lookup failed: address 'expired-domain.com' not found: [Errno 11001] getaddrinfo failed.
Code:
class MyItem(Item):
    url = Field()
    …
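One way to catch these failures is to attach an errback to every request and inspect the failure type there; the errback fires once any retries are exhausted. A sketch, assuming parse_obj is the normal callback and the URLs to test come from start_urls:

import scrapy
from twisted.internet.error import DNSLookupError


class DnsCheckSpider(scrapy.Spider):
    # sketch: collect domains whose DNS lookup fails
    name = "dns_check"
    start_urls = ["http://domain.com", "http://expired-domain.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # errback is called for download errors, including DNS failures
            yield scrapy.Request(url, callback=self.parse_obj,
                                 errback=self.on_error, dont_filter=True)

    def parse_obj(self, response):
        self.logger.info("OK: %s", response.url)

    def on_error(self, failure):
        if failure.check(DNSLookupError):
            # failure.request is the Request whose hostname could not be resolved
            self.logger.info("DNS lookup failed: %s", failure.request.url)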
I am using Scrapy on my dedicated server and I would like to know how to get the best performance out of my crawler.
Here are my custom settings:
custom_settings = {
    'RETRY_ENABLED': True,
    'DEPTH_LIMIT': 0,
    'DEPTH_PRIORITY': 1,
    'LOG_ENABLED': False,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
    'CONCURRENT_REQUESTS': 64,
}
I am currently crawling about 200 links per minute.
Server:
32 GB RAM : DDR4 ECC 2133 MHz
CPU : 4c/8t : 2,2 / 2,6 GHz
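For comparison, the broad-crawl advice in the Scrapy documentation mostly comes down to raising concurrency and trimming per-request overhead. A sketch of settings commonly tuned for raw throughput; the numbers are starting points to benchmark against the 200 links/min above, not guarantees:

custom_settings = {
    'CONCURRENT_REQUESTS': 256,            # raise until CPU or bandwidth saturates
    'CONCURRENT_REQUESTS_PER_DOMAIN': 16,  # only helps if many domains are crawled
    'REACTOR_THREADPOOL_MAXSIZE': 20,      # more threads for DNS resolution
    'LOG_LEVEL': 'INFO',                   # cheaper than DEBUG if logging is on
    'COOKIES_ENABLED': False,              # skip cookie handling if not needed
    'RETRY_ENABLED': False,                # drop failed pages instead of retrying
    'DOWNLOAD_TIMEOUT': 15,                # give up on slow hosts sooner
    'AUTOTHROTTLE_ENABLED': False,
}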
I am using scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I would like to know how many items were found for each URL. From the scrapy stats I can see 'item_scraped_count': 3500.
However, I need this count for each start_url separately. There is also the referer of each crawled page, which I could perhaps use to count each URL's items by hand:
2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)
But I would like to know whether scrapy has built-in support for this.
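There is no per-start_url breakdown built in, but the stats collector is writable from the spider, so one workable approach is to tag every request with the start URL it came from and bump a custom counter per tag; the totals then appear next to item_scraped_count in the final stats dump. A sketch (the stat key and the extraction stub are made up for illustration):

import scrapy


class PerUrlCountSpider(scrapy.Spider):
    name = "per_url_count"
    start_urls = ["https://www.youtube.com/results?q=billys&sp=EgQIAhAB"]

    def start_requests(self):
        for url in self.start_urls:
            # remember which start URL this request belongs to
            yield scrapy.Request(url, callback=self.parse, meta={"start_url": url})

    def parse(self, response):
        start_url = response.meta["start_url"]

        for item in self.extract_items(response):
            # one counter per start URL, e.g. items_per_start_url/https://...
            self.crawler.stats.inc_value("items_per_start_url/%s" % start_url)
            yield item

        # propagate the tag to follow-up requests
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse,
                                 meta={"start_url": start_url})

    def extract_items(self, response):
        # placeholder for the real extraction logic
        return []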
Following the scrapy tutorial, I made a simple image crawler (to scrape images of Bugattis). This is illustrated in the example below.
However, following the guide left me with a crawler that does not work! It finds all the URLs but does not download the images.
I found a duct-tape fix: replacing the ITEM_PIPELINES entry and IMAGES_STORE with
ITEM_PIPELINES['scrapy.pipeline.images.FilesPipeline'] = 1 and
IMAGES_STORE -> FILES_STORE
but I don't understand why this works. I want to use the ImagesPipeline as documented by scrapy.
Example
settings.py
BOT_NAME = 'imagespider'
SPIDER_MODULES = ['imagespider.spiders']
NEWSPIDER_MODULE = 'imagespider.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/home/user/Desktop/imagespider/output"
items.py
import scrapy


class ImageItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
imagespider.py
from imagespider.items import ImageItem
import scrapy


class ImageSpider(scrapy.Spider):
    name = "imagespider"
    start_urls = (
        "https://www.find.com/search=bugatti+veyron",
    )

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(file_urls=[img_url])
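The duct-tape fix most likely works because the two pipelines read different item fields: FilesPipeline collects URLs from file_urls and stores its results in files, while ImagesPipeline expects image_urls and images (and additionally needs Pillow installed). Since ImageItem only declares file_urls/files, the ImagesPipeline never sees anything to download. A sketch of the item as ImagesPipeline expects it:

import scrapy


class ImageItem(scrapy.Item):
    # ImagesPipeline reads the URLs from 'image_urls' and writes results to 'images'
    image_urls = scrapy.Field()
    images = scrapy.Field()

The spider would then yield ImageItem(image_urls=[response.urljoin(img_url)]) so that relative src values are resolved as well, and the original ITEM_PIPELINES and IMAGES_STORE settings can stay as they are.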
I am new to scrapy. I am trying to scrape the Yellow Pages for learning purposes and everything works fine, but I also want the email addresses. To do that I need to visit the links extracted inside parse and parse them with another function, parse_email, but it doesn't fire.
I mean, I tested the parse_email function and it works, but not from inside the main parse function. I want the parse_email function to get the source of the link, so I call it with a callback, but it only returns links like <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813> when it should return the email. For some reason the parse_email function does not run; it just returns the link without opening the page.
Here is part of the code, with my comments:
import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')


class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'
            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'
            # extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()
            # joining and making complete …
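For reference, the usual way to hand a page off to a second callback is to yield a Request for it and carry the half-built item along in meta; only a yielded Request is ever downloaded, so calling parse_email() directly never fetches anything. A sketch of that pattern (the detail-page and e-mail selectors are placeholders, not the real Yellow Pages markup):

import scrapy


class YellowSketchSpider(scrapy.Spider):
    # sketch: follow each listing's detail page and finish the item there
    name = "yellow_sketch"
    start_urls = ["https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA"]

    def parse(self, response):
        for brickset in response.css(".info"):
            item = {"name": brickset.css("h3 a ::text").extract_first()}
            detail_url = brickset.css("h3 a ::attr(href)").extract_first()
            if detail_url:
                # the half-built item travels to the next callback via meta
                yield scrapy.Request(response.urljoin(detail_url),
                                     callback=self.parse_email,
                                     meta={"item": item})

    def parse_email(self, response):
        item = response.meta["item"]
        # placeholder selector; adjust it to the real detail-page markup
        item["email"] = response.css("a[href^='mailto:'] ::attr(href)").extract_first()
        yield item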
So I am trying to use CrawlSpider and understand the following example from the Scrapy Docs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item …
For my scrapy project I am currently using the FilesPipeline. The downloaded files are stored with the SHA1 hash of their URL as the file name, for example:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
How can I store the files under a custom file name instead?
In the example above, I would like the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so that the names stay unique but become recognizable.
As a first step I experimented with the pipelines.py of my project, without much success:
import scrapy
from scrapy.pipelines.images import FilesPipeline
from scrapy.exceptions import DropItem


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)
And in my settings.py:
ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 300
    'io_spider.pipelines.MyFilesPipeline': 200
}
A similar question has already been asked, but it targets images rather than files.
Any help would be greatly appreciated.
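For what it is worth, a variant of the pipeline above that stays close to the built-in behaviour is to let FilesPipeline compute its usual full/<sha1>.<ext> path and only prefix the item's name onto the file part. A sketch, assuming the item carries name and file_url fields as in the snippet above and using the older three-argument file_path signature (newer Scrapy versions also pass an item argument); note that the original snippet yields a bare Request() that is never imported, which would raise a NameError:

import os

import scrapy
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # use scrapy.Request explicitly instead of an un-imported Request
        yield scrapy.Request(item['file_url'], meta={'name': item['name']})

    def file_path(self, request, response=None, info=None):
        # let the stock pipeline build 'full/<sha1>.<ext>' first ...
        default_path = super(MyFilesPipeline, self).file_path(request, response=response, info=info)
        dirname, basename = os.path.split(default_path)
        # ... then prepend the readable name: e.g. 'full/product1_<sha1>.pdf'
        return os.path.join(dirname, '%s_%s' % (request.meta['name'], basename))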