use*_*364 5 python selenium scrapy web-scraping selenium-webdriver
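I want to capture 10 links.
When I run the spider, the links do end up in the JSON file, but it looks like Selenium runs twice. What is the problem?
Please guide me, thank you.
I still get an error like this: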
2014-08-06 10:30:26+0800 [spider2] DEBUG: Scraped from <200 http://www.test/a/1>
{'link': u'http://www.test/a/1'}
2014-08-06 10:30:26+0800 [spider2] ERROR: Spider error processing <GET http://www.test/a/1>
    Traceback (most recent call last):
      ........
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 571, in create_connection
        raise err
    socket.error: [Errno 61] Connection refused
Here is my code:
from selenium import webdriver
from scrapy.spider import Spider
from ta.items import TaItem
from selenium.webdriver.support.wait import WebDriverWait
from scrapy.http.request import Request

class ProductSpider(Spider):
    name = "spider2"
    start_urls = ['http://www.test.com/']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(20)
        next = self.driver.find_elements_by_css_selector("div.body .heading a")
        for a in next:
            item = TaItem()
            item['link'] = a.get_attribute("href")
            yield Request(url=item['link'], meta={'item': item}, callback=self.parse_detail)

    def parse_detail(self,response):
        item = response.meta['item']
        yield item
        self.driver.close()
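The problem is that you are closing the driver too early.
You should close it only when the spider has finished its work. Listen for the spider_closed signal: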
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from scrapy.spider import Spider
from ta.items import TaItem
from scrapy.http.request import Request

class ProductSpider(Spider):
    name = "spider2"
    start_urls = ['http://www.test.com/']

    def __init__(self):
        self.driver = webdriver.Firefox()
        # Run spider_closed() once, when the spider has finished.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(20)
        next = self.driver.find_elements_by_css_selector("div.body .heading a")
        for a in next:
            item = TaItem()
            item['link'] = a.get_attribute("href")
            yield Request(url=item['link'], meta={'item': item}, callback=self.parse_detail)

    def parse_detail(self,response):
        # Called once per link, so the shared driver must not be closed here.
        item = response.meta['item']
        yield item

    def spider_closed(self, spider):
        # The browser is closed only after every request has been handled.
        self.driver.close()
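To see why closing the driver inside the per-link callback can produce errors even though the links are still collected, here is a minimal sketch in plain Python. It does not use Scrapy or Selenium: FakeDriver and run_callbacks are hypothetical names made up for this illustration, and the assumption is that calling close() on a browser that is already gone fails with a connection error, as in the traceback in the question.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver (not a Selenium class)."""

    def __init__(self):
        self.is_open = True

    def close(self):
        # Assumption: talking to a browser that has already been closed fails.
        if not self.is_open:
            raise ConnectionError("Connection refused")
        self.is_open = False


def run_callbacks(links, driver, close_in_callback):
    """Simulate one callback per link, all sharing a single driver."""
    items, errors = [], []
    for link in links:
        try:
            items.append({"link": link})  # the item is produced first
            if close_in_callback:
                driver.close()            # question's pattern: close per callback
        except ConnectionError as exc:
            errors.append((link, str(exc)))
    if not close_in_callback:
        driver.close()                    # answer's pattern: close once at the end
    return items, errors


links = ["http://www.test/a/1", "http://www.test/a/2", "http://www.test/a/3"]

early_items, early_errors = run_callbacks(links, FakeDriver(), close_in_callback=True)
late_items, late_errors = run_callbacks(links, FakeDriver(), close_in_callback=False)

print(len(early_items), len(early_errors))
print(len(late_items), len(late_errors))
```

In this simulation every link still yields an item in both runs, but closing inside the callback raises an error for each callback after the first, while closing once at the end raises none.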
See also: scrapy: Call a function when a spider quits.