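So I'm building this spider, and the crawl itself works fine: I can log into the shell, browse the HTML page, and test my XPath queries.
I'm not sure what I'm doing wrong. Any help would be appreciated. I have reinstalled Twisted, but that changed nothing.
My spider looks like this -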
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name="spider1"
    #allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')
        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract
            item['author'] = site.select('div[2]/span/a/text()').extract
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)
        return items
When I run the spider with scrapy crawl Spider1, I get the following error -
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled …
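
One detail in the spider may be worth checking: the title and author lines reference `.extract` without parentheses, while the price line calls `.extract()`. In Python, a method name without parentheses evaluates to the bound method object, not to the method's result. Whether that is what triggers the error cannot be confirmed from the truncated log. Below is a minimal sketch of the difference. `FakeSelectorList` is a hypothetical stand-in, not part of Scrapy, used only so the snippet runs without a Scrapy install.

```python
class FakeSelectorList(list):
    """Hypothetical stand-in for the list-like object returned by select()."""

    def extract(self):
        # Return the plain string values held by this list.
        return [str(node) for node in self]


nodes = FakeSelectorList(["Some Title"])

not_called = nodes.extract   # no parentheses: a bound method object, not data
called = nodes.extract()     # with parentheses: the extracted list of strings

print(callable(not_called))  # True
print(called)                # ['Some Title']
```

If the same thing happens in the real spider, the item fields for title and author would hold method objects instead of text, so adding the parentheses on those two lines would be the first thing to try.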