items.py classes
import scrapy
from scrapy.item import Item, Field
import json


class Attributes(scrapy.Item):
    description = Field()
    pages = Field()
    author = Field()


class Vendor(scrapy.Item):
    title = Field()
    order_url = Field()


class bookItem(scrapy.Item):
    title = Field()
    url = Field()
    marketprice = Field()
    images = Field()
    price = Field()
    attributes = Field()
    vendor = Field()
    time_scraped = Field()
My scraper
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapper.items import bookItem, Attributes, Vendor
import couchdb
import logging
import json
import time
from couchdb import Server


class libertySpider(CrawlSpider):
    couch = couchdb.Server()
    db = couch['python-tests']

    name = "libertybooks"
    allowed_domains = ["libertybooks.com"]
    unvisited_urls = []
    visited_urls = []
    start_urls = ["http://www.libertybooks.com"]
    url = ["http://www.kaymu.pk"]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
    total = 0
    productpages = 0
    exceptionnum = 0

    def parse_item(self, response):
        # Product pages have "pid" in their URL.
        if response.url.find("pid") != -1:
            with open("number.html", "w") as w:
                self.total = self.total + 1
                w.write(str(self.total) + "," + str(self.productpages))
            itm = bookItem()
            attrib = Attributes()
            ven = Vendor()
            images = []
            try:
                name = response.xpath('//span[@id="pagecontent_lblbookName"]/text()').extract()[0]
                name = name.encode('utf-8')
            except:
                name = "name not found"
            try:
                price = response.xpath('//span[@id="pagecontent_lblPrice"]/text()').extract()[0]
                price = price.encode('utf-8')
            except:
                price = -1
            try:
                marketprice = response.xpath('//span[@id="pagecontent_lblmarketprice"]/text()').extract()[0]
                marketprice = marketprice.encode('utf-8')
            except:
                marketprice = -1
            try:
                pages = response.xpath('//span[@id="pagecontent_spanpages"]/text()').extract()[0]
                pages = pages.encode('utf-8')
            except:
                pages = -1
            try:
                author = response.xpath('//span[@id="pagecontent_lblAuthor"]/text()').extract()[0]
                author = author.encode('utf-8')
            except:
                author = "author not found"
            try:
                description = response.xpath('//span[@id="pagecontent_lblbookdetail"]/text()').extract()[0]
                description = description.encode('utf-8')
            except:
                description = "des: not found"
            try:
                image = response.xpath('//img[@id="pagecontent_imgProduct"]/@src').extract()[0]
                image = image.encode('utf-8')
            except:
                image = "#"
            ven['title'] = 'libertybooks'
            ven['order_url'] = response.url
            itm['vendor'] = ven
            itm['time_scraped'] = time.ctime()
            itm['title'] = name
            itm['url'] = response.url
            itm['price'] = price
            itm['marketprice'] = marketprice
            itm['images'] = images
            attrib['pages'] = pages
            attrib['author'] = author
            attrib['description'] = description
            itm['attributes'] = attrib
            self.saveindb(itm)
            return itm

    def saveindb(self, obj):
        logging.debug(obj)
        self.db.save(obj)
I'm a beginner with scrapy and couchdb. I also tried converting the item object to JSON with json.dumps(itm, default=lambda o: o.__dict__, sort_keys=True, indent=4), but got the same error. So, is there a way to make my classes JSON serializable so that they can be stored in couchdb?
Well, the short answer would be to just use ScrapyJSONEncoder:
from scrapy.utils.serialize import ScrapyJSONEncoder

_encoder = ScrapyJSONEncoder()

...

    def saveindb(self, obj):
        logging.debug(obj)
        self.db.save(_encoder.encode(obj))
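(Note that couchdb's db.save() expects a mapping, while encode() returns a JSON string; if save() rejects the string, decoding it back first, e.g. self.db.save(json.loads(_encoder.encode(obj))), stores the item as plain dicts.)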
The longer version is: if you want this spider to grow (if it isn't meant to be a throwaway), you probably want to use a pipeline to store the items in CouchDB and keep the concerns separated: crawling/scraping code in the spider, database-storage code in a pipeline.
At first this may look like over-engineering, but it really helps once the project starts to grow, and it makes testing easier.
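For illustration, here is a minimal sketch of such a pipeline. The class name CouchDBPipeline, the module path scrapper.pipelines, and the reuse of the 'python-tests' database are assumptions made for this example; open_spider and process_item are the standard Scrapy pipeline hooks.

import json

import couchdb
from scrapy.utils.serialize import ScrapyJSONEncoder


class CouchDBPipeline(object):
    def open_spider(self, spider):
        # One encoder and one database handle per spider run.
        self.encoder = ScrapyJSONEncoder()
        self.couch = couchdb.Server()         # default server: http://localhost:5984/
        self.db = self.couch['python-tests']  # assumes this database already exists

    def process_item(self, item, spider):
        # encode() turns the item (nested items included) into a JSON string;
        # CouchDB documents must be JSON objects, so decode it back into
        # plain dicts before saving.
        self.db.save(json.loads(self.encoder.encode(item)))
        return item

You would then enable it in settings.py (e.g. ITEM_PIPELINES = {'scrapper.pipelines.CouchDBPipeline': 300}) and drop saveindb() from the spider.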