Dam*_*ian 6 python web-crawler scrapy
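I have the following problem:
I don't want pages to be crawled more than once. I modified the middleware and added a print statement to test whether it correctly classifies pages that have already been seen. It does.
Nevertheless, parsing seems to run multiple times, because the JSON file I get contains duplicate entries.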
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from crawlspider.items import KickstarterItem
from HTMLParser import HTMLParser

### code for stripping off HTML tags:
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return str(''.join(self.fed))

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
###

items = []

class MySpider(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('projects/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()

        item['date'] = hxs.select('//*[@id="about"]/div[2]/ul/li[1]/text()').extract()
        item['projname'] = hxs.select('//*[@id="title"]/a').extract()
        item['projname'] = strip_tags(str(item['projname']))
        item['projauthor'] = hxs.select('//*[@id="name"]')
        item['projauthor'] = item['projauthor'].select('string()').extract()[0]
        item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
        item['backers'] = strip_tags(str(item['backers']))
        item['collmoney'] = hxs.select('//*[@id="pledged"]/data').extract()
        item['collmoney'] = strip_tags(str(item['collmoney']))
        item['goalmoney'] = hxs.select('//*[@id="stats"]/h5[2]/text()').extract()

        items.append(item)
        return items
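The tag-stripping helper in the spider is written for Python 2 (it imports from the `HTMLParser` module). For readers on Python 3, here is a minimal sketch of the same idea using the standard library's `html.parser`. It is an adaptation, not the asker's code: the class and function names are kept, but `stripper.close()` and the `super().__init__()` call are additions.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text nodes of the markup fed to the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.fed.append(data)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    stripper = MLStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()


print(strip_tags('<a href="/x">My Project</a>'))
```

Note that the spider passes `str(list_of_strings)` to `strip_tags`, so the result there still carries the list's brackets and quotes. Taking the first extracted element before stripping would avoid that.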
My items.py looks like this:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class KickstarterItem(Item):
    # define the fields for your item here like:
    date = Field()
    projname = Field()
    projauthor = Field()
    backers = Field()
    collmoney = Field()
    goalmoney = Field()
    pass
My middleware looks like this:
import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        mm = url.split("/")[4]  # extracts project-id (is a number) from project-URL
        print "_____________", mm
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        self.fingerprints.add(fp)
        if fp in self.fingerprints and fp.isdigit():  # .isdigit() checks whether fp comes from a project ID
            print "______fp is a number (therefore a project-id) and has been encountered before______"
            return True
        if self.file:
            self.file.write(fp + os.linesep)
I added this line to settings.py:
DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'
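I run the script with "scrapy crawl kickstarter -o items.json -t json". I then see the correct print statements from the middleware code. Any idea why the JSON contains multiple entries with the same data?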
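One thing worth checking in the spider, independent of the dupe filter: `parse_item` appends each item to a module-level `items` list and then returns the whole list. Scrapy treats every element a callback returns as output, so each call would emit all earlier items again. The sketch below reproduces that pattern in plain Python, without Scrapy. All names in it (`shared_items`, `collect`, the two `parse_item_*` functions) are hypothetical stand-ins, and it is offered as a possible explanation rather than a confirmed diagnosis.

```python
# Stand-in for the spider's pattern: a shared list that every callback
# appends to and then returns in full. All names here are hypothetical.
shared_items = []


def parse_item_cumulative(page):
    shared_items.append({"projname": page})
    return shared_items  # the whole list, including items from earlier pages


def parse_item_single(page):
    return [{"projname": page}]  # only the item scraped from this page


def collect(callback, pages):
    """Mimics an exporter that writes out everything each callback returns."""
    exported = []
    for page in pages:
        exported.extend(callback(page))
    return exported


pages = ["a", "b", "c"]
cumulative = [item["projname"] for item in collect(parse_item_cumulative, pages)]
single = [item["projname"] for item in collect(parse_item_single, pages)]

print(cumulative)  # ['a', 'a', 'b', 'a', 'b', 'c']
print(single)      # ['a', 'b', 'c']
```

If this is what happens in the crawl, returning only the current item from `parse_item` (and dropping the shared list) would remove the repeats at the source.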
Here are the three modifications that remove the duplicates:
I added this to settings.py:
ITEM_PIPELINES = ['crawlspider.pipelines.DuplicatesPipeline',]
to let Scrapy know that I added a DuplicatesPipeline class to pipelines.py:
from scrapy import signals
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['projname'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['projname'])
            return item
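You don't need to adjust the spider, and you should not use the dupefilter/middleware code I posted earlier.
I have the feeling that my solution doesn't reduce network traffic, though, because the Item object has to be created before it can be evaluated and possibly dropped. But I'm fine with that.
(Solution found by the asker, moved to an answer.)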
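On the traffic concern: saving requests would mean rejecting a URL before it is fetched, which is what the dupe filter from the question was meant to do. In that filter, the ID is added to `self.fingerprints` before the membership test, so the test is true for every numeric ID, including one seen for the first time. Below is a minimal sketch, in plain Python without Scrapy, of the check-then-record order. The `/projects/<numeric id>/...` URL layout is an assumption taken from the comment in the question's code, and `project_id` and `SeenProjects` are hypothetical names.

```python
from urllib.parse import urlparse


def project_id(url):
    """Return the numeric project ID from a /projects/<id>/... URL, else None."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "projects" and parts[1].isdigit():
        return parts[1]
    return None


class SeenProjects:
    """Remembers project IDs and reports whether a URL's project was seen before."""

    def __init__(self):
        self.ids = set()

    def seen(self, url):
        pid = project_id(url)
        if pid is None:
            return False  # not a project URL: this sketch never filters it
        if pid in self.ids:
            return True   # check first ...
        self.ids.add(pid)  # ... and record only on the first encounter
        return False


tracker = SeenProjects()
print(tracker.seen("http://www.kickstarter.com/projects/123/foo"))        # False
print(tracker.seen("http://www.kickstarter.com/projects/123/foo/posts"))  # True
```

In a real dupe filter, non-project URLs would presumably fall back to the default fingerprint check instead of always passing. That part is left out here.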