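I'm trying to get this spider to work. It works when I ask it to scrape each component separately, but it crashes when I try to use Scrapy callback functions that receive arguments. The goal is to crawl multiple pages and scrape the data, writing it to the output JSON file in the format:
author | album | title | lyrics
Each of these values lives on a different web page, which is why I'm trying to use Scrapy callback functions to accomplish this.
In addition, each of the fields above is defined in Scrapy's items.py as: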
import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    album = scrapy.Field()
    title = scrapy.Field()
    lyrics = scrapy.Field()
The spider code starts here:
import scrapy
import re
import json
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

# urls class
class DomainSpider(scrapy.Spider):
    name = "domainspider"
    allowed_domains = ['www.domain.com']
    start_urls = [
        'http://www.domain.com',
    ]
    rules = (
        Rule(LinkExtractor(allow=r'www\.domain\.com/[A-Z][a-zA-Z_/]+$'),
             'parse', follow=True,
        ), …