Scrapy does not crawl all pages

Question

Here is my working code:

from scrapy.item import Item, Field

class Test2Item(Item):
    title = Field()

from scrapy.http import Request
from scrapy.conf import settings
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
    DOWNLOAD_DELAY = 2

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        i['title'] = (hxs.select(('//div[@class="innerbox"]/h1/text()')).extract()[0]).strip(' \t\n\r')
        return i


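It only scrapes 10 or 15 records, and the number is different on every run! I can't manage to get all the pages that match a pattern like http://www.khmer24.com/ad/any-words/67-anynumber.html.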

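I really suspect that Scrapy finishes crawling early because of duplicate requests. People suggest using dont_filter = True, but I don't know where to put it in my code.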

I'm new to Scrapy and really need help.

Answer 1

Jav*_* Xu 5

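1. "They suggest using dont_filter = True, but I don't know where to put it in my code."

This argument is set in BaseSpider (scrapy/spider.py), which CrawlSpider inherits from. It is already True by default for the requests created from start_urls.

2. "It only scrapes 10 or 15 records."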

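Reason: the start_urls are not well chosen. In this question, the spider starts crawling at http://www.khmer24.com/. Let's assume that page has 10 URLs to follow that satisfy the pattern. The spider then crawls those 10 URLs. But because those pages contain so few links that satisfy the pattern, the spider gets only a few new URLs (or none at all), and the crawl stops.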
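The reasoning above can be reproduced with a small offline simulation. This is only a sketch on a hypothetical toy site (the `SITE` link graph and every URL in it are made up for illustration), not Scrapy itself. It compares a crawl that requests only links matching the ad pattern with a crawl that follows every link. Both crawls fetch each URL once, which loosely mimics duplicate filtering.

```python
import re
from collections import deque

# Hypothetical toy site: page URL -> links found on that page.
SITE = {
    "/": ["/ad/phone/67-1.html", "/ad/car/67-2.html", "/all-ads.html"],
    "/ad/phone/67-1.html": ["/", "/ad/car/67-2.html"],
    "/ad/car/67-2.html": ["/"],
    "/all-ads.html": ["/ad/phone/67-1.html", "/ad/bike/67-3.html", "/all-ads-2.html"],
    "/all-ads-2.html": ["/ad/house/67-4.html", "/ad/land/67-5.html"],
    "/ad/bike/67-3.html": [],
    "/ad/house/67-4.html": [],
    "/ad/land/67-5.html": [],
}

# Same pattern as in the question's rule.
AD_PATTERN = re.compile(r"ad/.+/67-\d+\.html")


def crawl(start, follow_all):
    """Breadth-first crawl; returns the set of ad pages that were reached."""
    seen = {start}  # each URL is requested at most once
    queue = deque([start])
    ads = set()
    while queue:
        url = queue.popleft()
        if AD_PATTERN.search(url):
            ads.add(url)
        for link in SITE.get(url, []):
            is_ad = AD_PATTERN.search(link) is not None
            # With one narrow rule, only ad links are ever requested.
            if (follow_all or is_ad) and link not in seen:
                seen.add(link)
                queue.append(link)
    return ads


narrow = crawl("/", follow_all=False)
broad = crawl("/", follow_all=True)
print(len(narrow), len(broad))  # prints: 2 5
```

In this toy graph the narrow crawl never requests the listing pages, so it stops after the two ads linked from the home page. The one-request-per-URL rule is not what limits it: the same rule applies to the broad crawl, which still reaches all five ads.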

Possible solution: the reason I gave above only restates the opinion of another user (冰雪). The same goes for the solution.

I recommend using the "All ads" page as start_urls. (You could also keep the home page as start_urls and use the new rules.)

New rules:

rules = (
    # Extract links matching the "ad/any-words/67-anynumber.html" pattern
    # and parse them with the spider's parse_item method (do NOT follow them).
    # This rule is listed first: when several rules match the same link,
    # CrawlSpider applies only the first matching rule.
    Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),

    # Extract all remaining links and follow links from them
    # (no callback means follow=True by default;
    # if "allow" is not given, every link matches).
    Rule(SgmlLinkExtractor()),
)
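Before running the spider, it can help to check offline which URLs the allow pattern accepts. The sketch below assumes the link extractor applies the regular expression to the absolute URL with search semantics (a match anywhere in the string). The sample URLs are hypothetical and only follow the shape described in the question.

```python
import re

# Same regular expression as the "allow" argument in the rules.
AD_PATTERN = re.compile(r'ad/.+/67-\d+\.html')

# Hypothetical sample URLs, made up for illustration.
samples = [
    "http://www.khmer24.com/ad/any-words/67-12345.html",  # expected to match
    "http://www.khmer24.com/ad/any-words/68-12345.html",  # wrong prefix "68-"
    "http://www.khmer24.com/",                            # home page
    "http://www.khmer24.com/ad/67-12345.html",            # no words segment
]

# Keep only the URLs in which the pattern is found.
matches = [url for url in samples if AD_PATTERN.search(url)]

for url in matches:
    print(url)
```

Only the first sample is accepted. Listing pages such as the home page do not match, so a spider that has only this one rule never requests them.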


See also: SgmlLinkExtractor, CrawlSpider example
