Avoid getting banned from websites when using Scrapy

Question


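I am trying to download data from gsmarena. Sample code for downloading the HTC One Me specifications from "http://www.gsmarena.com/htc_one_me-7275.php" is given below.

The data on the site is organized into tables and table rows. The data format is: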

table header > td[@class='ttl'] > td[@class='nfo']
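To make that layout concrete, here is a minimal sketch that pairs each `ttl` cell with its `nfo` cell. The markup in `SAMPLE_TABLE` is made up for illustration (it is not copied from gsmarena), and `rows_to_pairs` is a hypothetical helper name; it uses only the standard library, so it assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample that mimics the described layout (not real site markup)
SAMPLE_TABLE = """
<table>
  <tr><th rowspan="2">Display</th>
      <td class="ttl">Type</td><td class="nfo">LCD</td></tr>
  <tr><td class="ttl">Size</td><td class="nfo">5.2 inches</td></tr>
</table>
"""


def rows_to_pairs(table_markup):
    """Return (header text, [(title, value), ...]) for one spec table."""
    table = ET.fromstring(table_markup)
    section = table.findtext(".//th")
    pairs = []
    for row in table.iter("tr"):
        ttl = row.find("td[@class='ttl']")
        nfo = row.find("td[@class='nfo']")
        # Skip rows that do not carry both a title cell and a value cell
        if ttl is not None and nfo is not None:
            title = "".join(ttl.itertext()).strip()
            value = "".join(nfo.itertext()).strip()
            pairs.append((title, value))
    return section, pairs


if __name__ == "__main__":
    print(rows_to_pairs(SAMPLE_TABLE))
```

Real pages are rarely well-formed XML, which is why the spider relies on Scrapy's selectors instead of `xml.etree`.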

The items.py file:

import scrapy

class gsmArenaDataItem(scrapy.Item):
    # Text taken from the table header (<th>)
    phoneName = scrapy.Field()
    # "title: value" pairs built from the ttl/nfo cells
    phoneDetails = scrapy.Field()

The spider file:

from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        details = []
        # use a distinct loop variable instead of rebinding tableRows
        for table in hxs.css("div#specs-list table"):
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                # collect one "title: value" pair per spec row
                details.append(ttl_value + ": " + nfo_value)
        # join is a str method; the original called it on a list, which raises AttributeError
        phone['phoneDetails'] = ", ".join(details)
        yield phone

However, I get banned as soon as I try to load the page in the scrapy shell using:

"http://www.gsmarena.com/htc_one_me-7275.php"

I even tried setting DOWNLOAD_DELAY = 3 in settings.py.

Please suggest what I should do.

Answer 1

FBi*_*idu 6

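This is probably caused by Scrapy's user agent. As you can see here, the BOT_NAME variable is used to compose the USER_AGENT. My guess is that the site you want to crawl is blocking it. I tried to look at their robots.txt file but found no clue there.

You can try setting a custom user agent. Add the following line to your settings.py: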
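For readers who want to repeat the robots.txt check offline, a minimal sketch with the standard library's `urllib.robotparser` follows. The rules in `SAMPLE_ROBOTS_TXT` are invented for illustration and are not gsmarena's actual file, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; NOT gsmarena's actual file
SAMPLE_ROBOTS_TXT = """\
User-agent: Scrapy
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.gsmarena.com/htc_one_me-7275.php"
    for agent in ("Scrapy/1.0 (+http://scrapy.org)", "Mozilla/5.0"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Note that robots.txt is only advisory: a site can block a user agent at the server without listing it there, which would be consistent with finding no clue in the file.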

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0"

In fact, your USER_AGENT can be any string that a real browser would send.
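Since any browser-like string works, one possible extension is to pick the user agent per request rather than fixing a single value. The sketch below is an assumption-laden illustration, not part of the answer: `RandomUserAgentMiddleware` and the `USER_AGENTS` pool are hypothetical names, and the second string is just another Firefox-style example. A Scrapy downloader middleware is a plain class with a `process_request(request, spider)` method, so the class itself needs no Scrapy import.

```python
import random

# Hypothetical pool of browser User-Agent strings; the first is the one from the answer
USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware sketch: set a browser-like User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue handling the request normally
        return None
```

To try it, the class would have to be registered in the DOWNLOADER_MIDDLEWARES setting in settings.py under whatever module path it is saved at in your project. For a single page, the one-line USER_AGENT setting above is simpler and likely sufficient.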
