Missing scheme in request url

Tob*_*oby 20 python url scrapy

I've been stuck on this bug for a while; the error message is as follows:

File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

The Scrapy code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.http import Request
    from spyder.items import SypderItem

    import sys
    import MySQLdb
    import hashlib
    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    # _*_ coding: utf-8 _*_

    class some_Spyder(CrawlSpider):
        name = "spyder"

        def __init__(self, *a, **kw):
            # catch the spider stopping
            # dispatcher.connect(self.spider_closed, signals.spider_closed)
            # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)

            self.allowed_domains = "domainname.com"
            self.start_urls = "http://www.domainname.com/"
            self.xpaths = '''//td[@class="CatBg" and @width="25%" 
                          and @valign="top" and @align="center"]
                          /table[@cellspacing="0"]//tr/td/a/@href'''

            self.rules = (
                Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
                Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
                )

            super(spyder, self).__init__(*a, **kw)

        def parse_items(self, response):
            sel = Selector(response)
            items = []
            listings = sel.xpath('//*[@id="tabContent"]/table/tr')

            item = IgeItem()
            item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

            items.append(item)
            return items

I'm fairly sure this has to do with the URLs I'm asking Scrapy to follow via the LinkExtractor. When extracting them in the shell they look like this:

data=u'cart.php?target=category&category_id=826'

Compared to a URL extracted from a working spider:

data=u'/path/someotherpath/category.php?query=someval'
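A relative URL like the one above only becomes fetchable once it is joined against the page's base URL. A minimal sketch using the standard library (newer Scrapy versions expose the same operation as `response.urljoin`, though that may not exist in the 0.20 release used here):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.domainname.com/"
relative = "cart.php?target=category&category_id=826"

# Joining resolves the relative link against the page it was found on,
# producing an absolute URL with a scheme that Request() will accept.
print(urljoin(base, relative))
# -> http://www.domainname.com/cart.php?target=category&category_id=826
```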

I've looked at a few related questions on SO, such as one about downloading images with Scrapy, but from reading them I think my problem may be slightly different.

I also took a look at this - http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html

which explains that the error is raised if the URL is missing a ":". Looking at the start_urls I've defined, I can't quite see why this error shows up, since the scheme is clearly defined.

Thanks for reading,

Toby

Guy*_*ely 24

Change start_urls to:

self.start_urls = ["http://www.bankofwow.com/"]


ric*_*ier 5

Prepend "http" or "https" to the URL.

  • This is another way to hit the same error: writing a URL without "http". (2 upvotes)
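A small helper along those lines (a sketch, not part of the answer's code; `ensure_scheme` is a hypothetical name):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def ensure_scheme(url, default="http"):
    # Prepend a default scheme only when the URL has none,
    # which is exactly the condition Scrapy's Request complains about.
    if urlparse(url).scheme:
        return url
    return "%s://%s" % (default, url)

print(ensure_scheme("www.bankofwow.com"))    # -> http://www.bankofwow.com
print(ensure_scheme("https://example.com"))  # unchanged
```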

pau*_*rth 5

As @Guy answered earlier, the start_urls attribute must be a list. The exceptions.ValueError: Missing scheme in request url: h message comes from that: the "h" in the error message is the first character of "http://www.bankofwow.com/", which is being interpreted as a list (of characters).
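The behaviour is easy to reproduce outside Scrapy; iterating over a string yields one-character strings:

```python
start_urls = "http://www.bankofwow.com/"  # a string, not a list

# Scrapy treats start_urls as an iterable of URLs; iterating a string
# gives single characters, so the first "URL" it sees is just "h".
first = list(start_urls)[0]
print(first)  # -> h
```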

allowed_domains must also be a list of domains, otherwise you will get filtered "offsite" requests.

Change restrict_xpaths to

self.xpaths = """//td[@class="CatBg" and @width="25%" 
                    and @valign="top" and @align="center"]
                   /table[@cellspacing="0"]//tr/td"""

it should represent an area in the document where links are to be found; it should not be the link URLs directly.

From http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor:

restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.
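To illustrate the distinction with a self-contained sketch (plain stdlib XPath rather than Scrapy's extractor, on made-up markup): restrict_xpaths names the container region, and the links are then found inside it.

```python
import xml.etree.ElementTree as ET

html = ('<html><body><table><tr>'
        '<td class="CatBg"><a href="cart.php?category_id=826">Cat</a></td>'
        '<td class="Other"><a href="other.php">Other</a></td>'
        '</tr></table></body></html>')
root = ET.fromstring(html)

# Select the *region* (the td elements), not the @href values directly;
# the link-extraction step then scans each region for <a> tags.
regions = root.findall('.//td[@class="CatBg"]')
links = [a.get('href') for td in regions for a in td.findall('.//a')]
print(links)  # -> ['cart.php?category_id=826']
```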

Lastly, it's customary to define these as class attributes instead of setting them in __init__:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from bow.items import BowItem

import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

# _*_ coding: utf-8 _*_

class bankOfWow_spider(CrawlSpider):
    name = "bankofwow"

    allowed_domains = ["bankofwow.com"]
    start_urls = ["http://www.bankofwow.com/"]
    xpaths = '''//td[@class="CatBg" and @width="25%"
                  and @valign="top" and @align="center"]
                  /table[@cellspacing="0"]//tr/td'''

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=(xpaths,))),
        Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        super(bankOfWow_spider, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')

        item = BowItem()  # use the imported item class (IgeItem was undefined)
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

        items.append(item)
        return items