Scrapy HtmlXPathSelector

Kea*_*ver 5 scrapy

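I'm just trying out Scrapy and trying to get a basic spider working. I know I'm probably missing something obvious, but I've tried everything I can think of.

The error I get is: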

line 11, in JustASpider
    sites = hxs.select('//title/text()')
NameError: name 'hxs' is not defined

My code is very basic at the moment, but I still can't find where I'm going wrong. Thanks for your help!

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//title/text()')
        for site in sites:
            print site.extract()


SPIDER = JustASpider()

pin*_*nny 7

The code looks like it was written for a very old version of Scrapy. I suggest using this code instead:

from scrapy.spider import Spider
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites
        # A for loop calling site.extract() is not needed here:
        # .extract() above already returns the title text as a list of strings.
Hope it helps and serves as a good example.


Kea*_*ver 6

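I removed the SPIDER call at the end and removed the for loop. There was only one title tag (as one would expect), and that seemed to be throwing off the loop. My working code is below: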

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//title/text()')
        final = titles.extract()
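One more observation on the original traceback: it says `in JustASpider`, not `in parse`. That means Python reached the `hxs.select(...)` line while executing the class body, not while running the method. A likely cause is inconsistent indentation (for example tabs mixed with spaces), but that is only a guess since the original file isn't shown. Here is a minimal sketch that reproduces the same kind of NameError without Scrapy; `response.upper()` and `.split()` are stand-ins I chose for the selector calls.

```python
# Hypothetical stand-in for the spider; no Scrapy required.
error_message = None

try:
    class JustASpider(object):
        def parse(self, response):
            hxs = response.upper()  # `hxs` is local to parse()

        # This line sits at class-body level, so it runs while the class
        # is being defined, where `hxs` does not exist.
        sites = hxs.split()
except NameError as exc:
    error_message = str(exc)

print(error_message)
```

Re-indenting the method body consistently makes the statement part of `parse()` again, which would explain why the rewritten version works.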