Scrapy spider does not extract XPath data

Question

Qui*_*ver 2 python xpath scrapy web-scraping

I'm new to Python. I usually use PHP to scrape data, and I'm trying to switch to Python. I started by following the tutorial here:

http://doc.scrapy.org/en/latest/intro/tutorial.html

I want to scrape the countries and capitals from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order

My spider is:

import scrapy

class CountrySpider(scrapy.Spider):
    name = "countryCapitals"
    allowed_domains = ["wikipedia.org"]
    start_urls = [
                    "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"
                    ]

    def parse(self, response):
            for sel in response.xpath('//*[@id="mw-content-text"]/table[2]/tbody/tr'):
                    country = sel.xpath('//td[1]').extract()
                    capital = sel.xpath('td[2]/b/span.text()').extract()
                    print country , capital

It does not print any of the data I expect. Any help would be appreciated.

Answer 1

Jav*_*nxo 5

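It looks like the HTML shown in the browser console differs slightly from the raw page source. For example, as @furas pointed out, the tbody tag is part of the problem: it appears in the browser's view but not in the source that Scrapy receives. The XPath used to extract the capital text is also incorrect (`span.text()` is not valid XPath; text nodes are selected with `/text()`).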
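To make the tbody point concrete, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors so that it runs offline, and the `RAW_HTML` snippet is a hypothetical, heavily simplified stand-in for the page source, not the real Wikipedia markup. The idea is the same as in Scrapy: a path containing `tbody` matches nothing when the raw source has no such element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
RAW_HTML = """
<div id="mw-content-text">
  <table>
    <tr><td><a>Abu Dhabi</a></td><td><b><a>United Arab Emirates</a></b></td></tr>
    <tr><td><a>Abuja</a></td><td><b><a>Nigeria</a></b></td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(RAW_HTML)  # root is the <div> element

# Path as copied from the browser inspector, which displays an inserted <tbody>.
rows_with_tbody = root.findall("./table/tbody/tr")

# Path that matches the markup as it was actually sent.
rows_without_tbody = root.findall("./table/tr")

print(len(rows_with_tbody))     # 0
print(len(rows_without_tbody))  # 2
```

With Scrapy itself, a quick way to check is to open the page in `scrapy shell` and try both paths against `response`, since that object holds the source as downloaded rather than the browser's DOM.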

I tested with the parse method below and it works fine for me. I also changed the country XPath so that it extracts the country text.

def parse(self, response):
        for sel in response.xpath('//*[@id="mw-content-text"]/table[2]/tr'):
                country = sel.xpath('td[1]/a/text()').extract()
                capital = sel.xpath('td[2]//a/text()').extract()
                print country , capital

Partial sample output:

[u'Abu Dhabi'] [u'United Arab Emirates']
[u'Abuja'] [u'Nigeria']
[u'Accra'] [u'Ghana']
[u'Adamstown'] [u'Pitcairn Islands']
[u'Addis Ababa'] [u'Ethiopia']
[u'Algiers'] [u'Algeria']
[u'Alofi'] [u'Niue']
[u'Amman'] [u'Jordan']
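The parse method above relies on paths that are relative to each row (`td[1]/a`, `td[2]//a`) rather than paths starting with `//`, which would search from the document root on every iteration. The sketch below shows the same row-relative pattern with the standard library's `xml.etree.ElementTree`, so it can be run without Scrapy or network access. The `SAMPLE` table and the `extract_pairs` helper are hypothetical and only for illustration. Judging by the sample output, the first column holds the city and the second the country, so the names here follow that order.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row stand-in for the Wikipedia table (not the real markup).
SAMPLE = """
<table>
  <tr><td><a>Abu Dhabi</a></td><td><b><a>United Arab Emirates</a></b></td></tr>
  <tr><td><a>Abuja</a></td><td><b><a>Nigeria</a></b></td></tr>
</table>
""".strip()


def extract_pairs(table):
    """Return (city, country) tuples, one per row that has both links."""
    pairs = []
    for row in table.findall("./tr"):
        # Both paths are evaluated relative to the current row.
        city = row.findtext("td[1]/a")       # <a> directly inside the first cell
        country = row.findtext("td[2]//a")   # first <a> anywhere below the second cell
        if city and country:
            pairs.append((city, country))
    return pairs


pairs = extract_pairs(ET.fromstring(SAMPLE))
print(pairs)
# [('Abu Dhabi', 'United Arab Emirates'), ('Abuja', 'Nigeria')]
```

In a Scrapy spider the equivalent calls would be `sel.xpath('td[1]/a/text()')` and `sel.xpath('td[2]//a/text()')` on each row selector, as in the answer's code.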

Archived:	9 years, 9 months ago
Views:	771
Last activity:	9 years, 9 months ago