Scrapy - extracting items from a table

stu*_*ray 7 xpath scrapy

Trying to get my head around Scrapy, but I've hit a few dead ends.

I have two tables on a page and want to extract the data from each one, then move on to the next page.

The tables look like this (the first is called Y1, the second Y2) and share the same structure.

<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
                                <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">                    

                <table class="table table-striped table-hover table-curved">
                    <thead>
                        <tr>
                            <th class="tCol1" style="padding: 10px;">First Col Head</th>
                            <th class="tCol2" style="padding: 10px;">Second Col Head</th>
                            <th class="tCol3" style="padding: 10px;">Third Col Head</th>
                        </tr>
                    </thead>
                    <tbody>

                        <tr>
                            <td>Info 1</td>
                            <td>Monday 5 September, 2016</td>
                            <td>Friday 21 October, 2016</td>
                        </tr>
                        <tr class="vevent">
                            <td class="summary"><b>Info 2</b></td>
                            <td class="dtstart" timestamp="1477094400"><b></b></td>
                            <td class="dtend" timestamp="1477785600">
                            <b>Sunday 30 October, 2016</b></td>
                        </tr>
                        <tr>
                            <td>Info 3</td>
                            <td>Monday 31 October, 2016</td>
                            <td>Tuesday 20 December, 2016</td>
                        </tr>


                    <tr class="vevent">
                        <td class="summary"><b>Info 4</b></td>                      
                        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
                        <td class="dtend" timestamp="1483315200">
                        <b>Monday 2 January, 2017</b></td>
                    </tr>



                </tbody>
            </table>

As you can see, the structure is a little inconsistent, but as long as I can get every td and output it to CSV I'll be happy.

I tried using XPath, but that only confused me further.

My last attempt:

import scrapy

from SchoolDates_1.items import Schooldates1Item


class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = ('https://mysite.co.uk/page1/',)

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item

There are no errors here, but it just spits out a lot of information about the crawl and no actual results.

Update:

  import scrapy

       class SchoolSpider(scrapy.Spider):
name = "school"

allowed_domains = ["termdates.co.uk"]
start_urls =    (
                'https://termdates.co.uk/school-holidays-16-19-abingdon/',
                )

  def parse_products(self, response):
  products = sel.xpath('//*[@id="Year1"]/table//tr')
 for p in products[1:]:
  item = dict()
  item['hol'] = p.xpath('td[1]/text()').extract_first()
  item['first'] = p.xpath('td[1]/text()').extract_first()
  item['last'] = p.xpath('td[1]/text()').extract_first()
  yield item

This gives me: IndentationError: unexpected indent

If I run the modified script below (thanks @Granitosaurus) and output to CSV (-o schoolDates.csv), I get an empty file:

import scrapy

class SchoolSpider(scrapy.Spider):
name = "school"
allowed_domains = ["termdates.co.uk"]
start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

def parse_products(self, response):
    products = sel.xpath('//*[@id="Year1"]/table//tr')
    for p in products[1:]:
        item = dict()
        item['hol'] = p.xpath('td[1]/text()').extract_first()
        item['first'] = p.xpath('td[1]/text()').extract_first()
        item['last'] = p.xpath('td[1]/text()').extract_first()
        yield item

This is the log:

2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 850000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
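The traceback explains the empty file: Scrapy dispatches responses for start_urls to a callback named parse, and the base Spider.parse raises NotImplementedError when a spider doesn't override it, so parse_products is never called. Update 2 below fixes this by renaming the method; an alternative sketch that keeps the custom name by routing the request explicitly (assumed, not from the thread):

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]

    def start_requests(self):
        # route the response to the custom callback instead of
        # relying on the default 'parse' method
        yield scrapy.Request(
            'https://termdates.co.uk/school-holidays-16-19-abingdon/',
            callback=self.parse_products,
        )

    def parse_products(self, response):
        # the selector lives on the response object; 'sel' in the
        # script above is never defined
        for p in response.xpath('//*[@id="Year1"]/table//tr')[1:]:
            yield {
                'hol': p.xpath('td[1]/text()').extract_first(),
                'first': p.xpath('td[2]/text()').extract_first(),
                'last': p.xpath('td[3]/text()').extract_first(),
            }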

Update 2: (skipped rows) This pushes the results to a CSV file, but skips every other row.

The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t\t', 'first': None}

import scrapy

class SchoolSpider(scrapy.Spider):
name = "school"
allowed_domains = ["termdates.co.uk"]
start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

def parse(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    for p in products[1:]:
        item = dict()
        item['hol'] = p.xpath('td[1]/text()').extract_first()
        item['first'] = p.xpath('td[2]/text()').extract_first()
        item['last'] = p.xpath('td[3]/text()').extract_first()
        yield item
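The skipped rows are the ones marked class="vevent" in the HTML at the top: their cells wrap the text in <b> tags, so td[1]/text() matches no direct text node (hence 'hol': None) or only the whitespace before the <b> (hence the '\r\n\t...' value). Descending with // instead of / also picks up text inside child elements; a quick check along those lines in scrapy shell (row assumed bound to one of the vevent rows):

row.xpath('td[1]/text()').extract_first()   # None: the cell's text lives inside <b>
row.xpath('td[1]//text()').extract_first()  # u'Info 2': // descends into children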

Solution: Thanks to @vold, this crawls all the pages in start_urls and handles the inconsistent table layout: td[n]//text() also collects text nested inside <b> tags, and the join/strip cleans out the whitespace-only nodes.

# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item


class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
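The thread never shows the item class that SchoolDates_1.items imports; for completeness, a minimal sketch matching the fields used above (field names taken from the code, module layout assumed):

# SchoolDates_1/items.py (assumed location)
import scrapy

class Schooldates1Item(scrapy.Item):
    hol = scrapy.Field()
    first = scrapy.Field()
    last = scrapy.Field()
    url = scrapy.Field()

With that in place, scrapy crawl school -o schoolDates.csv exports the rows as in the earlier update.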

vol*_*old 9

You need to correct your code a little. Since you already select all the elements within the table, you don't need to point at the table again, so you can shorten your XPath to something like td[1]//text().

def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item

Edited my answer since @stutray provided the link to the site.


Uma*_*air 5

You can use CSS selectors instead of XPath; I always find CSS selectors easier.

def parse_products(self, response):
    # iterate the rows of the table, skipping the header row
    for row in response.css("#Y1 table tr")[1:]:
        item = Schooldates1Item()
        item['hol'] = row.css('td:nth-child(1)::text').extract_first()
        item['first'] = row.css('td:nth-child(2)::text').extract_first()
        item['last'] = row.css('td:nth-child(3)::text').extract_first()
        yield item

Also, don't use the tbody tag in your selectors. Source:

    "Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions."
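A quick way to check this in scrapy shell, using the page from the question (whether the raw HTML actually contains <tbody> is exactly what a browser inspector can't tell you):

# scrapy shell 'https://termdates.co.uk/school-holidays-16-19-abingdon/'
# An inspector-derived path only matches if the raw HTML really has <tbody>:
response.xpath('//*[@id="Year1"]/table/tbody/tr')
# The descendant axis finds the rows whether or not <tbody> is present:
response.xpath('//*[@id="Year1"]/table//tr')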