Trying to get my head around Scrapy, but I've hit a few dead ends.
I have two tables on a page and would like to extract the data from each one, then move on to the next page.
The tables look like this (the first is called Y1, the second Y2) and share the same structure.
<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
<h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
<table class="table table-striped table-hover table-curved">
<thead>
<tr>
<th class="tCol1" style="padding: 10px;">First Col Head</th>
<th class="tCol2" style="padding: 10px;">Second Col Head</th>
<th class="tCol3" style="padding: 10px;">Third Col Head</th>
</tr>
</thead>
<tbody>
<tr>
<td>Info 1</td>
<td>Monday 5 September, 2016</td>
<td>Friday 21 October, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 2</b></td>
<td class="dtstart" timestamp="1477094400"><b></b></td>
<td class="dtend" timestamp="1477785600">
<b>Sunday 30 October, 2016</b></td>
</tr>
<tr>
<td>Info 3</td>
<td>Monday 31 October, 2016</td>
<td>Tuesday 20 December, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 4</b></td>
<td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
<td class="dtend" timestamp="1483315200">
<b>Monday 2 January, 2017</b></td>
</tr>
</tbody>
</table>
As you can see, the structure is a little inconsistent, but as long as I can get every td and output it to CSV, I'll be happy.
I tried using XPath, but that only confused me more.
My latest attempt:
import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item
No errors here, but it just spits out lots of information about the crawl with no actual results.
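One likely cause (my own diagnosis, not stated in the original thread): unless a request explicitly names a different callback, Scrapy only calls a method named parse on the responses from start_urls, so parse_products above is never invoked and the crawl finishes without yielding anything. A minimal sketch of the rename, reusing the question's URL and selectors and yielding plain dicts to keep it self-contained:

import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = ('https://mysite.co.uk/page1/',)

    # parse() is the default callback for start_urls responses,
    # so Scrapy actually runs this method for each page
    def parse(self, response):
        # iterate over the rows of the Y1 table, skipping the header row
        for row in response.xpath('//*[@id="Y1"]/table//tr')[1:]:
            yield {
                'hol': row.xpath('td[1]//text()').extract_first(),
                'first': row.xpath('td[2]//text()').extract_first(),
                'last': row.xpath('td[3]//text()').extract_first(),
            }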
Update:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = (
        'https://termdates.co.uk/school-holidays-16-19-abingdon/',
    )

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This gives me: IndentationError: unexpected indent
If I run the modified script below (thanks @Granitosaurus) with output to CSV (-o schoolDates.csv), I get an empty file:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
Here is the log:
Update 2: (skipped rows) This pushes results to the CSV file but skips every other row.
The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t\t', 'first': None}
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
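The skipped rows follow from the markup at the top of the question (my reading, not from the thread): in the vevent rows the cell text sits inside a nested <b> element, so td[2]/text(), which only returns direct text children of the td, comes back empty, while td[2]//text() also descends into child elements. A quick check with a standalone Selector, using a cell from the question's HTML:

from scrapy import Selector

# a stripped-down vevent cell from the question's table
sel = Selector(text='<table><tr><td class="summary"><b>Info 2</b></td></tr></table>')

sel.xpath('//td[1]/text()').extract_first()   # None - the td has no direct text node
sel.xpath('//td[1]//text()').extract_first()  # 'Info 2' - descends into the <b>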
SOLUTION: Thanks to @vold. This crawls all the pages in start_urls and handles the inconsistent table layouts:
# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
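For reference, assuming the project layout the import above implies (a SchoolDates_1 project with Schooldates1Item declared in its items.py), the CSV mentioned earlier can then be produced with:

scrapy crawl school -o schoolDates.csv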
You need to correct your code a little. Since you already select all elements within the table, there is no need to point to the table again. Thus you can shorten your XPath to something like td[1]//text().
def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item
I've edited my answer since @stutray provided the link to the site.
You can use CSS selectors instead of XPaths; I always find CSS selectors easy.
def parse_products(self, response):
    # iterate over the table rows, skipping the header row
    for product in response.css("#Y1 table tr")[1:]:
        item = Schooldates1Item()
        item['hol'] = product.css('td:nth-child(1)::text').extract_first()
        item['first'] = product.css('td:nth-child(2)::text').extract_first()
        item['last'] = product.css('td:nth-child(3)::text').extract_first()
        yield item

Also, don't use the tbody tag in your selectors. Source:
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
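To see why this matters, here is a minimal sketch of mine (not part of the original answer): feed a tbody-less table, as a server would typically send it, into a standalone Selector and compare the two expressions:

from scrapy import Selector

# table as served by the site: no <tbody> in the raw HTML
sel = Selector(text='<table><tr><td>Info 1</td></tr></table>')

sel.xpath('//table/tbody/tr/td/text()').extract()  # [] - the tbody only exists in the browser's DOM
sel.xpath('//table//tr/td/text()').extract()       # ['Info 1']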