Adi*_*tya 7 javascript python regex scrapy web-scraping
My items.py file looks like this:
from scrapy.item import Item, Field

class SpiItem(Item):
    title = Field()
    lat = Field()
    lng = Field()
    add = Field()
And the spider is:
import scrapy
import re
from spi.items import SpiItem

class HdfcSpider(scrapy.Spider):
    name = "hdfc"
    allowed_domains = ["hdfc.com"]
    start_urls = ["http://hdfc.com/branch-locator"]

    def parse(self,response):
        addresses = response.xpath('//script')
        for sel in addresses:
            item = SpiItem()
            item['title'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="title":).+(?=")')
            item['lat'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="latitude":).+(?=")')
            item['lng'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="longitude":).+(?=")')
            item['add'] = sel.xpath('//script[@type="text/javascript"][1]').re('(?<="html":).+(?=")')
            yield item
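When I view the page source, the entire JavaScript code is located at: //html/body/table/tbody/tr[348]/td[2].
Why doesn't my code work? I only want to extract the four fields defined in the items file.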
ale*_*cxe 12
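Instead of extracting the data field by field with regular expressions, extract the whole locations object, load it with json.loads(), and pull the data you need from the resulting Python dictionary: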
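import json
import re

# parse() is a method of the spider class
def parse(self, response):
    # Capture the object literal assigned to "var locations" in the inline script
    pattern = re.compile(r"var locations= ({.*?});", re.MULTILINE | re.DOTALL)
    locations = response.xpath('//script[contains(., "var locations")]/text()').re(pattern)[0]
    # Parse the captured JSON text into a Python dictionary
    locations = json.loads(locations)
    for title, data in locations.items():
        print(title)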
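To show how the parsed dictionary could be mapped onto the four fields from items.py, here is a minimal sketch that runs without Scrapy. The sample script text is made up, and the key names "title", "latitude", "longitude" and "html" are an assumption taken from the regexes in the question, so the real page may be structured differently. The helper names extract_locations and to_rows are hypothetical as well.

```python
import json
import re

# Hypothetical stand-in for the inline <script> text. The structure and the
# key names are assumed from the regexes in the question, not from the site.
SAMPLE_SCRIPT = '''
var locations= {
  "Branch A": {"title": "Branch A", "latitude": "19.0760", "longitude": "72.8777", "html": "<p>1 Example Road</p>"},
  "Branch B": {"title": "Branch B", "latitude": "28.6139", "longitude": "77.2090", "html": "<p>2 Sample Street</p>"}
};
'''

# Same pattern as in the answer: capture the object literal up to the first "};"
LOCATIONS_RE = re.compile(r"var locations= ({.*?});", re.MULTILINE | re.DOTALL)


def extract_locations(script_text):
    """Return the parsed locations object, or an empty dict if it is absent."""
    match = LOCATIONS_RE.search(script_text)
    if match is None:
        return {}
    return json.loads(match.group(1))


def to_rows(locations):
    """Map each location entry to the four field names used in items.py."""
    rows = []
    for title, data in locations.items():
        rows.append({
            "title": data.get("title", title),  # fall back to the dict key
            "lat": data.get("latitude"),
            "lng": data.get("longitude"),
            "add": data.get("html"),
        })
    return rows


rows = to_rows(extract_locations(SAMPLE_SCRIPT))
for row in rows:
    print(row)
```

Inside the spider, each row could then be turned into an item with something like yield SpiItem(**row) instead of being printed. Note that the non-greedy pattern stops at the first "};" it meets, so this only works if that sequence does not appear inside the data itself.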