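I'm trying to crawl a website, and my spider (I don't know why) is picking up my links in a completely jumbled order!
It crawls all the links I want, but it only stores the first one (rank and url_seller, for example)... I'm new to the world of crawling, Python, and Scrapy, but all I want is to learn! I'm posting my code here; can anyone help me?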
# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from MarketplacePulse.items import MarketplacepulseItem
import urllib.parse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MarketplacePulseSpider(CrawlSpider):
    name = 'MP_A_UK'
    allowed_domains = ['marketplacepulse.com', 'amazon.co.uk']
    start_urls = ['https://www.marketplacepulse.com/amazon/uk']

    def parse(self, response):
        item = MarketplacepulseItem()
        rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
        print('\n', rank, '\n')
        url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
        print('\n', url_1, '\n')
        for i in range(len(rank)-2):
            item['month_rank'] = ''.join(rank[i]).strip()
            item['year_rank'] = ''.join(rank[i+1]).strip()
            item['lifetime_rank'] = ''.join(rank[i+2]).strip()
            i += 3
        for i in range(len(url_1)):
            url_tmp …
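One thing that may explain why only a single record is stored: `item` is created once and overwritten on every pass, and `i += 3` has no effect inside `for i in range(...)` because `range` reassigns `i` on each iteration. The sketch below is plain Python, not the actual spider. It assumes the rank list is flat with three values per seller (month, year, lifetime) and that the URL list has one entry per seller in the same order; `build_rows` and the sample data are hypothetical names used only for illustration.

```python
def build_rows(ranks, urls):
    """Yield one new dict per seller row instead of reusing a single item."""
    # Assumption: `ranks` holds three values per seller (month, year, lifetime)
    # and `urls` holds one link per seller, in the same order.
    for row, start in enumerate(range(0, len(ranks) - 2, 3)):
        if row >= len(urls):
            break  # stop if there are fewer links than rank triples
        yield {
            'month_rank': ranks[start].strip(),
            'year_rank': ranks[start + 1].strip(),
            'lifetime_rank': ranks[start + 2].strip(),
            'url_seller': urls[row],
        }


# Hypothetical sample data standing in for the two XPath results.
sample_ranks = [' 1 ', '2', '3', '4', '5', '6']
sample_urls = ['/seller/a', '/seller/b']
rows = list(build_rows(sample_ranks, sample_urls))
```

In a Scrapy callback the same idea would mean creating a fresh item inside the loop and yielding it there, so that each row reaches the pipeline. Separately, the Scrapy documentation advises against using `parse` as a callback name in a `CrawlSpider`, since that class uses `parse` internally.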