Crawling too many links at once

Tags: python, parsing, web-crawler, scrapy

I'm trying to crawl a website, and (I don't know why) my spider is making a mess of my links!

It crawls all the links I want, but it only stores the first one (the rank and url_seller, for example)... I'm new to the world of crawling, Python and Scrapy, but all I want is to learn!! I'm posting my code here; can anybody help me?

# -*- coding: utf-8 -*-
import scrapy
import re
import numbers
from MarketplacePulse.items import MarketplacepulseItem
import urllib.parse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MarketplacePulseSpider(CrawlSpider):
    name = 'MP_A_UK'
    allowed_domains = ['marketplacepulse.com', 'amazon.co.uk']
    start_urls = ['https://www.marketplacepulse.com/amazon/uk']

    def parse(self, response):
        item = MarketplacepulseItem()

        # Pull every rank cell and every seller-page link from the ranking table
        rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
        print('\n', rank, '\n')
        url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
        print('\n', url_1, '\n')
        for i in range(len(rank)-2):
            item['month_rank'] = ''.join(rank[i]).strip()
            item['year_rank'] = ''.join(rank[i+1]).strip()
            item['lifetime_rank'] = ''.join(rank[i+2]).strip()
            i += 3

        for i in range(len(url_1)):
            url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com',url_1[i])
            yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        item = response.meta['item']

        # The Amazon.co.uk storefront link on the seller's Marketplace Pulse page
        url_2 = response.xpath('//body/div/section/div/div/div/p/a[contains(text(), "Amazon.co.uk")]/@href').extract()

        item['url_seller'] = ''.join(url_2).strip()
        yield scrapy.Request(str(url_2), callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']

        # Seller details from the Amazon storefront page
        business_name = response.xpath('//div[@class="a-row a-spacing-medium"]/div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Business Name:"]/following-sibling::text()').extract()
        phone_number = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[@class="a-list-item"]/span[.="Phone number:"]/following-sibling::text()').extract()
        address = response.xpath('//div[@class="a-column a-span6"]/ul[@class="a-unordered-list a-nostyle a-vertical"]/li//span[span[contains(.,"Address:")]]/ul//li//text()').extract()

        item['business_name'] = ''.join(business_name).strip()
        item['phone_number'] = ''.join(phone_number).strip()
        item['address'] = '\n'.join(address).strip()
        yield item

I'm also posting an example of what I want and of what I'm getting... you'll see the problem, I hope!

What I want:

2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A7CL6GT0UVQKS&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
 'business_name': 'Better World Books Marketplace Inc',
 'lifetime_rank': '863',
 'month_rank': '218',
 'phone_number': '',
 'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
 'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=W5VG5JB9GHYUG&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
            'SHENZHEN\n'
            'GUANGDONG\n'
            '518000\n'
            'CN\n'
            'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
            'SHENZHEN\n'
            'GUANGDONG\n'
            '518000\n'
            'CN',
 'business_name': 'MUDDER TECHNOLOGY CO., LTD',
 'lifetime_rank': '3',
 'month_rank': '28',
 'phone_number': '86 18565729081',
 'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=W5VG5JB9GHYUG&tag=mk4343k-21',
 'year_rank': '10'}

What I get:

2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A20T907OQC02JJ&tab=&vasStoreID=>
{'address': '55740 Currant Rd\nMishawaka\nIndiana\n46545\nUS',
 'business_name': 'Better World Books Marketplace Inc',
 'lifetime_rank': '863',
 'month_rank': '218',
 'phone_number': '',
 'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
 'year_rank': '100'}
2017-07-18 11:28:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.co.uk/sp?_encoding=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=A1F83G8C2ARO7P&orderID=&seller=A1XG2K8M6HRQZ8&tab=&vasStoreID=>
{'address': 'ROOM 919, BLOCK 2 West, SEG TECHNOLOGY PARK\n'
            'SHENZHEN\n'
            'GUANGDONG\n'
            '518000\n'
            'CN\n'
            'FU TIAN QU QIAO XIANG LU HAO FENG YUAN 7 DONG 7A\n'
            'SHENZHEN\n'
            'GUANGDONG\n'
            '518000\n'
            'CN',
 'business_name': 'MUDDER TECHNOLOGY CO., LTD',
 'lifetime_rank': '863',
 'month_rank': '218',
 'phone_number': '86 18565729081',
 'url_seller': 'https://www.amazon.co.uk/gp/aag/main?seller=A7CL6GT0UVQKS&tag=mk4343k-21',
 'year_rank': '100'}

You can see that url_seller is exactly the same in both items, and so are the ranks (month, year and lifetime)... but I want them to be different..... Also, url_seller is not the link I actually crawled, even though it should be the same..... Any help, please?

Answer:

The rank problem

I'll walk through it step by step:

  • You get a list of ranks:

    rank = response.xpath('//div/table/tbody/tr/td[@class="number"]/text()').extract()
    
  • You get a list of URLs:

    url_1 = response.xpath('//div/table/tbody/tr/td/a/@href').extract()
    
  • Here's where you go wrong:

    for i in range(len(rank)-2):
        item['month_rank'] = ''.join(rank[i]).strip()
        item['year_rank'] = ''.join(rank[i+1]).strip()
        item['lifetime_rank'] = ''.join(rank[i+2]).strip()
        i += 3
    

    First of all, because you're using a for loop, your i variable is reset to the next value from range() at the start of every iteration, so you step through the ranks one at a time instead of three at a time. Sorry, but that i += 3 does nothing. (See the sketch right after this list.)

    In any case, the purpose of the loop seems to be to build the following structure:

    {'month_rank': <rank>, 'year_rank': <rank>, 'lifetime_rank': <rank>}
    

    So... secondly, every time the loop runs you overwrite the previous set of values without ever doing anything with them. Oops.

  • Then you go on to loop through your list of URLs, passing the last set of ranks built by the loop above, along with each URL, to your parse_2 function:

    for i in range(len(url_1)):
        url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com',url_1[i])
        yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})
    

    The net result is that every call to parse_2 receives the same set of rank data.
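
To see both problems in isolation, here's a minimal, self-contained sketch (plain Python with made-up rank values, no Scrapy involved):

    # Six fake rank strings: two sellers' worth of month/year/lifetime ranks.
    rank = ['218', '100', '863', '28', '10', '3']
    item = {}

    for i in range(len(rank) - 2):
        item['month_rank'] = rank[i].strip()
        item['year_rank'] = rank[i + 1].strip()
        item['lifetime_rank'] = rank[i + 2].strip()
        i += 3  # no effect: 'for' rebinds i from range() on the next pass

    # The loop ran len(rank) - 2 == 4 times, not twice, and 'item' keeps
    # only what the final iteration wrote:
    print(item)  # {'month_rank': '28', 'year_rank': '10', 'lifetime_rank': '3'}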

To fix this, you should handle each URL and its associated ranks in the same loop:

    for i in range(len(url_1)):
        url_tmp = urllib.parse.urljoin('https://www.marketplacepulse.com', url_1[i])

        # each seller contributes three consecutive ranks: month, year, lifetime
        item['month_rank'] = ''.join(rank[i*3]).strip()
        item['year_rank'] = ''.join(rank[i*3+1]).strip()
        item['lifetime_rank'] = ''.join(rank[i*3+2]).strip()

        yield scrapy.Request(url_tmp, callback=self.parse_2, meta={'item': item})

That should fix your rank problem.
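
If it helps to see why the i*3 offsets line up, here's a quick sketch with made-up data (the URL paths are hypothetical): URL 0 pairs with ranks 0, 1 and 2, URL 1 with ranks 3, 4 and 5, and so on:

    # One link per seller, three ranks per seller, in table order.
    url_1 = ['/amazon/uk/seller-a', '/amazon/uk/seller-b']  # hypothetical paths
    rank = ['218', '100', '863', '28', '10', '3']

    for i in range(len(url_1)):
        print(url_1[i], rank[i*3], rank[i*3+1], rank[i*3+2])
    # /amazon/uk/seller-a 218 100 863
    # /amazon/uk/seller-b 28 10 3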


The url_seller problem

I don't understand the url_seller problem as well, because it looks like you should be getting the same URL for item['url_seller'] as for the call to parse_3, and it appears that parse_3 is being called with the correct information, yet item['url_seller'] keeps getting the same value over and over again.

I'm going out on a limb here, because if I understand the situation correctly, both expressions should (in what I believe is this particular case) produce the same string, but the only difference I've noticed so far is that in one place you're using ''.join(url_2).strip() and in the other str(url_2).

Since the part where you're using str(url_2) seems to be working correctly, perhaps you should try using it in the other place as well:

    item['url_seller'] = str(url_2)
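
If you want to check what each conversion actually yields on the raw extract() result, here's a standalone sketch (the URL is made up; note that str() of a list keeps the brackets and quotes):

    url_2 = ['https://www.amazon.co.uk/gp/aag/main?seller=EXAMPLE']  # hypothetical
    print(''.join(url_2).strip())  # https://www.amazon.co.uk/gp/aag/main?seller=EXAMPLE
    print(str(url_2))              # ['https://www.amazon.co.uk/gp/aag/main?seller=EXAMPLE']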