Scrapy: Request url must be str or unicode, got Selector

Question

RpB*_*RpB 2 screen-scraping scrapy python-2.7

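I am writing a spider with Scrapy to scrape user details from Pinterest. I am trying to get the details of a user and of that user's followers (and so on, down to the last node).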

Here is the spider code:

from scrapy.spider import BaseSpider
import scrapy
from pinners.items import PinterestItem
from scrapy.http import FormRequest
from urlparse import urlparse

class Sample(BaseSpider):
    name = 'sample'
    allowed_domains = ['pinterest.com']
    start_urls = ['https://www.pinterest.com/banka/followers', ]

    def parse(self, response):
        for base_url in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
            list_a = response.urljoin(base_url.extract())
            for new_urls in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
                yield scrapy.Request(new_urls, callback=self.Next)
        yield scrapy.Request(list_a, callback=self.Next)

    def Next(self, response):
        href_base = response.xpath('//div[@class = "tabs"]/ul/li/a')
        href_board = href_base.xpath('//div[@class="BoardCount Module"]')
        href_pin = href_base.xpath('.//div[@class="Module PinCount"]')
        href_like = href_base.xpath('.//div[@class="LikeCount Module"]')
        href_followers = href_base.xpath('.//div[@class="FollowerCount Module"]')
        href_following = href_base.xpath('.//div[@class="FollowingCount Module"]')
        item = PinterestItem()
        item["Board_Count"] = href_board.xpath('.//span[@class="value"]/text()').extract()[0]
        item["Pin_Count"] = href_pin.xpath('.//span[@class="value"]/text()').extract()
        item["Like_Count"] = href_like.xpath('.//span[@class="value"]/text()').extract()
        item["Followers_Count"] = href_followers.xpath('.//span[@class="value"]/text()').extract()
        item["Following_Count"] = href_following.xpath('.//span[@class="value"]/text()').extract()
        item["User_ID"] = response.xpath('//link[@rel="canonical"]/@href').extract()[0]
        yield item

I get the following error:

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got Selector:

I did check the type of list_a (the extracted URL). It gives me unicode.

Answer 1

小智 6

The error is raised by the inner for loop in the parse method:

for new_urls in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
    yield scrapy.Request(new_urls, callback=self.Next)

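The new_urls variable is actually a Selector, not a URL string, because iterating over response.xpath() yields Selector objects. Extract the string first, like this: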

for base_url in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
    # extract() turns the Selector into a unicode string; urljoin() makes it absolute
    list_a = response.urljoin(base_url.extract())
    yield scrapy.Request(list_a, callback=self.Next)


Archived:	9 years, 3 months ago
Views:	3039
Last activity:	8 years, 1 month ago