Web scraping: McKinsey articles

jwa*_*man 0 python scrapy web-scraping web

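I am trying to scrape the article titles, but I don't know how to extract the title text. Could you look at my code below and suggest a solution?

I'm new to this. Thanks for your help!

Screenshot of the page in the browser's web developer view: https://imgur.com/a/O1lLquY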

import scrapy



class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://www.mckinsey.com/search?q=Agile&start=1']

    def parse(self, response):
        for quote in response.css('div.text-wrapper'):
            item = {
                'text': quote.css('h3.headline::text').extract(),
            }
            print(item)
            yield item

vez*_*hik 5

Looks good for a novice developer! I only changed the selectors in your parse function:

for quote in response.css('div.block-list div.item'):
    yield {
        'text': quote.css('h3.headline::text').get(),
    }

UPD: Hmm, it looks like the site loads its results through an additional request.

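Open the developer tools and inspect the request to https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search, which is sent with the params {"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}. You can build a scrapy.Request with these params and the appropriate headers and get the data back as JSON. It can be parsed easily with the json lib.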
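
As a quick sanity check on those params, here is a minimal sketch using only the standard json module. It shows that a plain Python dict (with Python's False) serializes to the same JSON body the browser sends. The name search_params is just illustrative, and the compact separators are only there to match the browser's formatting.

```python
import json

# The params observed in the developer tools, written as a Python dict.
# Note that Python's False becomes JSON's false when serialized.
search_params = {
    "q": "Agile",
    "page": 1,
    "app": "",
    "sort": "default",
    "ignoreSpellSuggestion": False,
}

# Compact separators reproduce the browser's formatting; the default
# separators (with spaces) are equally valid JSON.
body = json.dumps(search_params, separators=(",", ":"))
print(body)
```

The resulting string is what goes into the body argument of the request.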

UPD2: As you can see from this curl command, curl 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search' -H 'content-type: application/json' --data-binary '{"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}' --compressed, we need to make the request this way:

from scrapy import Request
import json

data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
headers = {"content-type": "application/json"}
url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"
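# Goes inside a spider method such as start_requests. curl sends a POST
# when given --data-binary, so set method='POST' (Scrapy defaults to GET).
yield Request(url, method='POST', headers=headers, body=json.dumps(data), callback=self.parse_api)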

Then parse the response in the parse_api function:

def parse_api(self, response):
    data = json.loads(response.body)
    # and then extract what you need
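
For the "extract what you need" step, a rough sketch could look like the following. It assumes the response nests its rows under data -> results with a title field on each row, which is the shape the final solution in this answer relies on. The extract_titles helper and the sample body are made up for illustration and are not real API output.

```python
import json


def extract_titles(body):
    """Return the titles found in a search API response body (bytes or str)."""
    payload = json.loads(body)
    # Assumed shape: {"data": {"results": [{"title": ...}, ...]}}
    results = (payload.get("data") or {}).get("results") or []
    # Skip rows that have no title instead of yielding None values.
    return [row["title"] for row in results if row.get("title")]


# Hypothetical sample body, invented for this example.
sample = b'{"data": {"results": [{"title": "First article"}, {"title": "Second article"}, {"url": "/no-title"}]}}'
print(extract_titles(sample))
```

Inside the spider you would call it as extract_titles(response.body) and yield one item per title.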

So you can iterate over the page parameter in the request to fetch all pages.
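
The paging idea can be sketched outside Scrapy like this: keep increasing page until a page comes back with no results. Here fetch_page is a hypothetical stand-in for the network call, and fake_fetch with its canned pages exists only so the loop can be run. The max_pages cap is an added safety assumption, not something the API requires.

```python
import json


def iter_titles(fetch_page, query, max_pages=50):
    """Yield titles page by page until a page comes back empty.

    fetch_page is a hypothetical callable that takes the JSON request
    body (str) and returns the JSON response body (str or bytes).
    """
    for page in range(1, max_pages + 1):
        body = json.dumps({"q": query, "page": page, "app": "",
                           "sort": "default", "ignoreSpellSuggestion": False})
        payload = json.loads(fetch_page(body))
        results = (payload.get("data") or {}).get("results")
        if not results:
            return  # an empty page means there is nothing left to fetch
        for row in results:
            yield row.get("title")


def fake_fetch(body):
    """Stand-in for the real request: serves two canned pages, then nothing."""
    pages = {1: ["A", "B"], 2: ["C"]}
    page = json.loads(body)["page"]
    rows = [{"title": title} for title in pages.get(page, [])]
    return json.dumps({"data": {"results": rows}})


print(list(iter_titles(fake_fetch, "Agile")))
```

In a real spider the same stop condition applies: when a page has no results, stop yielding new requests.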

UPD3: A working solution:

from scrapy import Spider, Request
import json


class BrickSetSpider(Spider):
    name = "brickset_spider"

    data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
    headers = {"content-type": "application/json"}
    url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"

    def start_requests(self):
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': 1})

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('data', {}).get('results')
        if not results:
            return

        for row in results:
            yield {'title': row.get('title')}

        page = response.meta['page'] + 1
        self.data['page'] = page
        yield Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'page': page})