How can I programmatically get the CSV link behind a JavaScript page?

use*_*035 2 python beautifulsoup web-scraping

I'm using Python, and I'm trying to get the link that the CSV comes from when I click the DATA V CSV button at the bottom of this page.

I tried BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# Find the link to the CSV file
csv_link = soup.find('a', string='DATA V CSV').get('href')

I also tried:

soup.find("button", {"id":"DATA V CSV"})

but neither finds the link behind DATA V CSV.
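The likely reason both attempts fail is that the DATA V CSV control is injected by JavaScript after the page loads, so the static HTML that `requests` downloads never contains it. A minimal sketch with a stand-in HTML snippet (the real page's markup differs) shows what happens:

```python
from bs4 import BeautifulSoup

# Stand-in for the static HTML as requests sees it: the DATA V CSV
# element is added later by JavaScript, so it is absent from this tree.
static_html = "<html><body><div id='graph'></div></body></html>"
soup = BeautifulSoup(static_html, "html.parser")

link = soup.find("a", string="DATA V CSV")
print(link)  # None -> chaining .get('href') onto this raises AttributeError
```

This is why `soup.find(...).get('href')` crashes with `AttributeError: 'NoneType' object has no attribute 'get'` rather than returning a link.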

bad*_*ker 5

To get the data, you need to fully mimic the request the page sends to the server.

Here's how:

from shutil import copyfileobj
from urllib.parse import urlencode

import requests

# Headers copied from the browser's dev tools; the cookie values may expire
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "referer": "https://www.ceps.cz/en/all-data",
    "accept": "application/json, text/javascript, */*; q=0.01",
    "cookie": "nette-samesite=1; ARRAffinity=3ee2404f26d0149d946e50cb3d4c22661f9f3b6510837fa538c67990a81979de; ARRAffinitySameSite=3ee2404f26d0149d946e50cb3d4c22661f9f3b6510837fa538c67990a81979de"
}

# Query-string parameters captured from the page's AJAX request
payload = {
    "do": "loadGraphData",
    "method": "AktualniSystemovaOdchylkaCR",
    "graph_id": "1026",
    "move_graph": "day",
    "download": "csv",
    "date_to": "2023-03-28T23:59:59",
    "date_from": "2023-03-28T00:00:00",
    "agregation": "MI",
    "date_type": "day",
    "interval": "false",
    "version": "bla",
    "function": "AVG",
}

all_data = "https://www.ceps.cz/en/all-data"
download_url = "https://www.ceps.cz/download-data/?format=csv"

with requests.Session() as s:
    # First request mimics the page's AJAX call; it makes the server
    # prepare the CSV for this session.
    headers.update({"x-requested-with": "XMLHttpRequest"})
    r = s.get(f"{all_data}?{urlencode(payload)}", headers=headers)
    print(r.json()["result"])
    # Second request is a plain GET (no AJAX header) that streams the file.
    headers.pop("x-requested-with")
    with s.get(download_url, headers=headers, stream=True) as r, \
            open("data.csv", "wb") as f:
        copyfileobj(r.raw, f)

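The `date_from`/`date_to` fields in the payload are plain ISO-8601 timestamps, so you can generate them for any other day with the standard library rather than hardcoding them:

```python
from datetime import date, datetime, time

# Build the payload's date window for an arbitrary day
day = date(2023, 3, 28)
date_from = datetime.combine(day, time.min).isoformat()        # start of day
date_to = datetime.combine(day, time(23, 59, 59)).isoformat()  # end of day

print(date_from)  # 2023-03-28T00:00:00
print(date_to)    # 2023-03-28T23:59:59
```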

You should end up with a semicolon-delimited file that looks something like this:

(screenshot of the downloaded semicolon-delimited CSV)
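Since the file is semicolon-delimited rather than comma-delimited, remember to pass the delimiter explicitly when parsing it. A small sketch with hypothetical sample content (the real columns depend on the selected graph and date range):

```python
import csv

# Hypothetical two-line sample mimicking the downloaded file's layout
sample = "Date;System imbalance [MWh]\n28.03.2023 00:00;12.5\n"

# delimiter=";" is the key detail; the default "," would not split columns
rows = list(csv.reader(sample.splitlines(), delimiter=";"))
print(rows[0])  # header row, split on semicolons
```

The same applies if you load `data.csv` with pandas: pass `sep=";"` to `read_csv`.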