Beautiful Soup returns an empty string when the website has text

mav*_*ick 0 python beautifulsoup web-scraping python-requests

Consider this website: https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/

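I am looking for the content under the headings on the right-hand side. Here is my sample code, which should return the list of contents but returns an empty string: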

import requests as req
from bs4 import BeautifulSoup as bs

r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r, 'html.parser')

par = soup.find('h3', string='Facilities')

for sib in par.next_siblings:
    print(sib)

This returns:

<ul class="park_icon">
<div class="clearfix"></div>
</ul>

The website does not show any div element with that class. In addition, the list items are not captured.

bad*_*ker 6

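The facilities and the other information in that frame are loaded dynamically by JavaScript, so bs4 does not see them in the HTML source: they are simply not there.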
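One quick way to confirm this is to search the raw response text for a string that is visible in the browser. The sketch below runs offline on a small hand-written HTML snippet modelled on the output in the question; `SAMPLE_HTML` and `missing_from_source` are hypothetical names, not part of any library.

```python
# Hand-written stand-in for the raw HTML the server returns (an assumption,
# modelled on the empty <ul> shown in the question).
SAMPLE_HTML = """
<h3>Facilities</h3>
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
"""


def missing_from_source(html, expected):
    """Return the expected strings that do not occur in the raw HTML."""
    return [text for text in expected if text not in html]


# "Facilities" is in the static markup; "Boat Ramp" is absent from it.
print(missing_from_source(SAMPLE_HTML, ["Facilities", "Boat Ramp"]))
```

With the real page you would pass `requests.get(url).text` as `html`. Anything reported as missing cannot be found by bs4 either, since bs4 only parses the text it is given.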

However, you can query the endpoint directly and get all the information you need.

Here's how:

import json
import re
import time

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://dlnr.hawaii.gov/",
}

# The "_" query parameter carries the current Unix timestamp.
endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
# The body is wrapped as callback(...); so extract the JSON inside before parsing.
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))
print("\n".join(data["park info"]["facilities"]))

Output:

Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
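The unwrapping step in the code above can be tried offline. A minimal sketch, assuming the endpoint wraps its JSON as `callback(...);` as the regex in the answer implies; `unwrap_jsonp` and `SAMPLE_RESPONSE` are hypothetical names, and the sample holds only a trimmed subset of the data.

```python
import json
import re
from typing import Any


def unwrap_jsonp(text: str) -> Any:
    """Strip a JSONP wrapper such as callback(...); and parse the JSON inside."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Trimmed, hand-written sample in the shape the answer's regex expects.
SAMPLE_RESPONSE = 'callback({"park info": {"facilities": ["Boat Ramp", "Campsites"]}});'

data = unwrap_jsonp(SAMPLE_RESPONSE)
print("\n".join(data["park info"]["facilities"]))
```

Raising `ValueError` instead of calling `.group(1)` on a failed match gives a clearer error if the endpoint ever returns something else, such as an HTML error page.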

Here is the whole JSON:

{
  "park info": {
    "name": "Ahupua\u02bba \u02bbO Kahana State Park",
    "id": 57853,
    "island": "Oahu",
    "activities": [
      "Beachgoing",
      "Camping",
      "Dogs on Leash",
      "Fishing",
      "Hiking",
      "Hunting",
      "Sightseeing"
    ],
    "facilities": [
      "Boat Ramp",
      "Campsites",
      "Picnic table",
      "Restroom",
      "Showers",
      "Trash Cans",
      "Water Fountain"
    ],
    "prohibited": [
      "No Motorized Vehicles/ATV's",
      "No Alcoholic Beverages",
      "No Open Fires",
      "No Smoking",
      "No Commercial Activities"
    ],
    "hazards": [],
    "photos": [],
    "location": {
      "latitude": 21.556086,
      "longitude": -157.875579
    },
    "hiking": [
      {
        "name": "Nakoa Trail",
        "id": 17,
        "activities": [
          "Dogs on Leash",
          "Hiking",
          "Hunting",
          "Sightseeing"
        ],
        "facilities": [
          "No Drinking Water"
        ],
        "prohibited": [
          "No Bicycles",
          "No Open Fires",
          "No Littering/Dumping",
          "No Camping",
          "No Smoking"
        ],
        "hazards": [
          "Flash Flood"
        ],
        "photos": [],
        "location": {
          "latitude": 21.551087,
          "longitude": -157.881228
        },
        "has_google_street": false
      },
      {
        "name": "Kapa\u2018ele\u2018ele Trail",
        "id": 18,
        "activities": [
          "Dogs on Leash",
          "Hiking",
          "Sightseeing"
        ],
        "facilities": [
          "No Drinking Water",
          "Restroom",
          "Trash Cans"
        ],
        "prohibited": [
          "No Bicycles",
          "No Open Fires",
          "No Littering/Dumping",
          "No Camping",
          "No Smoking"
        ],
        "hazards": [],
        "photos": [],
        "location": {
          "latitude": 21.554744,
          "longitude": -157.876601
        },
        "has_google_street": false
      }
    ]
  }
}
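Because the response is nested, the per-trail details are reachable through the `hiking` list. A short sketch that maps each trail name to its hazards, using a hand-trimmed copy of the JSON above; `trail_hazards` and `SAMPLE` are hypothetical names.

```python
# Hand-trimmed copy of the JSON shown above; only the keys used here are kept.
SAMPLE = {
    "park info": {
        "name": "Ahupua\u02bba \u02bbO Kahana State Park",
        "hiking": [
            {"name": "Nakoa Trail", "hazards": ["Flash Flood"]},
            {"name": "Kapa\u2018ele\u2018ele Trail", "hazards": []},
        ],
    }
}


def trail_hazards(data: dict) -> dict:
    """Map each hiking trail's name to its list of hazards."""
    trails = data["park info"].get("hiking", [])
    return {trail["name"]: trail.get("hazards", []) for trail in trails}


for name, hazards in trail_hazards(SAMPLE).items():
    print(f"{name}: {', '.join(hazards) or 'none listed'}")
```

The same pattern works for the `activities`, `facilities`, and `prohibited` keys of each trail.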

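  • Your original attempt in the post above says nothing about scrapy. Once your question has been answered, you should not add new requirements to it. However, you can always create a new post describing any new problem, @maverick. (2 upvotes)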