Opening pages with Python and BeautifulSoup

Question

Bre*_*ott 5 python beautifulsoup web-scraping

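How can I use BeautifulSoup to open another page from a list of links? I followed this tutorial, but it does not explain how to open another page from the list. Also, how do I open an "a href" that is nested inside a class?

Here is my code: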

# coding: utf-8

import requests
from bs4 import BeautifulSoup

r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

for link in soup.find_all("a"):
    print link.text

for link in soup.find_all("a"):
    print link.text, link.get("href")

g_data = soup.find_all("div", {"class":"listing__left-column"})

for item in g_data:
    print item.contents

for item in g_data:
    print item.contents[0].text
    print link.get('href')

for item in g_data:
    print item.contents[0]

I am trying to collect the href from each company's title, then open it and scrape the data on that page.

Answer 1

Mar*_*ans 8

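I am still not sure where you are getting your HTML from, but if you are trying to extract all of the href values, the following approach should work, based on the image you posted: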

import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

Adding href=True to find_all() ensures that only a elements that have an href attribute are returned, so there is no need to test for the attribute afterwards.
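The effect of href=True can be seen on a small inline document. The following is a minimal sketch written for Python 3; the HTML snippet, the link targets, and the names with_href and without_filter are made up for illustration and are not taken from the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the listing page; the real markup may differ.
SAMPLE_HTML = """
<div class="listing__left-column">
  <a class="listing-name" href="/company/alpha">Alpha Ltd</a>
  <a class="listing-name">No link here</a>
  <a class="other" href="/about">About</a>
  <a class="listing-name" href="/company/beta">Beta Inc</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only the matching <a> elements that carry an href attribute.
with_href = [a["href"] for a in soup.find_all("a", class_="listing-name", href=True)]

# Without the filter, the <a> that has no href is matched as well.
without_filter = soup.find_all("a", class_="listing-name")

print(with_href)            # ['/company/alpha', '/company/beta']
print(len(without_filter))  # 3
```

Indexing a_tag['href'] on the element without an href would raise a KeyError, which is why filtering in find_all() is convenient here.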

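Just a warning: you may find that some websites lock you out after one or two attempts, because they can detect that you are accessing the site from a script rather than as a human. If you feel you are not getting the right responses, I recommend printing the HTML you get back to make sure it is still what you expect.

If you want to get the HTML for each of the links, you could use the following: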
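One way to make that check routine is to test the returned HTML for the elements you expect before parsing further. The sketch below is written for Python 3 and rests on assumptions: the helper names looks_like_listing_page and fetch_listing_page are hypothetical, the User-Agent string is only an example, and sending a browser-like header is not guaranteed to prevent a site from blocking a script.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header value; some sites respond differently to the default
# python-requests User-Agent. This is an assumption, not a guaranteed fix.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def looks_like_listing_page(html):
    """Return True if the HTML contains at least one listing link."""
    soup = BeautifulSoup(html, "html.parser")
    return bool(soup.find_all("a", class_="listing-name", href=True))


def fetch_listing_page(url):
    """Fetch a page and report when it does not look like the expected listing."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    if not looks_like_listing_page(response.text):
        # Show the start of the body so a block page or CAPTCHA is easy to spot.
        print("Unexpected response, first 500 characters:")
        print(response.text[:500])
    return response.text
```

Calling fetch_listing_page() with your search URL would then print a short excerpt whenever the expected listing links are missing, instead of failing silently later in the loop.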

import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content

Tested using Python 2.x. For Python 3.x, add parentheses to the print statements.
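The code above builds each URL by concatenating base_url and the href, which works when every href is a root-relative path such as /company/alpha. If the page mixes relative and absolute links, urljoin from the standard library handles both. This is a minimal sketch for Python 3; the href values are made up for illustration, and page_url reuses the placeholder address from the answer.

```python
from urllib.parse import urljoin

# Placeholder address of the page the links were found on, as in the answer above.
page_url = "http://www.mywebsite.com/search/"

# Hypothetical href values of the kinds a listing page might contain.
hrefs = ["/company/alpha", "details/beta", "http://www.example.org/gamma"]

# urljoin resolves each href against the page it came from.
resolved = [urljoin(page_url, href) for href in hrefs]

for url in resolved:
    print(url)
# http://www.mywebsite.com/company/alpha
# http://www.mywebsite.com/search/details/beta
# http://www.example.org/gamma
```

Each resolved URL can then be passed to requests.get() in place of base_url + a_tag['href'].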

Archived:	10 years, 5 months ago
Views:	8897
Last activity:	7 years, 6 months ago