I have an HTML page with several divs, for example:
<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2>
<div class="post-meta clearfix">
I need to get the value of every div with the class post-info-wrap. I am new to BeautifulSoup.
So I need these URLs:
and so on...
I tried:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])
This code doesn't seem to work. I'm new to BeautifulSoup. How can I extract the links?
You can use soup.find_all:
from bs4 import BeautifulSoup as soup
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
Output:
['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
link = i.find('a', href=True) does not always return an anchor tag (a); it may return None. So you need to check whether link is None and, if it is, continue the for loop; otherwise read the link's href value.
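To see the None case in isolation, here is a minimal, self-contained sketch; the second div in this made-up fragment deliberately contains no anchor, so find() returns None for it:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second div contains no <a> tag at all.
html = '''<div class="post-info-wrap"><a href="https://www.example.com/blog/111/this-is-1st-post/">post 1</a></div>
<div class="post-info-wrap"><span>no link here</span></div>'''

soup = BeautifulSoup(html, "html.parser")
results = [div.find('a', href=True)
           for div in soup.find_all('div', {'class': 'post-info-wrap'})]
print(results[0]['href'])  # the first div has an anchor
print(results[1])          # None: find() matched nothing in the second div
```

Calling results[1]['href'] here would raise a TypeError, which is exactly why the loop above checks for None before reading 'href'.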
Scraping the links from a URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scraping the links from an HTML string:
from bs4 import BeautifulSoup

html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
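As an alternative sketch on the same sample markup, a CSS selector that targets the anchors directly sidesteps the None check entirely, because select() only returns <a> tags that actually exist:

```python
from bs4 import BeautifulSoup

html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" rel="bookmark">sample post – example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" rel="bookmark">sample post – example 2 post</a></h2><div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")
# 'div.post-info-wrap a[href]' matches only anchors with an href
# inside a post-info-wrap div, so every match already has the attribute.
links = [a['href'] for a in soup.select('div.post-info-wrap a[href]')]
print(links)
```

Note that even though the sample divs are unclosed (so the parser nests the second one inside the first), select() still returns each matching anchor exactly once, in document order.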
Update:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
For the Chrome browser:
http://chromedriver.chromium.org/downloads
Installing the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
where '/usr/bin/chromedriver' is the path to the Chrome webdriver.