BeautifulSoup 不会返回页面上的所有元素

Question

BeautifulSoup 不会返回页面上的所有元素

Koh*_*ani 4 python google-chrome beautifulsoup web-scraping

我是网络抓取的新手，刚开始使用 BeautifulSoup。这是我的问题。

当您使用诸如“define:lucid”之类的搜索查询以这种方式在 Google 中查找某个词时，在大多数情况下，首页会出现一个显示含义和发音的面板。（显示在嵌入图像的左侧）

【谷歌默认词典示例】

我要自动抓取和收集的内容是含义的文本和存储发音的mp3数据的URL。手动使用 Chrome Inspector，这些很容易在它的“元素”部分找到，例如，检查器（显示在图像的右侧）显示 URL，其中存储了“lucid”（此处）发音的 mp3 数据.

但是，使用requests获取搜索结果的HTML内容并用BeautifulSoup解析，如下面的代码，soup只能获取到面板中的IPA“/?lu?s?d/”等少数内容和属性“形容词”就像下面的结果，我需要的内容都找不到，比如音频元素中的东西。

如果可能，我如何使用 BeautifulSoup 获取信息，否则哪些替代工具适合此任务？

PS 我认为谷歌词典的发音质量比任何其他词典网站的发音质量都要好。所以我想坚持下去。

代码：

import requests
from bs4 import BeautifulSoup

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query

r = requests.get(goog_search)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

Run Code Online (Sandbox Code Playgroud)

部分soup内容：

           </span>
           <span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
            /?lu?s?d/
           </span>
          </div>
         </h3>
         <table style="font-size:14px;width:100%">
          <tr>
           <td>
            <div style="color:#666;padding:5px 0">
             adjective
            </div>

Run Code Online (Sandbox Code Playgroud)

Answer 1

M3R*_*3RS 6

您运行的基本请求不会返回通过 JavaScript 呈现的页面部分。如果您在 Chrome 中右键单击并选择查看页面源，则音频链接不存在。解决方案：您可以通过selenium. 使用下面的代码，我得到了<audio>包含链接的标签。

你必须pip install selenium下载ChromeDriver并添加包含它的文件夹来PATH喜欢export PATH=$PATH:~/downloads/

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver

def render_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    #driver.quit()
    return r

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query

r = render_page(goog_search)

soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())

Run Code Online (Sandbox Code Playgroud)

归档时间：	8 年，5 月前
查看次数：	2938 次
最近记录：	8 年，5 月前