BeautifulSoup returns None even though the element exists

Question

sco*_*che 9 python beautifulsoup web-scraping

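I have gone through most of the solutions to similar questions, but have not found one that works. More importantly, I have not found an explanation of why this happens, other than that the site being scraped uses JavaScript or something similar.

I am trying to scrape the "Officials" table for a game from this page: http://www.pro-football-reference.com/boxscores/201309050den.htm

My code is: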

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
# Look up the table by its id attribute
officials = bsObj.findAll("table", {"id": "officials"})

for entry in officials:
    print(str(entry))

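For now I am just printing to the console, but I get an empty list with findAll, or None with find. I also tried the basic html.parser, with no luck.

Could someone with a better understanding of HTML tell me what is different about this web page? Thanks in advance!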

Answer 1

the*_*guy 5

Try this code:

from selenium import webdriver
import time
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))


driver.quit()

It prints:

<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>

Can you explain why this works? And thank you for your help! (2 upvotes)

Answer 2

Or *_*uan 3

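You can't see it because it isn't there. Try disabling JS and opening the page in a browser, and you will see that the table does not exist: the site does some JS DOM manipulation.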

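Your options are:

In your case, the HTML you want is there, just inside a comment; extract it from the comment with BeautifulSoup.
Use Selenium or an equivalent tool to render the JS (which is exactly what your browser does).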
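
The first option could look roughly like the sketch below. It is only an illustration under stated assumptions: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real markup may differ), and find_commented_table is a hypothetical helper name, not part of BeautifulSoup. It shows a direct lookup returning None while the same table is recovered by parsing the text of each comment node.

```python
from bs4 import BeautifulSoup, Comment

# Assumption: a simplified stand-in for the real page, where the table
# is shipped inside an HTML comment and only uncommented later by JS.
SAMPLE_HTML = """
<html><body>
<div id="all_officials">
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
<tr><th>Umpire</th><td>Roy Ellison</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_commented_table(html, table_id):
    """Return the first <table id=table_id> found inside any HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings in the tree, so select them by type.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment text as HTML and look for the table in it.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# A direct lookup misses the table because it is not part of the parsed tree.
direct_hit = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", id="officials")

table = find_commented_table(SAMPLE_HTML, "officials")
rows = [
    (tr.th.get_text(strip=True), tr.td.get_text(strip=True))
    for tr in table.find_all("tr")
]
print(direct_hit)
print(rows)
```

With the real page, the string returned by urlopen(url).read() would take the place of SAMPLE_HTML; whether the table is still delivered inside a comment is something to check in the page source first.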
