Scraping a difficult table

Question

Aar*_*and 2 python beautifulsoup web-scraping

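I have been trying to scrape a table from here for quite a while, without success. The table I want to scrape is titled "Team Per Game Stats". I am confident that once I can scrape one element of that table, I can iterate over the columns I need from a list and end up with a pandas DataFrame.

This is my code so far: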

from bs4 import BeautifulSoup
import requests

# URL that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Let's look at what the request content looks like
print(r.content)

# use BeautifulSoup on the content from the request; naming the parser
# explicitly avoids the "no parser was explicitly specified" warning
c = r.content
soup = BeautifulSoup(c, 'lxml')
print(soup)

# prettify() indents the HTML so that its nesting is visible,
# which can make reading the HTML a little easier
print(soup.prettify())

# get the wrapper element around the "Team Per Game Stats" table
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)


Any help would be greatly appreciated.

Answer 1

Mar*_*ers 6

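The web page uses a trick to stop search engines and other automated web clients (including scrapers) from finding the table data: the tables are stored in HTML comments: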

<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">

<div class="section_heading">
  <span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2>    <div class="section_heading_text">
      <ul> <li><small>* Playoff teams</small></li>
      </ul>
    </div>      
</div>
<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_team-stats-per_game">
  <table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>

...

</table>

      </div>
   </div>
-->
</div>


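Note that the opening div has the setup_commented and commented classes. The browser executes JavaScript code included in the page, which loads the text from those comments and replaces the placeholder div with the resulting HTML for the browser to display. A scraper that does not run JavaScript sees only the comment.

You can extract the comment text like this: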
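To make the effect concrete, here is a minimal sketch of why the lookup in the question finds the wrapper div but no table. The sample markup is a made-up, cut-down stand-in for the real page, and it uses the built-in html.parser so that it runs without lxml; both are assumptions for illustration only.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, cut-down stand-in for the real page markup.
SAMPLE_HTML = """
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game"><tr><td>hidden</td></tr></table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wrapper = soup.find(id="all_team-stats-per_game")

# The wrapper div is found, but the table is not part of the parsed tree.
table_in_tree = wrapper.find("table")

# The table markup only exists as the text of a Comment node.
comments = wrapper.find_all(string=lambda text: isinstance(text, Comment))

print(table_in_tree)
print(len(comments))
```

The parser treats everything between `<!--` and `-->` as one opaque string, so no tag search can reach the table until that string is parsed separately.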

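import requests
from bs4 import BeautifulSoup, Comment

# same request as in the question
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')

soup = BeautifulSoup(r.content, 'lxml')
# the placeholder div sits directly before the comment that holds the table
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
# take the first Comment node among the siblings that follow the placeholder
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
# the comment text is itself HTML, so parse it as a separate document
table_soup = BeautifulSoup(comment, 'lxml')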


Then continue by parsing the table HTML.
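One way that last step might look is sketched below. It runs against a small made-up sample rather than the live page. The `table_rows` helper, the sample rows, and the assumption that the table has a `thead` with a single header row and a `tbody` whose first cell in each row is a `th` are all hypothetical, so check them against the real markup before relying on this.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample; the real table has many more columns and rows.
SAMPLE_HTML = """
<div id="all_team-stats-per_game">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game">
<thead><tr><th>Rk</th><th>Team</th><th>PTS</th></tr></thead>
<tbody>
<tr><th>1</th><td>Team A*</td><td>118.1</td></tr>
<tr><th>2</th><td>Team B</td><td>117.7</td></tr>
</tbody>
</table>
-->
</div>
"""


def table_rows(table):
    """Return the body rows of a <table> tag as dicts keyed by header text."""
    headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
    rows = []
    for tr in table.tbody.find_all("tr"):
        # Row cells may be a mix of <th> and <td>; keep them in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(dict(zip(headers, cells)))
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
placeholder = soup.select_one("#all_team-stats-per_game .placeholder")
comment = next(el for el in placeholder.next_siblings if isinstance(el, Comment))
table_soup = BeautifulSoup(comment, "html.parser")
rows = table_rows(table_soup.find("table", id="team-stats-per_game"))
print(rows[0])
```

If pandas is installed, `pandas.DataFrame(rows)` should turn this list of dicts into the data frame the question asks for. All values are strings at this point, so numeric columns would still need converting.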

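This particular site publishes both terms of use and a page on data use, which you should probably read if you are going to use their data. Specifically, their terms state, under section 6, Site Content:

You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.

Scraping the data would fall under that category.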
