无法使用 css 选择器在 python 中获取数据

Question

无法使用 css 选择器在 python 中获取数据

嗨，我想从这个网站获取电影片名：

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})  
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
for i in range(len(movie_list)):
    print(movie_list[i].text)

Run Code Online (Sandbox Code Playgroud)

我得到了响应 200，并且在抓取其他信息时没有问题。但问题出在变量movie_list 中。

当我打印（电影列表）时，它只返回空列表，这意味着我使用的标签错误。

Answer 1

lar*_*sks 5

如果更换：

movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")

Run Code Online (Sandbox Code Playgroud)

和：

movie_list = html.select("#page_filling_chart table tr > td > b > a")

Run Code Online (Sandbox Code Playgroud)

你得到了我认为你正在寻找的东西。这里的主要变化是用parent > child后代选择器 ( ancestor descendant)替换子选择器 ( )，这对于中间内容的外观要宽容得多。

更新：这很有趣。您选择的BeautifulSoup解析器似乎会导致不同的行为。

相比：

>>> html = BeautifulSoup(raw, 'html.parser')
>>> html.select('#page_filling_chart > table')
[]

Run Code Online (Sandbox Code Playgroud)

和：

>>> html = BeautifulSoup(raw, 'lxml')
>>> html.select('#page_filling_chart > table')
[<table>
<tr><th>Rank</th><th>Movie</th><th>Release<br/>Date</th><th>Distributor</th><th>Genre</th><th>2019 Gross</th><th>Tickets Sold</th></tr>
<tr>
[...]

Run Code Online (Sandbox Code Playgroud)

事实上，使用lxml解析器你几乎可以使用你原来的选择器。这有效：

html.select("#page_filling_chart > table > tr > td > b > a"

Run Code Online (Sandbox Code Playgroud)

解析后， atable没有tbody。

经过一段时间的尝试后，您必须像这样重写原始查询才能使其与以下内容一起使用html.parser：

html.select("#page_filling_chart2 > p > p > p > p > p > table > tr > td > b > a")

Run Code Online (Sandbox Code Playgroud)

当它们从源中丢失时，它看起来html.parser不会合成关闭</p>元素，所以所有未关闭的<p>标签都会导致一个奇怪的解析文档结构。

归档时间：	5 年，5 月前
查看次数：	194 次
最近记录：	5 年，5 月前