mor*_*orn 6 python beautifulsoup
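I'm new to Python and HTML. I'm trying to retrieve the number of comments from a page using requests and BeautifulSoup.
In this example, I'm trying to get the number 226. Here is the code I can see when I inspect the page in Chrome: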
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
<span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
226
</span>
Comments
</a>
When I request the text from the URL, I can find this code, but there is nothing between the span tags, no 226. Here is my code:
import requests, bs4
url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
# fetch the raw HTML as served, before any JavaScript runs
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
print(span)
It returns this, the same as above but without the 226:
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I don't understand why the value isn't showing up. Thanks in advance for any help.
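The page, and the comment count in particular, relies on JavaScript to load and display. However, you don't have to use Selenium; you can make a request to the API behind it instead: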
import requests

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}

    # visit main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)

    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
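If you want the bare number rather than the whole dictionary, something like the following might help. It is only a sketch: the helper names `params_from_span_attrs` and `extract_count` are made up for illustration, and the mapping from the span's `data-*` attributes to the query parameters is an assumption inferred from the matching values in this thread, not from any API documentation. The attribute dictionary is written out by hand here; with BeautifulSoup you would pass `span.attrs` instead.

```python
def params_from_span_attrs(attrs):
    """Map the span's data-* attributes to the query parameters used above.

    ASSUMPTION: this mapping is inferred from the values seen in this
    thread (globeandmail / en / 33519766), not from API documentation.
    """
    return {
        "publication_slug": attrs["data-site-id"],
        "reference_language": attrs["data-language"],
        "reference_ids": attrs["data-id"],
    }


def extract_count(payload, article_id):
    """Return the comment count for article_id, or None if it is absent."""
    counts = payload.get("comment_counts", {})
    # The response keys are strings, so normalise the id before the lookup.
    return counts.get(str(article_id))


# Values copied from the span and the response shown in this thread.
attrs = {"data-site-id": "globeandmail", "data-id": "33519766", "data-language": "en"}
params = params_from_span_attrs(attrs)
payload = {"comment_counts": {"33519766": 226}}
print(extract_count(payload, params["reference_ids"]))  # 226
```

Returning `None` for a missing id keeps the caller from hitting a `KeyError` if the response shape ever differs from the one printed above.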