Scraping a page with a "load more results" button

Question

jim*_*iat 4 python python-requests

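I'm trying to scrape the following page using requests and BeautifulSoup/lxml:

https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all

It's the kind of page that has a "load more results" button. I found a few pages explaining how to handle this, but none of them use requests.

I know I should spend a few more hours researching this before asking here, to show that I've tried.

I looked at the inspector pane, the network tab and so on, but I'm still too new to this to understand how requests can interact with JavaScript.

I don't need a full-blown script or solution as an answer, just some pointers on how to accomplish this very typical task with requests, to save me some precious research time.

Thanks in advance.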

Answer 1

bri*_*fey 6

Here's a quick script that should show how this can be done with Selenium:

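# Written for Selenium 3: PhantomJS support and the find_element(s)_by_*
# helpers were removed in Selenium 4.
from selenium import webdriver
import time

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
driver = webdriver.PhantomJS()
driver.get(url)
page_num = 0

# Keep clicking the "load more results" button for as long as it is on the page
while driver.find_elements_by_css_selector('.search-result-more-txt'):
    driver.find_element_by_css_selector('.search-result-more-txt').click()
    page_num += 1
    print("getting page number "+str(page_num))
    time.sleep(1)  # give the new results time to load

# Grab the HTML once every batch of results has been appended
html = driver.page_source.encode('utf-8')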

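I don't know how to do this with requests. Reuters seems to have a lot of articles about soybeans: by the time I finished writing this answer, the script had already gone through more than 250 "page loads".

Once you have loaded all of the pages, or as many as you need, you can extract the data by passing html into Beautiful Soup: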

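from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', attrs={"class":'search-result-indiv'})
# Take the first link in each result, skipping results that have none
articles = [div.find('a', href=True)['href'] for div in links if div.find('a', href=True)]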
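For anyone who still wants to try this with requests alone: a "load more" button often just fires a background request that shows up in the browser's network tab, and that request can sometimes be replayed with an increasing page number. The following is only a rough sketch of that idea, not something confirmed to work against Reuters. The `fetch_pages` helper, the `pageNumber` parameter name, and the "empty body means no more results" stop condition are all assumptions made for illustration; the real URL and parameters would have to be read off the network tab.

```python
def fetch_pages(session, url, base_params, page_param="pageNumber", max_pages=5):
    """Request successive result pages until one comes back empty.

    `session` is anything with a requests-style .get(url, params=..., timeout=...)
    method, for example requests.Session(). `page_param` is a HYPOTHETICAL
    parameter name: replace it with whatever the button's request really uses.
    """
    pages = []
    for page in range(1, max_pages + 1):
        # Copy the fixed query parameters and add the current page number
        params = dict(base_params, **{page_param: page})
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()
        body = response.text.strip()
        if not body:  # assumption: an empty body means there are no more results
            break
        pages.append(body)
    return pages
```

It could be called as `fetch_pages(requests.Session(), endpoint_url, {"blob": "soybean"})`, where `endpoint_url` stands for whatever address the button's request actually goes to. Each returned body can then be parsed with Beautiful Soup in the same way as above.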

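I'm getting an error saying that Selenium no longer works with PhantomJS: "C:\Users\xxx\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '" and the script won't run. Any suggestions? (2 upvotes)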
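One possible way around the deprecation, as the warning itself suggests, is to switch to headless Chrome. The sketch below assumes Selenium 4 or later with Chrome installed locally. The helper names `make_headless_chrome`, `click_until_gone` and `scrape` are made up for this example, and the `max_clicks` cap is an added safeguard that is not part of the original script.

```python
import time


def make_headless_chrome():
    """Create a headless Chrome driver (assumes Selenium 4+ and a local Chrome)."""
    # Imported here so click_until_gone can be tried without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # older Chrome builds use "--headless"
    return webdriver.Chrome(options=options)


def click_until_gone(driver, selector, max_clicks=50, pause=1.0):
    """Click the element matching `selector` until it is gone, hidden, or capped."""
    clicks = 0
    while clicks < max_clicks:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR
        buttons = driver.find_elements("css selector", selector)
        if not buttons or not buttons[0].is_displayed():
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the new results time to load
    return clicks


def scrape(url, selector=".search-result-more-txt", max_clicks=10):
    """Load `url`, press the load-more button repeatedly, return the final HTML."""
    driver = make_headless_chrome()
    try:
        driver.get(url)
        click_until_gone(driver, selector, max_clicks=max_clicks)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `scrape(url)` with the search URL from the question should return HTML that can be handed to Beautiful Soup as before, provided the page still uses the `.search-result-more-txt` button.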

Archived:	7 years, 9 months ago
Views:	4641
Last activity:	4 years, 11 months ago