Using Selenium + Python to iterate through links and scrape data from the resulting pages

Question

I'm new to Selenium and need to scrape a website that contains a list of links structured exactly like this:

<a class="unique" href="...">
    <i class="something"></i>
    "Text - "
    <span class="something">Text</span>
</a>
<a class="unique" href="...">
    <i class="something"></i>
    "Text - "
    <span class="something">Text</span>
</a>
...
...

I need to click each link in this list inside a loop and scrape data from the resulting page. Here is what I have so far:

lists = browser.find_elements_by_xpath("//a[@class='unique']")
for lis in lists:
    print(lis.text)
    lis.click()
    time.sleep(4)
    # Scrape data from this page (works fine).
    browser.back()
    time.sleep(4)

It works for the first iteration, but when the second iteration reaches

print(lis.text)

it throws an error saying:

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

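I tried print(lists), and it shows the list of all the link elements, so that part works. The problem occurs when the browser goes back to the previous page. I tried increasing the sleep time and using browser.get(...) instead of browser.back(), but the error persists. I don't understand why it won't print lis.text, since lists still contains all the elements. Any help would be greatly appreciated.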

Answer 1

小智 2

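You are trying to click the text rather than open the link. The exception itself comes from the navigation: once the browser leaves the page and returns, the DOM is rebuilt, so the element references stored in lists no longer belong to the current document, even though the Python list still holds them.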

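Clicking each link, scraping the data, and navigating back is also inefficient. Instead, you can store all the link URLs in a list, navigate to each one with driver.get('some link'), and scrape the data there. To avoid the exception, try the following modified code: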

from time import sleep
from selenium.webdriver.common.by import By

# Locate the anchor nodes first; find_elements(By.XPATH, ...) replaces the
# find_elements_by_xpath helper, which later Selenium 4 releases removed
lists = browser.find_elements(By.XPATH, "//a[@class='unique']")
# Read every href up front, while the elements are still attached to the page
links = []
for lis in lists:
    href = lis.get_attribute('href')
    print(href)
    links.append(href)

# Visit each stored URL directly; no element reference is reused after navigating
for link in links:
    browser.get(link)
    # Scrape here
    sleep(3)

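As a hedged variation on the link-collecting step above: `get_attribute('href')` returns `None` for an anchor that has no `href`, and the same URL can appear more than once on a page. The sketch below filters out both cases before any navigation happens. The helper name `collect_hrefs` is hypothetical and not part of Selenium; it only assumes that each element exposes `get_attribute`, as Selenium's `WebElement` does.

```python
def collect_hrefs(elements):
    """Return the unique, non-empty href values of the elements, in page order."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        # Skip anchors without an href and URLs that were already collected.
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs
```

With the names used in the answer's code, `links = collect_hrefs(lists)` would take the place of the first loop.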

Alternatively, if you want to keep the same logic, you can use a Fluent Wait to avoid exceptions such as StaleElementReferenceException, as shown below:

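from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Poll once per second for up to 10 seconds, ignoring stale references while waiting
wait = WebDriverWait(browser, 10, poll_frequency=1, ignored_exceptions=[StaleElementReferenceException])
# The locator is looked up again on every poll, so a fresh element is returned
element = wait.until(EC.element_to_be_clickable((By.XPATH, "xPath that you want to click")))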

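If the click-and-go-back flow has to stay, one option is to look the links up again on every iteration, so that no reference obtained before a navigation is ever reused. The sketch below is a minimal illustration under stated assumptions: `scrape_by_reclicking` and `scrape_page` are hypothetical names, the locator is the one from the question, and `"xpath"` is the string value behind Selenium's `By.XPATH` (passing `By.XPATH` itself is the more usual spelling). It also assumes the page shows the same links in the same order each time it is reloaded.

```python
import time

LINK_XPATH = "//a[@class='unique']"  # locator taken from the question


def scrape_by_reclicking(browser, scrape_page, pause=0.0):
    """Click each link in turn, re-locating the links after every navigation."""
    results = []
    # "xpath" is the string value of selenium's By.XPATH constant.
    total = len(browser.find_elements("xpath", LINK_XPATH))
    for index in range(total):
        # Fresh lookup: references obtained before navigating are stale.
        links = browser.find_elements("xpath", LINK_XPATH)
        if index >= len(links):
            break  # the page now has fewer links than on the first pass
        links[index].click()
        time.sleep(pause)
        results.append(scrape_page(browser))
        browser.back()
        time.sleep(pause)
    return results
```

A call might look like `scrape_by_reclicking(browser, lambda b: b.title, pause=4)`. The fixed pause mirrors the `time.sleep(4)` in the question; an explicit wait on the target page would likely be more robust.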

I hope this helps.
