How do I extract the data of a specific div table with lxml?

Question

it_*_*ure 6 html python lxml web-scraping python-3.x

There is a page I want to parse with lxml. When you click a tab on it, the table data switches to a different view.

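from urllib.request import urlopen
import lxml.html
url="http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=sz300059"
material=urlopen(url).read()
# fromstring() parses markup held in memory; lxml.html.parse() expects a file name, URL, or file object
root=lxml.html.fromstring(material)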

If I write

rows=root.xpath('//table[@id="BBMX_table"]//tr')

I get the table data that corresponds to

<li class="first current" onclick="ChangeRptF10AssetStatement('30005902','8','All',this,'');">

This is what I get: [image]

The table data I want is the one that corresponds to

<li class="" onclick="ChangeRptF10AssetStatement('30005902','8','Year',this,'');">

This is what I want to get: [image]

How can I write the XPath expression for root.xpath correctly? More information: when you click the element whose attribute is onclick="ChangeRptF10AssetStatement('30005902','8','Year',this,'')", the table changes to another one.

I tried Selenium:

# "import lxml" alone does not load the submodules used below
import lxml.etree
import lxml.html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options,executable_path='/usr/bin/chromedriver')
browser.get("http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=sz300059")
# parse the rendered page and dump it to a file for inspection
root = lxml.html.document_fromstring(browser.page_source)
mystring = lxml.etree.tostring(root, encoding = "unicode")
with open("/tmp/test.html","w") as fh:
    fh.write(mystring)

When I open /tmp/test.html, there is no data in it. How can I get the data I expect?

Answer 1

Sơn*_*ờng 2

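Scraping a website can have unwanted consequences.

Make sure the website you are scraping does not prohibit it. If the site says not to scrape it, you should respect that.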
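One place to look is the site's robots.txt, although the terms of service may add further restrictions. Here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the user agent name are made up for illustration and are not the real rules of f10.eastmoney.com; `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=sz300059"
blocked = "http://f10.eastmoney.com/private/report"

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", page))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", blocked))
```

For a live check, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.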

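I see that your code uses Selenium and writes out an HTML file:

Update: to make the code more stable, and following Sers's suggestion, the way it waits for the page elements to finish loading should be improved. I adjusted the code as follows: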

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
# executable_path is the Selenium 3 style; newer releases take a Service object instead
browser = webdriver.Chrome(options=chrome_options,
                           executable_path=r'F:\chromedriver.exe')
wait = WebDriverWait(browser, 20)

list_stock = ['sz300059', 'sz300766', 'sz002950']


try:
    for id_stock in list_stock:
        url_id = "http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=" + id_stock
        browser.get(url_id)

        # click the 按年度 (per year) tab
        wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#zyzbTab > li:nth-child(2)")))

        element_per_year = browser.find_element(By.CSS_SELECTOR, '#zyzbTab > li:nth-child(2)')
        element_per_year.click()

        # get the table once the page has settled again
        wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#report_zyzb")))
        # time.sleep(5)
        element_tb_per_year = browser.find_element(By.CSS_SELECTOR, '#report_zyzb')
        tb_per_year_html = element_tb_per_year.get_attribute('innerHTML')

        path_file_html = fr'F:\test_{id_stock}.html'

        with open(path_file_html, "w", encoding='utf-8') as fh:
            fh.write(tb_per_year_html)

        print(f'export id: {id_stock}')

except TimeoutException:
    print("Timed out waiting for page to load")

finally:
    # quit() closes every window and ends the driver session
    browser.quit()
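Since the question is about lxml, the saved fragment can then be turned into rows of cell text. This is only a minimal sketch: it assumes the saved innerHTML consists of `<tr>` rows with `<th>`/`<td>` cells, which I have not verified against the live page. The sample markup and figures are invented for illustration, and `table_rows` is a hypothetical helper.

```python
import lxml.html


def table_rows(fragment):
    """Parse an HTML fragment made of table rows into lists of cell text."""
    # Wrap the rows in <table> so the parser receives one well-formed element.
    table = lxml.html.fromstring("<table>" + fragment + "</table>")
    rows = []
    for tr in table.xpath(".//tr"):
        cells = tr.xpath("./th | ./td")
        rows.append([cell.text_content().strip() for cell in cells])
    return rows


# Invented sample data standing in for the contents of the saved file.
sample = (
    "<tr><th>Item</th><th>2018</th><th>2017</th></tr>"
    "<tr><td>Example metric</td><td> 1.0 </td><td>2.0</td></tr>"
)

for row in table_rows(sample):
    print(row)
```

In practice the fragment would be read from the exported file, or taken directly from `tb_per_year_html` before it is written out.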

If WebDriverWait does not behave correctly, I think you should fall back on time.sleep. You can find more about this by searching on Google.
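If you do fall back on time.sleep, polling in short steps up to a deadline is usually gentler than one long fixed pause. Here is a minimal sketch using only the standard library. `wait_until` and `table_is_ready` are hypothetical names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a real check such as "the table has rows";
# here it simply becomes true on the third call.
calls = {"n": 0}


def table_is_ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(table_is_ready, timeout=2.0, poll=0.01))
```

With Selenium, the condition could be something like `lambda: browser.find_elements(By.CSS_SELECTOR, '#report_zyzb tr')`, assuming the table rows live under that element.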

Here is the image: [image]

Archived:	11 years, 9 months ago
Views:	1,629
Last recorded:	5 years, 11 months ago