There is a page I want to parse with lxml. When you click on it, the table data changes to a different form.
from urllib.request import urlopen
import lxml.html
url="http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=sz300059"
material=urlopen(url).read()
root=lxml.html.document_fromstring(material)
If I write
set=root.xpath('//table[@id="BBMX_table"]//tr')
I get the table data that corresponds to
<li class="first current" onclick="ChangeRptF10AssetStatement('30005902','8','All',this,'');">
The table data I want is the one that corresponds to
<li class="" onclick="ChangeRptF10AssetStatement('30005902','8','Year',this,'');">
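How can I write the XPath expression for root.xpath correctly? More information: the table switches to the other one when you click the element whose handler is onclick="ChangeRptF10AssetStatement('30005902','8','Year',this,'')".
I tried Selenium: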
import lxml.etree
import lxml.html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options,executable_path='/usr/bin/chromedriver')
browser.get("http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=sz300059")
root = lxml.html.document_fromstring(browser.page_source)
mystring = lxml.etree.tostring(root, encoding = "unicode")
with open("/tmp/test.html","w") as fh:
    fh.write(mystring)
When I open /tmp/test.html, there is no data in it. How can I get the data I expect?
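Scraping a website can have unintended consequences.
Make sure the site you are scraping does not forbid it. If they say not to scrape the site, you should respect that.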
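One common way to check what a site asks of crawlers is its robots.txt file. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `my-scraper` user agent, the example.com URLs, and the `is_allowed` helper are all made up for illustration so the example runs offline, and say nothing about what the real site permits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path with no matching Disallow rule is allowed; one under /private/ is not.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/f10_v2/page.aspx"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/private/data"))
```

Against a live site you would instead point the parser at the real file with `set_url(...)` followed by `read()`. A robots.txt file is only part of the picture; the site's terms of use may add further restrictions.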
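I see that your code uses Selenium and writes the page out to an HTML file.
Update: to make the code run reliably, and following Sers's suggestion, the way it waits for the page elements to finish loading should be improved. I adjusted the code as follows: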
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options,
                           executable_path=r'F:\chromedriver.exe')
wait = WebDriverWait(browser, 20)

list_stock = ['sz300059', 'sz300766', 'sz002950']


try:
    for id_stock in list_stock:
        url_id = "http://f10.eastmoney.com/f10_v2/FinanceAnalysis.aspx?code=" + id_stock
        browser.get(url_id)

        # Wait for the page, then click the 按年度 (per year) tab
        wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#zyzbTab > li:nth-child(2)")))

        element_per_year = browser.find_element(By.CSS_SELECTOR, '#zyzbTab > li:nth-child(2)')
        element_per_year.click()

        # Wait for the table that the click loads, then read its HTML
        wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#report_zyzb")))
        # time.sleep(5)
        element_tb_per_year = browser.find_element(By.CSS_SELECTOR, '#report_zyzb')
        tb_per_year_html = element_tb_per_year.get_attribute('innerHTML')

        path_file_html = fr'F:\test_{id_stock}.html'

        with open(path_file_html, "w", encoding='utf-8') as fh:
            fh.write(tb_per_year_html)

        print(f'export id: {id_stock}')

except TimeoutException:
    print("Timed out waiting for page to load")

finally:
    browser.close()
    browser.quit()
When WebDriverWait does not work reliably, I think you should fall back to time.sleep. You can find more information about this by searching online.
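The difference between a fixed time.sleep and an explicit wait can be shown in plain Python, without a browser. The `wait_until` helper below is hypothetical and is not part of Selenium; it only sketches the general idea behind an explicit wait: poll a condition at short intervals and give up after a timeout, rather than always sleeping for the worst case. The `table_is_ready` condition is a stand-in that becomes true on its third call.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition: pretends the table appears on the third check.
calls = {"count": 0}


def table_is_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(table_is_ready, timeout=2.0, poll_interval=0.01)
print(calls["count"])  # prints 3
```

With a real browser the condition would be something that inspects the page, for example whether the table element already contains rows. A fixed time.sleep(5) is simpler, but it always costs five seconds and still fails if the page takes longer.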
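Once the table HTML has been exported, it can go back through lxml, which is what the question set out to do. This is a minimal sketch: `SAMPLE_TABLE_HTML` is an invented stand-in for the exported markup (the real page's structure and values may differ), and `table_rows` is a hypothetical helper name.

```python
import lxml.html

# Invented stand-in for the exported table markup; the real file may look different.
SAMPLE_TABLE_HTML = """
<table id="report_zyzb">
  <tr><th>Item</th><th>2019</th><th>2018</th></tr>
  <tr><td>Revenue</td><td>100</td><td>80</td></tr>
  <tr><td>Net profit</td><td>20</td><td>15</td></tr>
</table>
"""


def table_rows(html_text):
    """Return the table as a list of rows, each a list of stripped cell texts."""
    root = lxml.html.fromstring(html_text)
    rows = []
    for tr in root.xpath(".//tr"):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        rows.append(cells)
    return rows


for row in table_rows(SAMPLE_TABLE_HTML):
    print(row)
```

To process a saved file, read it with `open(path, encoding="utf-8")` and pass the text to `table_rows`. If the export holds only the inner rows without the enclosing table element, the parser may treat them differently, so check the result against the file you actually saved.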