Python Selenium: accessing the HTML source

Question

use*_*791 87 python selenium selenium-webdriver

How can I get the HTML source into a variable using the Selenium module with Python?

I want to do something like this:

from selenium import webdriver
browser = webdriver.Firefox()
browser.get(raw_input("Enter URL: "))
if "whatever" in html_source:
    # Do something
else:
    # Do something else

How can I do this? I don't know how to access the HTML source.

Answer 1

Aut*_*ter 168

You need to access the page_source property. See below.

from selenium import webdriver
browser = webdriver.Firefox()
browser.get(input("Enter URL: "))  # use raw_input() on Python 2
html_source = browser.page_source  # HTML of the current page as a string
if "whatever" in html_source:
    pass  # do something
else:
    pass  # do something else

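What if we need to get the page source after all the JavaScript has executed? (9 upvotes)
Best answer so far! The most direct and explicit way, and more compact than the other, still valid, alternative (`find_element_by_xpath("//*").get_attribute("outerHTML")`). (5 upvotes)
This only works once the page has fully loaded. If the page keeps loading indefinitely, this property does not work. (3 upvotes)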
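
The comments above concern pages whose content is filled in by JavaScript after the initial load. One possible way to handle that is to poll `page_source` until some expected text shows up. The sketch below assumes only that the driver object exposes a `page_source` attribute, as Selenium drivers do; `wait_for_source`, `marker`, `timeout`, and `poll` are hypothetical names, not part of Selenium. Selenium's own `WebDriverWait(driver, timeout).until(...)` offers the same kind of polling.

```python
import time


def wait_for_source(driver, marker, timeout=10.0, poll=0.5):
    """Return driver.page_source once it contains `marker`.

    Hypothetical helper: raises TimeoutError if `marker` has not
    appeared within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source  # re-read: the DOM may have changed
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} s")
        time.sleep(poll)
```

With a real browser this could be called as `html_source = wait_for_source(browser, "whatever")` right after `browser.get(...)`.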

Answer 2

Dhi*_*raj 7

driver.page_source gives you the page source. You can then check whether the text is present in it.

from selenium import webdriver
driver = webdriver.Firefox()
driver.get("some url")
if "your text here" in driver.page_source:
    print('Found it!')
else:
    print('Did not find it.')

If you want to store the page source in a variable, add the following line after driver.get:

var_pgsource=driver.page_source

and change the if condition to:

if "your text here" in var_pgsource:

Answer 3

Mob*_*san 6

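from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("some url")  # load a page first; otherwise there is no content to read
# innerHTML of <body> reflects the current DOM, including changes made by scripts
html_source_code = driver.execute_script("return document.body.innerHTML;")
html_soup: BeautifulSoup = BeautifulSoup(html_source_code, 'html.parser')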

Now you can use BeautifulSoup functions to extract data...
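
As a rough illustration of what extracting data from the source string can look like, the sketch below collects the link targets from an HTML string. To stay dependency-free it uses the standard library's `html.parser.HTMLParser` as a stand-in rather than BeautifulSoup; the `LinkCollector` class, the `extract_links` function, and the idea of collecting links are all assumptions made for the example. With BeautifulSoup, the comparable starting point would be `html_soup.find_all('a')`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_source):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html_source)
    collector.close()
    return collector.links
```

The string returned by `driver.execute_script("return document.body.innerHTML;")` or by `driver.page_source` could be passed in directly as `html_source`.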

Answer 4

Mil*_*nka 5

With Selenium2Library, you can use get_source()

import Selenium2Library
s = Selenium2Library.Selenium2Library()
s.open_browser("localhost:7080", "firefox")
source = s.get_source()

Can I set a delay and then get the latest source? The page loads dynamic content with JavaScript. (6 upvotes)

Answer 5

Gri*_*fin -7

I would suggest fetching the source with urllib and, if you are going to parse it, using something like Beautiful Soup.

import urllib.request  # Python 3; on Python 2 this was urllib.urlopen

url = urllib.request.urlopen("http://example.com") # Open the URL.
content = url.readlines() # Read the source (a list of byte lines) and save it to a variable.

Selenium does a lot of things that urllib does not (for example, executing JavaScript). (8 upvotes)

Archived:	14 years, 1 month ago
Views:	122551
Last activity:	6 years, 2 months ago