Python: How do I parse the HTML of a web page that requires login?

Question

Dam*_*en 2 html python parsing beautifulsoup

I am trying to parse the HTML of a web page that requires login. I can get a page's HTML with the following script:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it with the built-in HTML parser
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)
# This prints the source of example.com


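Getting the source of a page that I am logged in to has turned out to be harder, though. I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an invalid URL error.

Does anyone know how I can do this? Thanks in advance.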
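
For context, the user:pass@host URL form belongs to HTTP Basic authentication, and Python's urlopen tries to read the text after the colon as a port number, which would explain the invalid URL error. If the site really uses Basic authentication (an assumption; most sites with a login form do not), a minimal sketch is to send the credentials in an Authorization header instead. The helper name build_basic_auth_request and the "user"/"pass" values are placeholders, not part of the original script.

```python
import base64
from urllib.request import Request


def build_basic_auth_request(url, username, password):
    """Return a Request carrying an HTTP Basic Authorization header.

    Hypothetical helper: it only helps if the server uses HTTP Basic
    authentication rather than a login form.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder URL and credentials, for illustration only
request = build_basic_auth_request("https://www.example.com", "user", "pass")
print(request.get_header("Authorization"))
```

The resulting request object can be passed to urlopen in place of the plain URL string. A site that logs users in through an HTML form ignores this header, so a form-based approach is needed there.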

Answer 1

Dav*_*542 5

Selenium WebDriver ( http://seleniumhq.org/projects/webdriver/ ) may suit your needs. You can log in to the page and then print the HTML content. Here is an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# start the browser
driver = webdriver.Firefox()  # start a driver, in this case Firefox
driver.get("http://example.com")  # go to the URL

# locate the login form (replace each ... with the field's name attribute)
username_field = driver.find_element(By.NAME, ...)  # get the username field
password_field = driver.find_element(By.NAME, ...)  # get the password field

# log in
username_field.send_keys("username")  # enter your username
password_field.send_keys("password")  # enter your password
password_field.submit()  # submit the form

# print the HTML of the page you land on after logging in
html = driver.page_source
print(html)

driver.quit()  # close the browser when done

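
Once the logged-in HTML is available as a string (driver.page_source in the example above), it can be parsed like any other HTML. The sketch below uses the standard library's html.parser rather than BeautifulSoup, purely to stay self-contained. The sample markup and the TitleAndLinkParser class are made up for illustration and stand in for a real page.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and every link target (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated
        if self._in_title:
            self.title += data


# Stand-in for driver.page_source; a real page would be much longer
html = """<html><head><title>My Account</title></head>
<body><a href="/orders">Orders</a> <a href="/logout">Log out</a></body></html>"""

parser = TitleAndLinkParser()
parser.feed(html)
print(parser.title)
print(parser.links)
```

With BeautifulSoup installed, passing the same string to BeautifulSoup(html, "html.parser") would likely be the more convenient route, since it gives the same kind of soup object as in the question.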

Archived:	13 years, 10 months ago
Views:	6691
Last recorded:	7 years, 7 months ago