Python: How do I parse the HTML of a web page that requires a login?

Dam*_*en 2 html python parsing beautifulsoup

I am trying to parse the HTML of a web page that requires a login. I can get the HTML of a public web page with the following script:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse the response
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)
# This would print the source of example.com

Getting the source of a page that I have to be logged in to see is proving harder, though. I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an invalid URL error.

Does anyone know how I could do this? Thanks in advance.

Dav*_*542 5

Selenium WebDriver (http://seleniumhq.org/projects/webdriver/) may suit your needs. It drives a real browser, so you can log in to the page and then print the HTML content. Here is an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
driver = webdriver.Firefox()  # start a browser session, in this case Firefox
driver.get("http://example.com")  # go to the URL

# locate the login form
username_field = driver.find_element(By.NAME, ...)  # get the username field
password_field = driver.find_element(By.NAME, ...)  # get the password field

# log in
username_field.send_keys("username")  # enter your username
password_field.send_keys("password")  # enter your password
password_field.submit()  # submit the form that contains the field

# print HTML
html = driver.page_source
print(html)
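
If the site logs you in through an ordinary HTML form and tracks the session with cookies, a browser may not be necessary. The following is only a minimal sketch of that different, browser-free technique, using the Python 3 standard library. The function name `fetch_after_login`, the URLs, and the form field names are all hypothetical and are not taken from the question; a site that relies on JavaScript, CSRF tokens, or HTTP Basic authentication would need a different approach.

```python
# Python 3, standard library only.
# Assumption: the site accepts a plain form POST and sets a session cookie.
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def fetch_after_login(login_url, page_url, form_fields):
    """POST form_fields to login_url, then return the HTML of page_url."""
    # The opener remembers cookies from the login response and sends them
    # back on later requests made through the same opener.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    # Passing data makes urllib send a form-encoded POST request.
    body = urlencode(form_fields).encode("utf-8")
    with opener.open(login_url, data=body) as response:
        response.read()  # the login response body is not needed here

    # Request the protected page with the stored cookies.
    with opener.open(page_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical usage (placeholder URLs and field names):
# html = fetch_after_login(
#     "https://www.example.com/login",
#     "https://www.example.com/account",
#     {"username": "user", "password": "pass"},
# )
```

The returned string can be passed to BeautifulSoup in the same way as the `urlopen` result in the question. The field names have to match the `name` attributes of the inputs in the site's login form.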