Ghu*_*lam 8 nlp arabic python-3.x
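I can successfully open a URL and save the resulting page as an .html file. However, I cannot work out how to download and save it as .mhtml (Web page, single file).
My code is: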
import time
import urllib.parse
import urllib.request

url = 'https://www.example.com'
# Percent-encode the whole URL so it can be passed as the "u" query parameter
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)
base_url = 'https://translate.google.co.uk/translate?sl=auto&tl=en&u='
translation_url = base_url + encoded_url
print(translation_url)
req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)
response = urllib.request.urlopen(req)
time.sleep(15)
print(response)
webContent = response.read()
print(webContent)
# The with-block closes the file; the original "f.close" lacked parentheses and never ran
with open('GoogleTranslated.html', 'wb') as f:
    f.write(webContent)
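I tried using wget, following the details captured in this question: How to download apages (mhtml format) using wget in python, but the details there are incomplete (or I simply cannot understand them).
Any suggestions at this stage would be helpful.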
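Unlike the previous answer, my solution does not involve any scripted mouse or keyboard actions, and the downloaded MHTML file can be stored at any location you choose. I learned this method from a Chinese blog. The key idea is to use a Chrome DevTools Protocol command.
The code is shown below.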
import os
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')
# Execute Chrome dev tool command to obtain the mhtml file
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
# Make sure the target directory exists, then write the file locally
os.makedirs('./store', exist_ok=True)
with open('./store/qq.mhtml', 'w', newline='') as f:
    f.write(res['data'])
driver.quit()
Hope this helps! See the Chrome DevTools Protocol documentation for more information.
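To tie this back to the question, here is a minimal sketch that wraps the same `Page.captureSnapshot` call in a reusable function and rebuilds the Google Translate URL from the question. The names `build_translate_url` and `save_mhtml` are hypothetical helpers introduced only for illustration, and the explicit `utf-8` encoding is an assumption that the original answer does not make.

```python
from pathlib import Path
from urllib.parse import quote

# Base URL copied from the question; whether the service still accepts it is not verified here
BASE_URL = 'https://translate.google.co.uk/translate?sl=auto&tl=en&u='


def build_translate_url(url):
    """Return the translation URL for `url` (hypothetical helper)."""
    # safe='' also encodes ':' and '/', as the question's code does
    return BASE_URL + quote(url, safe='')


def save_mhtml(driver, path):
    """Capture the page currently loaded in `driver` as MHTML and write it to `path`.

    `driver` is expected to expose execute_cdp_cmd(), as Selenium's Chrome driver does.
    """
    result = driver.execute_cdp_cmd('Page.captureSnapshot', {'format': 'mhtml'})
    target = Path(path)
    # Create missing parent directories so that open() does not fail
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline='' writes the line endings exactly as they appear in the snapshot
    with open(target, 'w', newline='', encoding='utf-8') as f:
        f.write(result['data'])
    return target
```

With a Selenium Chrome driver this would be used as `driver.get(build_translate_url('https://www.example.com'))` followed by `save_mhtml(driver, 'store/GoogleTranslated.mhtml')`. Passing the driver in as a parameter keeps the file-writing logic independent of how the browser was started.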
Have you tried using Selenium with the Chrome WebDriver to save the page?
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
# type a file name only if one was given, then confirm the dialog
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')