Save a complete web page (including CSS and images) using Python/Selenium

Max*_*wer 17 python selenium bioinformatics web-crawler

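I'm using Python/Selenium to submit gene sequences to an online database, and I want to save the entire results page I get back. The code below gets me to the results I want: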

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, use webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION) instead
driver = webdriver.Chrome(service=Service(CHROME_WEBDRIVER_LOCATION))
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element(By.ID, "b1")
blast_button.click()
time.sleep(60)

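At that point I have a page that I can manually "Save as", which gives me a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus the content that is generated dynamically as you scroll down the page, which is fine). I assumed there would be a simple way to mimic the "Save as" function in Python/Selenium, but I haven't found one. The code below only saves the HTML, and doesn't leave me with a local file that looks like what I see in the web browser, with images and so on.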

content = driver.page_source
with open('webpage.html', 'w') as f:
    f.write(content)

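I also found this question/answer on SO, but the accepted answer only opens the "Save as" box and doesn't provide a way to click it (as two commenters point out).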

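Is there a simple way to "save [the whole page] as" using Python? Ideally I'd prefer an answer that uses Selenium, since Selenium makes the crawling part so easy, but I'm open to using another library if there is a better tool for the job. Or do I just have to specify in my code every image/table I want to download, with no shortcut that emulates the right-click "Save as" function?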

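UPDATE, following up on James's answer: I ran James's code to generate page.html (and the associated files) and compared it with the HTML file I got by manually clicking "Save as". The page.html saved by James's script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See the attached screenshot (manually saved page on the left; script-saved page, showing the extra formatting text, on the right).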

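This is especially surprising to me because the raw HTML of the page saved by James's script seems to indicate that these fields should still be hidden. See the HTML below, which is identical in both files, yet the text in question only appears in the browser rendering of the page saved by James's script: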

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>

Any idea why this happens?
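One possible explanation, offered as an assumption rather than a confirmed diagnosis: the `aria-hidden="true"` attribute only affects assistive technology and does not hide an element visually on its own, so the help text is presumably hidden on the live page by the site's CSS or JavaScript, which may not be applied in the script-saved copy. Under that assumption, a minimal workaround sketch is to inject a style rule into the saved HTML. The helper name `hide_aria_hidden` and the sample markup are hypothetical.

```python
import re


def hide_aria_hidden(html_text: str) -> str:
    """Insert a CSS rule that hides every element marked aria-hidden="true".

    Assumption: hiding all such elements is acceptable for the saved copy.
    """
    rule = '<style>[aria-hidden="true"] { display: none !important; }</style>'
    match = re.search(r'</head>', html_text, flags=re.IGNORECASE)
    if match is None:
        # No closing </head> tag found: fall back to prepending the rule
        return rule + html_text
    # Place the rule just before </head> so it applies to the whole document
    return html_text[:match.start()] + rule + html_text[match.start():]


# Hypothetical sample document standing in for the saved page.html
sample = ('<html><head><title>t</title></head>'
          '<body><p id="hlp1" aria-hidden="true">help</p></body></html>')
patched = hide_aria_hidden(sample)
```

To apply it to the saved file, read `page/page.html`, pass the text through `hide_aria_hidden`, and write the result back. This hides everything flagged `aria-hidden`, which may be broader than what the live page hides.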

FTh*_*son 12

As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you can instead use an external automation library such as pyautogui.

pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')

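This code opens the Save as... window through its keyboard shortcut CTRL+S, types a file name, and then presses Enter to save the web page and its assets to the default download location. It names the file after the sequence to give it a unique name, but you can change this for your use case. If needed, you can also change the download location, with some extra work, using the Tab and arrow keys.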
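Naming the file after the raw sequence works for the 160-odd characters used here, but many common filesystems cap a file name at 255 bytes, so longer queries would need something shorter. A minimal sketch of one alternative, assuming a hash-based name is acceptable; the helper `short_filename` and the `blast_` prefix are hypothetical and not part of the answer's code:

```python
import hashlib


def short_filename(sequence: str, length: int = 12) -> str:
    """Build a short, repeatable file name from a nucleotide sequence."""
    # The same sequence always yields the same digest, so names are stable
    digest = hashlib.sha1(sequence.encode('ascii')).hexdigest()
    return f'blast_{digest[:length]}.html'


# Example with a short made-up sequence
name = short_filename('CCTAAACTATAGAAGG')
```

The result could then be passed to `pyautogui.typewrite(...)` in place of `SEQUENCE + '.html'`. The trade-off is that the name no longer shows the sequence itself, so a separate lookup table may be needed.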

Tested on Ubuntu 18.10; depending on your operating system, you may need to modify the key combination that is sent.
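As a rough sketch of that OS-dependent adjustment, the modifier key can be chosen from `sys.platform`. The function names `save_modifier` and `save_page_as` are hypothetical. The explicit key-down/press/key-up sequence follows a commenter's report that `hotkey` did not work for them on macOS; whether it is needed on other systems is an assumption.

```python
import sys
import time


def save_modifier(platform: str = sys.platform) -> str:
    """Return the modifier key for the browser's 'Save as...' shortcut."""
    # macOS uses Command+S; Linux and Windows use Ctrl+S
    return 'command' if platform == 'darwin' else 'ctrl'


def save_page_as(filename: str, delay: float = 1.0) -> None:
    """Drive the 'Save as...' dialog of the focused browser window."""
    import pyautogui  # imported lazily because it needs a graphical display

    modifier = save_modifier()
    pyautogui.keyDown(modifier)   # hold the modifier key
    pyautogui.press('s')          # press and release 's'
    pyautogui.keyUp(modifier)     # release the modifier key
    time.sleep(delay)             # give the dialog time to open
    pyautogui.typewrite(filename)
    pyautogui.press('enter')
```

Calling `save_page_as('results.html')` after the results have loaded would replace the four `pyautogui` lines in the snippet above; `'results.html'` is only a placeholder name.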


Full code, in which I also added a conditional wait to improve speed:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element(By.ID, "b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')

  • "Save as..." initially didn't work for me on macOS. But changing `pyautogui.hotkey('command','s')` to `pyautogui.keyDown('command') pyautogui.press('s')` fixed it. Works perfectly! (2 upvotes)

Jam*_*mes 5

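This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "Save as: Webpage, Complete" by parsing the HTML and downloading every loaded file (images, CSS, JS, etc.) to the same relative path.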

Most of the JavaScript will not work because cross-origin requests are blocked. But the content will look (mostly) the same.

This uses requests to save the loaded files, lxml to parse the HTML, and os for the path handling.

import os

import chromedriver_binary  # adds the bundled chromedriver to PATH
import requests
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
base = 'https://blast.ncbi.nlm.nih.gov/'

driver.get(URL)
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element(By.ID, "b1")
blast_button.click()

# page_source is read right after the click; add a wait here if the results are not loaded yet
content = driver.page_source
# write the page content
os.makedirs('page', exist_ok=True)
with open('page/page.html', 'w', encoding='utf-8') as fp:
    fp.write(content)

# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base)            # sets cookies

# parse html
h = html.fromstring(content)
# get css/js files loaded in the head.  skip anything loaded from outside sources
for hr in h.xpath('head//@href'):
    if not hr or hr.startswith('http'):
        continue
    local_path = 'page/' + hr
    res = sess.get(base + hr)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, 'wb') as fp:
        fp.write(res.content)

# get image/js files from the body.  skip anything loaded from outside sources
for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
        continue
    local_path = 'page/' + src
    print(local_path)
    res = sess.get(base + src)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, 'wb') as fp:
        fp.write(res.content)

You should end up with a folder named page containing a file page.html with the content you want.
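The script builds URLs and local paths by plain string concatenation, which can trip over references that start with `/`, carry a query string, or use a protocol-relative `//host/...` form. A minimal sketch of a more defensive mapping using the standard library's `urllib.parse`, under the assumption that only same-host assets should be mirrored; the helper `resolve_asset` and the example paths are hypothetical:

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_asset(base_url: str, ref: str, out_dir: str = 'page'):
    """Map an href/src value to (absolute_url, local_path).

    Returns None for references that point at another host or at no file.
    """
    absolute = urljoin(base_url, ref)
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None  # external host, data: URI, etc.
    # Drop the query string and leading slash to get a relative file path
    rel_path = urlparse(absolute).path.lstrip('/')
    if not rel_path or rel_path.endswith('/'):
        return None  # nothing that can be saved as a file
    return absolute, os.path.join(out_dir, *rel_path.split('/'))


base = 'https://blast.ncbi.nlm.nih.gov/'
css = resolve_asset(base, 'css/main.css?v=2')          # hypothetical asset
img = resolve_asset(base, '/images/logo.png')          # hypothetical asset
ext = resolve_asset(base, '//cdn.example.com/lib.js')  # other host
```

Inside either download loop, the returned pair could replace the manual `base + hr` and `'page/' + hr` expressions. Dropping the query string from the local path is a design choice here, not something the answer's script does.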