How to get raw HTML with absolute link paths when using "requests-html"

Question

Mez*_*ezo 5 python python-3.x python-requests python-requests-html

When I make a request to https://stackoverflow.com with the requests library:

import requests

page = requests.get(url='https://stackoverflow.com')
print(page.content)


I get the following:

<!DOCTYPE html>
    <html class="html__responsive html__unpinned-leftnav">
    <head>
        <title>Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a"> 
..........


The source here uses absolute paths. But when I fetch the same URL with requests-html and JS rendering:

from requests_html import HTMLSession

with HTMLSession() as session:
    page = session.get('https://stackoverflow.com')
    page.html.render()
    print(page.content)


I get the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>StackOverflow.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link href="lib/window.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif"/>
..........


Here the links are relative paths.

How can I get the source with absolute paths, as I do with requests, when using requests-html with JS rendering?

Answer 1

pas*_*cha 2

This should probably be a feature request for the requests-html developers. For now, though, we can achieve it with this hacky workaround:

from requests_html import HTMLSession
from lxml import etree

with HTMLSession() as session:
    html = session.get('https://stackoverflow.com').html
    html.render()

    # iterate over all links
    for link in html.pq('a'):
        if "href" in link.attrib:
            # Make links absolute
            link.attrib["href"] = html._make_absolute(link.attrib["href"])

    # Print html with only absolute links
    print(etree.tostring(html.lxml).decode())


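We modify the lxml tree underlying the html object: we iterate over all the links and use the html object's private _make_absolute function to rewrite each location as an absolute URL. Note that this only touches the href of <a> elements. Other tags that carry URLs, such as <link> and <script>, are left unchanged.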
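For illustration, here is a rough, standard-library-only sketch of the same idea that also covers src attributes (so <script> and <img> as well as <a> and <link>). It is not part of requests-html: the function name make_links_absolute and the regex are my own assumptions, and matching attributes with a regex is only an approximation of real HTML parsing (unquoted attribute values are not handled, for example). If you already have the lxml tree, its own link-rewriting helpers are likely the sturdier route.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper, not part of requests-html.
# Matches quoted href="..." / src="..." attributes; the lookbehind
# skips names such as data-src that merely end in "src".
_URL_ATTR = re.compile(
    r'''(?<![\w-])(?P<attr>href|src)=(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)


def make_links_absolute(html_text, base_url):
    """Return html_text with quoted href/src values resolved against base_url."""

    def _resolve(match):
        # urljoin leaves already-absolute URLs unchanged.
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{absolute}{quote}'

    return _URL_ATTR.sub(_resolve, html_text)


if __name__ == "__main__":
    sample = (
        '<script type="text/javascript" src="lib/jquery.js"></script>\n'
        '<link href="lib/dock.css" rel="stylesheet" type="text/css" />'
    )
    print(make_links_absolute(sample, "https://stackoverflow.com/"))
```

With requests-html you would pass the rendered markup and the page URL, e.g. make_links_absolute(page.html.html, page.url), assuming page is the response object from the question.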

Archived:	5 years, 6 months ago
Views:	1906
Last recorded:	5 years, 6 months ago