How to write a web proxy in Python

Question

Kan*_*dle 21 python proxy tornado python-requests

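I'm trying to write a web proxy in Python. The goal is to visit a URL like http://proxyurl/http://anothersite.com/ and see the content of http://anothersite.com just as you normally would. I've gotten decently far by abusing the requests library, but this isn't really the intended use of the requests framework. I've written proxies with Twisted before, but I'm not sure how to connect that to what I'm trying to do. Here's where I'm at so far...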

import os
import urlparse

import requests

import tornado.ioloop
import tornado.web
from tornado import template

ROOT = os.path.dirname(os.path.abspath(__file__))
path = lambda *a: os.path.join(ROOT, *a)

loader = template.Loader(path(ROOT, 'templates'))


class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            #external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            #absolute
            if slug.startswith('/'):
                slug = "{scheme}://{netloc}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    original_slug=slug,
                )
            #relative
            else:
                slug = "{scheme}://{netloc}{path}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    path=self.get_cookie('urlpath'),
                    original_slug=slug,
                )
        response = requests.get(slug)
        #get the headers
        headers = response.headers
        #get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)


application = tornado.web.Application([
    (r"/(.+)", ProxyHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

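Just a note: if there's start=true in the query string, I set cookies to preserve the scheme, netloc, and urlpath. That way, any relative or absolute link that subsequently hits the proxy uses those cookies to resolve the full URL.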
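
For illustration, here is a minimal Python 3 sketch of that resolution step (the code above is Python 2). It assumes the three cookie values are passed in as plain strings. `resolve_slug` is a hypothetical helper name, and it uses `urllib.parse.urljoin` in place of the string formatting above, so a stored path such as `/us/shop/index.html` is not glued directly onto a relative link.

```python
from urllib.parse import urljoin, urlunsplit


def resolve_slug(slug, scheme, netloc, urlpath):
    """Resolve a requested slug against the page URL remembered in cookies.

    scheme, netloc and urlpath stand in for the three cookie values;
    this helper is hypothetical and not part of the handler above.
    """
    # Full URLs need no resolution.
    if slug.startswith(("http://", "https://")):
        return slug
    # Rebuild the remembered page URL, then let urljoin apply the usual
    # relative-reference rules (a leading "/" replaces the whole path,
    # anything else replaces only the last path segment).
    base = urlunsplit((scheme, netloc, urlpath or "/", "", ""))
    return urljoin(base, slug)


print(resolve_slug("/header.png", "http", "anothersite.com", "/us/shop/"))
print(resolve_slug("style.css", "http", "anothersite.com", "/us/shop/index.html"))
```

Note that Tornado's `r"/(.+)"` pattern strips the leading slash before the handler sees the slug, so the handler cannot tell `/header.png` from `header.png` on its own.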

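With this code, if you go to http://localhost:8888/http://espn.com/?start=true you'll see the contents of ESPN. However, on the following site it doesn't work at all: http://www.bottegaveneta.com/us/shop/. My question is: what's the best way to do this? Is the way I'm currently implementing this robust, or are there some horrible pitfalls to doing it this way? If it is correct, why do certain sites like the one I pointed out not work at all?

Thank you for any help.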

Answer 1

cpb*_*pb2 7

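I recently wrote a similar web application. Note that this is the way I did it; I'm not saying you should do it like this. These are some of the pitfalls I came across:

Changing attribute values from relative to absolute

There is much more involved than just fetching a page and presenting it to the client. Many times you can't proxy a web page without any errors.

Why do certain sites like the one I pointed out not work at all?

Many web pages rely on relative paths to resources in order to display the page in a well-formatted manner. For example, this image tag: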

<img src="/header.png" />

will result in the client requesting:

http://proxyurl/header.png

which fails. The 'src' value should be converted to:

http://anothersite.com/header.png

So you need to parse the HTML document with something like BeautifulSoup, loop over all the tags, and check them for attributes such as:

'src', 'lowsrc', 'href'

and change their values accordingly, so that the tag becomes:

<img src="http://anothersite.com/header.png" />

This method applies to more tags than just the image tag. a, script, link, li and frame are a few others you should change as well.
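
As a rough sketch of that attribute rewrite in Python 3: the example below uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies, and the names `AttrRewriter` and `absolutize` are hypothetical. It re-emits the markup and resolves 'src', 'lowsrc' and 'href' on every tag against the page URL. The output is normalized rather than byte-identical (tag names are lowercased, attribute values are re-quoted), and processing instructions are dropped.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"src", "lowsrc", "href"}


class AttrRewriter(HTMLParser):
    """Re-emit HTML with URL attributes resolved against base_url."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities in text untouched.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))


def absolutize(html_text, base_url):
    parser = AttrRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(absolutize('<img src="/header.png" />', "http://anothersite.com/"))
```

If links should keep going through the proxy, the rewritten 'href' values would additionally need the proxy URL prepended; that step is left out here.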

HTML shenanigans

The previous method should get you far, but you're not done yet.

Both

<style type="text/css" media="all">@import "/stylesheet.css?version=120215094129002";</style>

and

<div style="position:absolute;right:8px;background-image:url('/Portals/_default/Skins/BE/images/top_img.gif');height:200px;width:427px;background-repeat:no-repeat;background-position:right top;" >

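are examples of code that is difficult to reach and modify using BeautifulSoup.

In the first example there is a CSS @import with a relative URI. The second one concerns the 'url()' function inside an inline CSS statement.

In my case, I ended up writing horrible code to manually modify these values. You may want to use a regex for this, but I'm not sure.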
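
For what it's worth, the regex route could look roughly like the Python 3 sketch below. The patterns and the name `absolutize_css` are assumptions made for illustration, not the code I actually used, and they only cover the common quoted and unquoted forms. Odd input (escaped quotes, parentheses inside URLs, comments) will slip through.

```python
import re
from urllib.parse import urljoin

# url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# @import "..." or @import '...' (the @import url(...) form is
# already covered by CSS_URL).
CSS_IMPORT = re.compile(r"""(@import\s+)(['"])(.*?)\2""", re.IGNORECASE)


def absolutize_css(css_text, base_url):
    """Resolve url() and @import targets in CSS text against base_url."""

    def fix_url(match):
        quote, target = match.group(1), match.group(2)
        return "url({q}{u}{q})".format(q=quote, u=urljoin(base_url, target))

    def fix_import(match):
        keyword, quote, target = match.group(1), match.group(2), match.group(3)
        return "{}{}{}{}".format(keyword, quote, urljoin(base_url, target), quote)

    css_text = CSS_URL.sub(fix_url, css_text)
    return CSS_IMPORT.sub(fix_import, css_text)


print(absolutize_css('@import "/stylesheet.css?version=120215094129002";',
                     "http://anothersite.com/"))
```

The same function can be run over the contents of `<style>` elements and over `style` attribute values, since both are plain CSS text.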

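Redirects

With Python-Requests or urllib2 you can easily follow redirects automatically. Just remember to save what the new (base) URI is; you'll need it for the 'changing attribute values from relative to absolute' operation.

You also need to take care of 'hard-coded' redirects, such as this one: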

<meta http-equiv="refresh" content="0;url=http://new-website.com/">

which needs to be changed to:

<meta http-equiv="refresh" content="0;url=http://proxyurl/http://new-website.com/">
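
A minimal Python 3 sketch of that rewrite, under a few assumptions: `proxy_meta_refresh` is a hypothetical name, `http://proxyurl/` is the placeholder proxy address used throughout this post, and the pattern only matches tags where `http-equiv` comes before `content` and the content value is quoted.

```python
import re
from urllib.parse import urljoin

# Group 1: everything up to and including "url=", group 2: the target.
META_REFRESH = re.compile(
    r"""(<meta[^>]*http-equiv=["']?refresh["']?[^>]*"""
    r"""content=["']\s*\d+\s*;\s*url=)([^"'>]+)""",
    re.IGNORECASE,
)


def proxy_meta_refresh(html_text, base_url, proxy_prefix="http://proxyurl/"):
    """Point meta-refresh redirects back through the proxy."""

    def fix(match):
        # Make the target absolute first, then prepend the proxy address.
        target = urljoin(base_url, match.group(2).strip())
        return match.group(1) + proxy_prefix + target

    return META_REFRESH.sub(fix, html_text)


print(proxy_meta_refresh(
    '<meta http-equiv="refresh" content="0;url=http://new-website.com/">',
    "http://anothersite.com/",
))
```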

Base tag

The base tag specifies the base URL/target for all relative URLs in a document. You probably want to change its value.
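
One way to do that, sketched in Python 3 with a hypothetical `set_base_href` helper: if the page already carries a `<base href>`, make that href absolute relative to the page URL; otherwise insert a new `<base>` right after the opening `<head>` tag. The regexes assume a quoted href value and reasonably tidy markup.

```python
import re
from html import escape
from urllib.parse import urljoin

# Group 2 is the existing href value of a <base> tag.
BASE_HREF = re.compile(r"""<base\b[^>]*?\bhref=(["'])(.*?)\1""", re.IGNORECASE)
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def set_base_href(html_text, page_url):
    """Ensure the document has an absolute <base href>."""
    existing = BASE_HREF.search(html_text)
    if existing:
        # Keep the author's base, but resolve it against the page URL.
        absolute = urljoin(page_url, existing.group(2))
        return (html_text[:existing.start(2)] + absolute
                + html_text[existing.end(2):])

    tag = '<base href="{}">'.format(escape(page_url, quote=True))
    head = HEAD_OPEN.search(html_text)
    if head:
        return html_text[:head.end()] + tag + html_text[head.end():]
    # No <head> at all: fall back to putting the tag first.
    return tag + html_text


print(set_base_href("<html><head><title>t</title></head></html>",
                    "http://anothersite.com/us/shop/"))
```

With a `<base>` in place the browser fetches relative resources straight from the original site, so they bypass the proxy. Whether that is acceptable depends on why you are proxying in the first place.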

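Finally done?

Nope. Some websites rely heavily on JavaScript to draw their content on screen. These sites are the hardest to proxy. I've been thinking about using something like PhantomJS or Ghost to fetch and evaluate the web pages and present the result to the client.

Maybe my source code can help you. You can use it in any way you want.

You could stick a `<base>` tag in the document head to fix the relative URLs in one fell swoop. (Unless one is already there!) (3 upvotes)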

Answer 2

and*_*oot 0

I don't think you need the last if block. This seems to work for me:

class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        print 'get: ' + str(slug)

        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            #external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:

            slug = "{scheme}://{netloc}/{original_slug}".format(
                scheme=self.get_cookie('scheme'),
                netloc=self.get_cookie('netloc'),
                original_slug=slug,
            )
            print self.get_cookie('scheme')
            print self.get_cookie('netloc')
            print self.get_cookie('urlpath')
            print slug
        response = requests.get(slug)
        #get the headers
        headers = response.headers
        #get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)
