Rom*_*sky 5 python scrapy scrapy-splash splash-js-render
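I'm using scrapy-splash to build my spider. I now need to maintain a session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which handles the Set-Cookie header. I know it handles the header because I set COOKIES_DEBUG = True, which makes CookiesMiddleware print output about each Set-Cookie header it sees.
The problem: when I add Splash to the picture, the Set-Cookie printouts disappear, and the response headers I actually get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}. These come from the Splash rendering service, which uses TwistedWeb, rather than from the target site.
Is there any directive that tells Splash to also give me the original response headers?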
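To get the original response headers, you can write a Splash Lua script; see the example in the scrapy-splash README:
Use a Lua script to get an HTML response with cookies, headers, body and method set to the correct values; the lua_source argument value is cached on the Splash server and is not sent with each request (this requires Splash 2.1+):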
    import scrapy
    from scrapy_splash import SplashRequest

    script = """
    function main(splash)
      splash:init_cookies(splash.args.cookies)
      assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
      })
      assert(splash:wait(0.5))

      -- the last history entry holds the response of the page that was loaded
      local entries = splash:history()
      local last_response = entries[#entries].response
      return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
      }
    end
    """

    class MySpider(scrapy.Spider):
        # ...

        # The README elides the enclosing method; start_requests and
        # self.start_urls are assumed here so that the yield is valid.
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_result,
                    endpoint='execute',
                    cache_args=['lua_source'],
                    args={'lua_source': script},
                    headers={'X-My-Header': 'value'},
                )

        def parse_result(self, response):
            # here response.body contains result HTML;
            # response.headers are filled with headers from last
            # web page loaded to Splash;
            # cookies from all responses and from JavaScript are collected
            # and put into Set-Cookie response header, so that Scrapy
            # can remember them.
            pass
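To check that the Set-Cookie headers really come back once the Lua script is in place, the header values can be parsed inside parse_result. The following is only a minimal sketch: the helper name cookies_from_set_cookie is hypothetical and not part of Scrapy or scrapy-splash, and it relies on the standard library's http.cookies module rather than on Scrapy's own cookie jar.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(values):
    """Parse raw Set-Cookie header values into a {name: value} dict.

    Hypothetical helper for debugging; not part of Scrapy or scrapy-splash.
    """
    cookies = {}
    for raw in values:
        # Scrapy stores header values as bytes, so decode before parsing.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        parsed = SimpleCookie()
        parsed.load(text)
        for name, morsel in parsed.items():
            cookies[name] = morsel.value
    return cookies


# Inside the spider this could be called with Scrapy's Headers.getlist:
#     cookies_from_set_cookie(response.headers.getlist("Set-Cookie"))
sample = [b"sessionid=abc123; Path=/; HttpOnly", b"csrftoken=xyz; Path=/"]
print(cookies_from_set_cookie(sample))
```

If the dictionary is empty for a page that is known to set cookies, the Lua script is probably not returning the headers and cookies fields.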
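scrapy-splash also provides built-in helpers for cookie handling; they are enabled in the SplashRequest example above once scrapy-splash is configured as described in its README.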
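For reference, the README configuration amounts to a few entries in the project's settings.py. The sketch below reproduces them as I recall them from the scrapy-splash README, so check the middleware names and priorities against the version you have installed. The SPLASH_URL value is an assumption (a Splash instance listening locally on port 8050).

```python
# settings.py (sketch; verify against the scrapy-splash README for your version)

# Assumed address of a locally running Splash instance.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    # SplashCookiesMiddleware must run before SplashMiddleware so that
    # cookies are passed to the Lua script and read back from its result.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    # Needed for cache_args=['lua_source'] to save bandwidth.
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Same debugging switch the question uses to see Set-Cookie handling.
COOKIES_DEBUG = True
```

With SplashCookiesMiddleware enabled, the cookies returned by the Lua script's cookies field are merged back into Scrapy's session, so the session is maintained across requests.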