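I'm trying to download an HTML page and then run a regular expression over it using Racket. This works for some pages but not for others. I eventually worked out why: some pages are gzip-compressed, and when I make an HTTP GET request, get-pure-port hands me the gzipped page, which of course looks like gibberish.
My question: is there a way to decompress the page in Racket so that I can run the regular expression on it?
Thanks.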
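Although a well-behaved web server won't give you a gzip response unless you supplied an Accept-Encoding: gzip request header, not every web server is well-behaved.
So you need to look for a Content-Encoding: gzip response header and use gunzip-through-ports. (You can do the same for Content-Encoding: deflate and inflate.)
Of course, to look for response headers you can no longer use get-pure-port; you have to use get-impure-port and purify-port. Pseudocode: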
    #lang racket
    (require net/url
             net/head
             file/gunzip)
    (define u (string->url "http://www.wikipedia.org"))
    ;; Request the page, advertising that we accept gzip.
    (define in (get-impure-port u '("Accept-Encoding: gzip")))
    ;; Read the response headers; the port is left positioned at the body.
    (define h (purify-port in))
    (define out (open-output-bytes))
    ;; Decompress only if the server says the body is gzipped.
    (match (extract-field "Content-Encoding" h)
      ["gzip" (gunzip-through-ports in out)]
      [_ (copy-port in out)])
    (define bstr (get-output-bytes out))
    (close-input-port in)
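P.S. I think the version above is easier to explore on a first attempt. For production code, though, I would probably use call/input-url to take care of closing the port: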
    #lang racket
    (require net/url
             net/head
             file/gunzip)
    (define u (string->url "http://www.wikipedia.org"))
    (define bstr
      ;; call/input-url closes the port once the handler returns.
      (call/input-url u
                      (curryr get-impure-port '("Accept-Encoding: gzip"))
                      (lambda (in)
                        (define h (purify-port in))
                        (define out (open-output-bytes))
                        (match (extract-field "Content-Encoding" h)
                          ["gzip" (gunzip-through-ports in out)]
                          [_ (copy-port in out)])
                        (get-output-bytes out))))
P.P.S.
That version might be clearer if it didn't use curryr and an anonymous function. For example:
    #lang racket
    (require net/url
             net/head
             file/gunzip)
    ;; Like get-impure-port, but supplies an Accept-Encoding: gzip
    ;; request header.
    (define (get-impure-port/gzip u)
      (get-impure-port u '("Accept-Encoding: gzip")))
    ;; Read the response headers using purify-port, then read the
    ;; response entity, handling gzip encoding.
    (define (read-response in)
      (define h (purify-port in))
      (define out (open-output-bytes))
      (match (extract-field "Content-Encoding" h)
        ["gzip" (gunzip-through-ports in out)]
        [_ (copy-port in out)])
      (get-output-bytes out))
    (define bstr
      (call/input-url (string->url "http://www.wikipedia.org")
                      get-impure-port/gzip
                      read-response))
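To tie this back to the deflate case and to the regular expression from the question, here is a minimal sketch that can be tried without any network access. The names decode-body, gzip-bytes, sample, gz-header and plain-header are made up for illustration, and the header strings are hand-written stand-ins for what purify-port would return. The "deflate" branch assumes the server sends a raw deflate stream, which is what inflate reads; some servers wrap the data in a zlib header, and that would need extra handling not shown here.

```racket
#lang racket
(require net/head
         file/gzip
         file/gunzip)

;; Hypothetical helper: decode a response body according to the
;; Content-Encoding value found in the header string h.
(define (decode-body h in)
  (define out (open-output-bytes))
  (match (extract-field "Content-Encoding" h)
    ["gzip"    (gunzip-through-ports in out)]
    ;; Assumption: the body is a raw deflate stream (no zlib wrapper).
    ["deflate" (inflate in out)]
    [_         (copy-port in out)])
  (get-output-bytes out))

;; Build a gzip-compressed body in memory, standing in for a server.
(define (gzip-bytes bstr)
  (define out (open-output-bytes))
  (gzip-through-ports (open-input-bytes bstr) out #f 0)
  (get-output-bytes out))

(define sample #"<html><title>Hello</title></html>")
;; Hand-written headers, in the shape purify-port returns.
(define gz-header "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n")
(define plain-header "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")

(define decoded
  (decode-body gz-header (open-input-bytes (gzip-bytes sample))))

;; The decoded body is a byte string, so use a byte regexp on it.
(regexp-match #rx#"<title>(.*?)</title>" decoded)
```

Since get-output-bytes returns a byte string, either match with a byte regexp as above or convert with bytes->string/utf-8 first if the page is known to be UTF-8.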