How do I extract elements from HTML in Racket?

pvd*_*pvd 2 scheme racket

I want to extract URLs from a reddit page. My code is:

#lang racket

(require net/url)
(require html)

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))

(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))

(close-input-port in)

The value of content-0 above is:

(element
 (location 0 0 15)
 (location 0 0 82)
...

I would like to know how to extract specific content from it.

Gre*_*ott 6

  1. It is usually more convenient to work with HTML as an x-expression than as the structs produced by the html module.

  2. You should also use call/input-url, which closes the port automatically.

You can combine both ideas by defining a read-html-as-xexpr function and using it:

#lang racket/base

(require html
         net/url
         xml)

(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(call/input-url reddit
                get-pure-port
                read-html-as-xexpr)

This returns one large x-expression, such as:

'(html
  ((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
  (head
   ()
   (title () "programming: search results")
   (meta
    ((content " reddit, reddit.com, vote, comment, submit ")
     (name "keywords")))
   (meta
    ((content "reddit: the front page of the internet") (name "description")))
   (meta ((content "origin") (name "referrer")))
   (meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
... snip ...
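To experiment with this conversion without touching the network, the same function can be fed a string port. The following is only a sketch: the HTML string is made up for illustration, and the exact shape of the result depends on how the html module parses it, so no output is shown.

```racket
#lang racket/base

(require html
         xml)

;; Same helper as above: wrap the parsed content in a dummy root element,
;; convert it to an x-expression, and take the first child of that root.
(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

;; Hypothetical sample document, standing in for a downloaded page.
(define sample
  "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>")

(define xe (read-html-as-xexpr (open-input-string sample)))

;; The head of the x-expression is the tag name of the top-level element.
(car xe)
```

Working from a string like this makes it easier to try out extraction code before pointing it at a live page.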

How do you extract specific pieces?

  • For simple HTML whose overall structure I don't expect to change, I'll often use match.

  • However, a more correct and robust approach is to use the xml/path module.
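
As a rough comparison of the two approaches, here is a minimal sketch on a small hand-written x-expression. The `page` value and its example.com links are made up for illustration and are not the reddit output; `match` comes from racket/match, and `se-path*` and `se-path*/list` come from xml/path.

```racket
#lang racket/base

(require racket/match
         xml/path)

;; A small hand-written x-expression standing in for a parsed page (hypothetical data).
(define page
  '(html ()
     (head () (title () "search results"))
     (body ()
       (p () (a ((href "http://example.com/one")) "One"))
       (p () (a ((href "http://example.com/two")) "Two")))))

;; match: fine when the exact shape is known in advance,
;; but the pattern breaks as soon as the page layout changes.
(define title
  (match page
    [`(html ,_ (head ,_ (title ,_ ,text)) ,_ ...) text]
    [_ #f]))

;; xml/path: finds matching elements wherever they sit in the tree.
(define first-link (se-path* '(a #:href) page))     ; first href only
(define all-links (se-path*/list '(a #:href) page)) ; every href

title      ; "search results"
first-link ; "http://example.com/one"
all-links  ; '("http://example.com/one" "http://example.com/two")
```

The `match` pattern has to spell out the path from the root down to the title, whereas the xml/path queries only name the element and attribute of interest.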



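Update: I noticed that your question starts by asking how to extract URLs. Here is an updated example that uses se-path*/list to get the href attribute of every <a> element: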

#lang racket/base

(require html
         net/url
         xml
         xml/path)

(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(define xe (call/input-url reddit
                           get-pure-port
                           read-html-as-xexprs))

(se-path*/list '(a #:href) xe)

Result:

'("#content"
  "http://www.reddit.com/r/announcements/"
  "http://www.reddit.com/r/Art/"
  "http://www.reddit.com/r/AskReddit/"
  "http://www.reddit.com/r/askscience/"
  "http://www.reddit.com/r/aww/"
  "http://www.reddit.com/r/blog/"
  "http://www.reddit.com/r/books/"
  "http://www.reddit.com/r/creepy/"
  "http://www.reddit.com/r/dataisbeautiful/"
  "http://www.reddit.com/r/DIY/"
  "http://www.reddit.com/r/Documentaries/"
  "http://www.reddit.com/r/EarthPorn/"
  "http://www.reddit.com/r/explainlikeimfive/"
  "http://www.reddit.com/r/Fitness/"
  "http://www.reddit.com/r/food/"
  ... snip ...