mad*_*ant 2 java clojure web-scraping jsoup
I am using JSoup from Clojure to parse an HTML string. The source code is below.
Dependencies
:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]
Source code
(require '[clojure.string :as str])
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))
(fetch_html HTML)
Expected result
{:title "Website title",
:paragraph ["Sample paragraph number 1"
"Sample paragraph number 2"]}
Unfortunately, the result is not what I expected:
user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}
(.getElementsByTag ...) returns a collection of Element objects, not strings. You need to call the .text() method on each element to get its text value. I am using Jsoup version 1.13.1.
(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))
(def HTML (str "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>"))
(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))
(fetch_html HTML)
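As a variation on the answer's approach, here is a minimal sketch that lets Jsoup collect the text itself instead of mapping .text over the elements. It assumes the same Jsoup 1.13.1 dependency and relies on the `select` and `eachText` methods of the Jsoup API. The names `sample-html` and `fetch-html-texts` are made up for illustration and do not come from the question. Note that, according to the Jsoup documentation, `eachText` leaves out elements that have no text, so it is not an exact drop-in for the `mapv` version when empty paragraphs matter.

```clojure
(import '(org.jsoup Jsoup))

;; Same sample document as in the question.
(def sample-html
  "<html><head><title>Website title</title></head>
<body><p>Sample paragraph number 1 </p>
<p>Sample paragraph number 2</p>
</body></html>")

(defn fetch-html-texts
  "Hypothetical variant of fetch_html: returns the title and the text of
  every <p> element as plain Clojure data."
  [html]
  (let [soup (Jsoup/parse html)]
    {:title     (.title soup)
     ;; .select takes a CSS query; .eachText returns a java.util.List of
     ;; strings, which vec copies into a Clojure vector.
     :paragraph (vec (.eachText (.select soup "p")))}))

(fetch-html-texts sample-html)
```

Wrapping the result in `vec` keeps the returned map free of Java collection types, which makes it easier to compare against the expected result shown in the question.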
You could also consider Reaver, a Clojure library that wraps JSoup, or any of the other wrappers that people have suggested.