Scraping an HTML table in Common Lisp?

5 common-lisp

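I want to extract some information from a web page where the data is contained in an HTML <table>. How can I extract all of the table data into a nice |-separated file?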

Author|Book|Year|Comments
Bill Bryson|Short History of Nearly Everything|2004
Stephen Hawking|A Brief History of Time|1998|Still haven't read.

Ideally, I would like a function that takes a URL and an output file as arguments and produces the output above.

(defun extract-table (url filename)
  (extract-from-html-table (fetch-web-page url)))

(extract-table "http://www.mypage.com" "output.txt")

Sample HTML input for the output above:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Lisp</title>
</head>
<body>
<h1>Welcome to Lisp</h1>
<table class="any" style="font-size: 14px;">
  <TR class="header">
    <td>Author</td>
    <TD>Book</TD>
    <td>Year</td>
    <td>Comments</td>
  </TR>
  <tr class="odd">
    <td>Bill Bryson</td>
    <td>Short History of Nearly Everything</td>
    <td>2004</td>
  </tr>
  <tr>
    <td>Stephen Hawking</td>
    <td>A Brief History of Time</td>
    <td>1998</td>
    <td>Still haven't read.</td>
  </tr>
</table>
</body>
</html>

Dir*_*irk 7

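Start with Drakma to fetch the data. To parse it, you might find cxml helpful, although cxml is an XML parser and so only works if the page is well-formed XML. Better still, you can use closure-html, which should parse arbitrary HTML 4. The closure-html page on common-lisp.net has a screen-scraping example.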
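
To make the suggestion concrete, here is a minimal sketch of the function the question asks for, assuming Quicklisp is installed and that closure-html's LHTML builder is used, so the parsed page is a plain nested list of the form (name attributes child...). The helper names collect-elements, node-text and row-fields are made up for this illustration and are not part of either library. It has not been run against a live URL, so treat it as a starting point rather than a finished solution.

```lisp
;; Assumption: Quicklisp is available to load the two libraries.
(ql:quickload '(:drakma :closure-html) :silent t)

(defun collect-elements (node names)
  "Return the LHTML elements beneath NODE whose name is in NAMES, in
document order. Does not descend into a matching element, so rows of a
table nested inside a cell are not collected separately."
  (cond ((not (consp node)) nil)
        ((member (first node) names) (list node))
        (t (loop for child in (cddr node)
                 append (collect-elements child names)))))

(defun node-text (node)
  "Concatenate all text found beneath NODE."
  (cond ((stringp node) node)
        ((consp node) (format nil "~{~A~}" (mapcar #'node-text (cddr node))))
        (t "")))

(defun row-fields (row)
  "Return the trimmed text of each TD or TH cell in ROW as a list of strings."
  (mapcar (lambda (cell)
            (string-trim '(#\Space #\Tab #\Newline #\Return)
                         (node-text cell)))
          (collect-elements row '(:td :th))))

(defun extract-table (url filename)
  "Fetch URL, parse it as HTML, and write every table row to FILENAME
as one line of |-separated fields."
  (let* ((html (drakma:http-request url))
         (document (chtml:parse html (chtml:make-lhtml-builder))))
    (with-open-file (out filename :direction :output
                                  :if-exists :supersede)
      (dolist (row (collect-elements document '(:tr)))
        (format out "~{~A~^|~}~%" (row-fields row))))))
```

Called as (extract-table "http://www.mypage.com" "output.txt"), this is intended to write one line per <tr>, which for the sample page in the question would be the three lines shown there. Searching for rows recursively means it does not matter whether the parser wraps them in an implied TBODY. Note that it collects the rows of every table on the page and does not escape a | that appears inside a cell; both would need handling for real-world pages.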