使用python获取<a>标签的内容

Dwa*_*ter 3 python sgml html-parsing

假设我有html读入我的程序,如下所示:

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
Run Code Online (Sandbox Code Playgroud)

如何获取文本节点的内容?我想最终得到的是在终端上打印类似于这一行的东西:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT
Run Code Online (Sandbox Code Playgroud)

到目前为止,我有以下代码提取href链接,但我不知道如何提取数据本身.我正在考虑handle_data(self, data)从sgmllib.py模块覆盖,但到目前为止,我似乎无法想到一种方法.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)
Run Code Online (Sandbox Code Playgroud)

谢谢!

Ale*_*lli 8

最简单的可能是BeautifulSoup(确保使用3.0.8或更高3.0.*版本,不是 3.1.*,除非你使用的是Python 3 - 请看这里!).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string
Run Code Online (Sandbox Code Playgroud)

BeautifulSoup生成unicode字符串 - 如果这是一个问题,请确保编码它们,因为您希望以您希望的方式获取字节字符串!