html5lib/lxml examples for BeautifulSoup users?

Chr*_*vey 1 python lxml beautifulsoup html5lib

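I'm trying to wean myself off BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't figure out how to use the "find" and "findall" operators.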

By looking at the docs for html5lib, I came up with this test program:

import cStringIO

f = cStringIO.StringIO()
f.write("""
  <html>
    <body>
      <table>
       <tr>
          <td>one</td>
          <td>1</td>
       </tr>
       <tr>
          <td>two</td>
          <td>2</td
       </tr>
      </table>
    </body>
  </html>
  """)
f.seek(0)

import html5lib
from html5lib import treebuilders
from lxml import etree  # why?

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)

root = etree_document.getroot()

root.find(".//tr")

But this returns None. I noticed that if I call etree.tostring(root) I get all of my data back, but every tag is prefixed with html (e.g. <html:table>). However, root.find(".//html:tr") throws a KeyError.

Can someone put me back on the right track?

Chr*_*ris 6

You can turn off namespaces with the following call: etree_document = html5lib.parse(t, treebuilder="lxml", namespaceHTMLElements=False)
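
To make the namespace behaviour concrete, here is a minimal sketch (not from the original answer) that compares the namespace-free parse with two ways of querying a namespaced tree. The names `HTML`, `XHTML_NS`, and the `h` prefix are chosen purely for illustration, and the markup is a cleaned-up copy of the table from the question.

```python
import html5lib

# Namespace that html5lib assigns to HTML elements by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative input: a well-formed copy of the table from the question.
HTML = """
<html><body>
<table>
  <tr><td>one</td><td>1</td></tr>
  <tr><td>two</td><td>2</td></tr>
</table>
</body></html>
"""

# Option 1: ask html5lib not to namespace HTML elements,
# so plain tag names work in find/findall.
plain_tree = html5lib.parse(HTML, treebuilder="lxml", namespaceHTMLElements=False)
plain_rows = plain_tree.findall(".//tr")

# Option 2: keep the default namespace and spell it out in the query,
# either in {namespace}tag form or through an explicit prefix map.
ns_tree = html5lib.parse(HTML, treebuilder="lxml")
clark_rows = ns_tree.findall(".//{%s}tr" % XHTML_NS)
prefixed_rows = ns_tree.findall(".//h:tr", namespaces={"h": XHTML_NS})

print(len(plain_rows), len(clark_rows), len(prefixed_rows))
```

The `html:` prefix that shows up in the serialized output is not known to `find` unless a prefix map is passed in, which would be consistent with the lookup failure described in the question.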


Tim*_*ara 5

In general, use lxml.html for HTML. Then you don't have to build your own parser or worry about namespaces.

>>> import lxml.html as l
>>> doc = """
...    <html><body>
...    <table>
...      <tr>
...        <td>one</td>
...        <td>1</td>
...      </tr>
...      <tr>
...        <td>two</td>
...        <td>2</td
...      </tr>
...    </table>
...    </body></html>"""
>>> doc = l.document_fromstring(doc)
>>> doc.findall('.//tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS
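
For readers coming from BeautifulSoup, the following sketch shows how `find` and `findall` on an lxml.html document might stand in for `soup.find("tr")` and `soup.findAll("tr")`. It is an illustration under assumptions, not part of the original answer: the `HTML` string is a well-formed copy of the sample table, and `table` is a hypothetical name for the extracted result.

```python
import lxml.html

# Illustrative input: a well-formed copy of the sample table.
HTML = """
<html><body>
<table>
  <tr><td>one</td><td>1</td></tr>
  <tr><td>two</td><td>2</td></tr>
</table>
</body></html>
"""

doc = lxml.html.document_fromstring(HTML)

first_row = doc.find(".//tr")     # roughly soup.find("tr"): first match or None
all_rows = doc.findall(".//tr")   # roughly soup.findAll("tr"): list of matches

# Map the first cell of each row to the second cell.
table = {}
for row in all_rows:
    label, value = [td.text_content().strip() for td in row.findall("td")]
    table[label] = value

print(table)
```

Note the leading `.//` in the paths: a bare `"tr"` would only match direct children of the element being searched.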

FYI, lxml.html also lets you use CSS selectors, which I find to be an easier syntax.

>>> doc.cssselect('tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS
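
One caveat worth sketching: in recent lxml releases, `cssselect()` relies on the separately installed `cssselect` package and raises ImportError when it is missing. The example below is a hedged illustration rather than part of the original answer. It selects the first cell of every row and falls back to a similar XPath query if the package is unavailable. The names `HTML` and `labels` are illustrative.

```python
import lxml.html

# Illustrative input: a well-formed copy of the sample table.
HTML = """
<html><body>
<table>
  <tr><td>one</td><td>1</td></tr>
  <tr><td>two</td><td>2</td></tr>
</table>
</body></html>
"""

doc = lxml.html.document_fromstring(HTML)

try:
    # CSS selector: the first cell of every row.
    cells = doc.cssselect("tr td:first-child")
except ImportError:
    # Assumed fallback when the cssselect package is not installed:
    # a similar query written as XPath.
    cells = doc.xpath("//tr/td[1]")

labels = [cell.text_content() for cell in cells]
print(labels)
```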