Extracting text with lxml.html

Question

I have an HTML file:

<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>

I want to extract the text as:

somestr ^1 anotherstr

But I can't figure out how to do it. I wrote a to_sup() function that converts a numeric string to superscript, so the closest I've got is:

for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,

But an ElementStringResult doesn't seem to provide any way to get the tag name, so I'm a bit lost. Any ideas on how to solve this?

Answer 1

Rob*_*ujo 8

First solution (joins the text with no separator; see also "python [lxml] - cleaning out html tags"):

   import lxml.html
   # html_string holds the HTML source as a string
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

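To see what the first solution returns for the HTML in the question, here is a minimal sketch. The names html_string, raw_text and normalized are only illustrative, and the whitespace collapsing at the end is my own addition, not part of the answer:

```python
import lxml.html

# The HTML from the question, held in a string (html_string is an assumed name).
html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# text_content() concatenates every text node with no separator,
# so the indentation and line breaks of the source are kept as-is.
raw_text = document.text_content()

# Collapse runs of whitespace into single spaces for display.
normalized = " ".join(raw_text.split())
print(normalized)
```

Note that the tag information is gone at this point, so the 1 can no longer be told apart from the surrounding text.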

This one helped me, since it joins the text the way I needed:

   from lxml import etree
   # //text() returns every text node as a separate string
   print("\n".join(etree.XPath("//text()")(document)))

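As far as I know, the strings that lxml returns for text() are "smart strings": they carry a getparent() method plus is_text / is_tail flags, so the tag name asked about in the question can be recovered from them. A rough sketch under that assumption, using a one-line version of the question's HTML (html_string, pieces and result are illustrative names):

```python
import lxml.html

# One-line version of the HTML from the question (assumed input).
html_string = "<html><p>somestr <sup>1</sup> anotherstr</p></html>"
document = lxml.html.document_fromstring(html_string)

pieces = []
for node in document.xpath("//p//text()"):
    parent = node.getparent()
    # is_text: the string is the element's own text.
    # is_tail: the string follows the element's closing tag.
    if parent.tag == "sup" and node.is_text:
        pieces.append("^" + node.strip())
    else:
        pieces.append(node.strip())

# Drop empty pieces and join with single spaces.
result = " ".join(piece for piece in pieces if piece)
print(result)
```

The is_text check matters because the text after the closing sup tag is stored as the tail of sup, so its getparent() is also the sup element.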

Answer 2

Fre*_*Foo 4

Just don't call text() on the sup nodes in the XPath, so that they come back as elements rather than strings.

for x in doc.xpath("//p/text()|//sup"):
    try:
        print(to_sup(x.text))
    except AttributeError:
        print(x)

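For a self-contained run of this approach, here is a sketch. The to_sup() below is a hypothetical stand-in, since the asker's own function was never shown, and it uses an isinstance check in place of the try/except:

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): maps ASCII digits
# to their Unicode superscript counterparts.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(digits):
    return digits.translate(SUPERSCRIPTS)


# One-line version of the HTML from the question (assumed input).
html_string = "<html><p>somestr <sup>1</sup> anotherstr</p></html>"
doc = lxml.html.document_fromstring(html_string)

parts = []
# The union yields strings for the p text nodes and elements for sup,
# in document order.
for x in doc.xpath("//p/text()|//sup"):
    if isinstance(x, str):
        parts.append(x.strip())
    else:
        parts.append(to_sup(x.text))

result = " ".join(parts)
print(result)
```

The isinstance check avoids hiding an AttributeError raised inside to_sup() itself, which the try/except version would swallow.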
