Get all text inside a tag in lxml

Kev*_*rke 64 python parsing lxml

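I'd like to write a code snippet that grabs all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?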

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"


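Does text_content() do what you need?

text_content() removes all markup; the OP wants to keep the markup that is inside the tag. (5 upvotes)
@benselme Why, when I use `text_content`, does it say `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'`? (5 upvotes)
@roger `text_content()` is only available if your tree is HTML (i.e. if it was parsed with the methods in `lxml.html`). (5 upvotes)
As Louis pointed out, this only works for trees parsed with `lxml.html`. Arthur Debert's solution using `itertext()` is general. (2 upvotes)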
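
To make the point in the comments above concrete, here is a minimal sketch of the difference between the two parsers. The sample markup and the variable names are illustrative assumptions, not taken from the question.

```python
from lxml import etree
import lxml.html

# Parsed with lxml.html: the element is an HtmlElement and has text_content().
html_node = lxml.html.fromstring(
    '<div>Text outside tag <span>Text inside tag</span></div>')
html_text = html_node.text_content()

# Parsed with lxml.etree: a plain _Element, which has no text_content().
xml_node = etree.fromstring(
    '<content>Text outside tag <div>Text inside tag</div></content>')
try:
    xml_node.text_content()
    raised = False
except AttributeError:
    raised = True

print(html_text)  # all markup is dropped
print(raised)
```

Either way, text_content() discards the inner tags, so it does not cover cases 1 and 3 of the question.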

Just use the node.itertext() method, as in:

 ''.join(node.itertext())


This works well, but it strips any tags you might want to keep. (3 upvotes)
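
As a rough illustration of that comment, the sketch below runs itertext() over the three cases from the question. The whitespace around the content has been removed from the inputs to keep the results easy to read; that simplification is an assumption of this example.

```python
from lxml import etree

cases = [
    '<content><div>Text inside tag</div></content>',
    '<content>Text with no tag</content>',
    '<content>Text outside tag <div>Text inside tag</div></content>',
]

# itertext() yields every text and tail string inside the node, in document
# order, but none of the markup.
texts = [''.join(etree.fromstring(xml).itertext()) for xml in cases]

for text in texts:
    print(repr(text))
```

Only case 2 matches what the question asks for; in cases 1 and 3 the <div> tags are lost.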

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))


Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)


Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

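The code is broken and produces duplicated content: >>> stringify_children(lxml.html.fromstring('A<div>B</div>C')) 'A<p>A</p>B<div>B</div>CC' (5 upvotes)
You should add `tostring(c, encoding=str)` to make this run on Python 3. (3 upvotes)
@delnan. It isn't needed, `tostring` already handles the recursive case. You made me doubt it, so I tried it on real code and updated the answer with an example. Thanks for pointing it out. (2 upvotes)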

The following snippet, which relies on a Python generator, works well and is efficient.

''.join(node.itertext()).strip()

A version of albertov's stringify_children that fixes the bug reported by hoju:

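def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # with_tail=False serializes each child without its tail; the tail is
    # appended separately, so nothing is emitted twice.
    # encoding='unicode' makes tostring return str instead of bytes (Python 3).
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node)),
            (node.tail,)) if chunk)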


Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s


Or in one line:

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))


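The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail of node is not interesting in this case, because it comes "after" the closing tag. Note that the encoding argument can be changed as needed.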

Another possible solution is to serialize the node itself and then strip off the opening and closing tags:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]


This is somewhat horrible. The code is only correct if node has no attributes, and I don't think anyone would want to use it even then.
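
To show why the attribute caveat matters, here is a small sketch that runs the slicing approach on a node with and without an attribute. The function name `stringify_by_slicing` and the sample markup are made up for this example.

```python
from lxml import etree

def stringify_by_slicing(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    # Skips "<tag>" at the front and "</tag>" at the back, assuming the
    # opening tag is exactly "<tag>" with nothing else inside it.
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

plain = etree.fromstring('<content>Text <div>inside</div></content>')
with_attr = etree.fromstring('<content id="x">Text <div>inside</div></content>')

ok = stringify_by_slicing(plain)
broken = stringify_by_slicing(with_attr)
print(repr(ok))
print(repr(broken))  # part of the opening tag leaks into the result
```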

One of the simplest snippets, which actually worked for me and follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is:

etree.tostring(html, method="text")


where html is the node/tag whose complete text you are trying to read. Note, however, that it does not get rid of script and style tags.

This strips the HTML tags. (4 upvotes)
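
A minimal sketch of what method="text" returns, using an XML snippet modelled on case 3 of the question. The variable names are illustrative assumptions.

```python
from lxml import etree

node = etree.fromstring(
    '<content>Text outside tag <div>Text inside tag</div></content>')

# With the default encoding, tostring() returns bytes.
as_bytes = etree.tostring(node, method="text")

# encoding="unicode" returns a str instead.
as_text = etree.tostring(node, method="text", encoding="unicode")

print(as_bytes)
print(as_text)
```

As the comment above says, the tags are gone from the output, so this matches the question only for case 2.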

from urllib.request import urlopen  # Python 3 replacement for urllib2
from lxml import etree
url = 'some_url'

Fetch the URL:

test = urlopen(url)
page = test.read()

Get all of the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector:

# xpath() returns a list of matches; take the first one
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.

So you can use an XPath text expression to extract the text content of a tag, and tostring() to extract the tag together with its contents:

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath("string(//content)")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode', with_tail=False)[len('<content>'):-len('</content>')]

The slicing on the last line isn't pretty, and it assumes <content> has no attributes, but it works.

