Get all text from an XML document?

Question

Ric*_*ard 3 python xml lxml

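How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example, but using Python?

I'd like to replace the XML tags with a single space.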

Answer 1

Pra*_*mar 6

I really like BeautifulSoup, and I'd rather not use regular expressions on HTML if I can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup

# txt is a string containing your XML document; the "xml" parser requires lxml
soup = BeautifulSoup(txt, "xml")
page_text = soup.find_all(string=True)  # every text node, in document order
print(' '.join(page_text))


Of course, you can (and should) use BeautifulSoup to navigate the page and find what you are looking for.

Answer 2

sch*_*o72 6

Using the stdlib xml.etree:

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
# encoding='unicode' returns a str; encoding='utf-8' would return bytes
print(ET.tostring(tree.getroot(), encoding='unicode', method='text'))

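Note that `method='text'` concatenates the text nodes with nothing between them. If a space between the pieces is wanted, as the question asks, one possible sketch uses `Element.itertext()` from the same stdlib module. The helper name `all_text` and the sample document are made up for illustration, and the helper also collapses runs of whitespace, which may or may not be what you want.

```python
import xml.etree.ElementTree as ET

def all_text(xml_string):
    """Return every text node of the document joined by single spaces."""
    root = ET.fromstring(xml_string)
    # itertext() yields text and tail strings in document order;
    # split() drops whitespace-only pieces and collapses indentation.
    words = []
    for piece in root.itertext():
        words.extend(piece.split())
    return ' '.join(words)

sample = "<doc><title>Hello</title><body>big <b>wide</b> world</body></doc>"
print(all_text(sample))  # Hello big wide world
```

For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.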

Answer 3

kir*_*sos -2

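Edit: This answer was posted back when I thought one-space indentation was normal, and as the comments mention, it is not a good answer. See the others for better solutions. It is left here only for archival reasons; do not follow it!

You asked for lxml: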

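# root is an lxml element, e.g. root = lxml.etree.fromstring(xml_bytes)
reslist = list(root.iter())
# element.text is None for elements without leading text, so skip those
result = ' '.join([element.text for element in reslist if element.text])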


Or:

result = ''
for element in root.iter():
    if element.text:  # element.text is None when there is no leading text
        result += element.text + ' '
result = result[:-1]  # Remove trailing space

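One likely reason this approach falls short: `element.text` only holds the text before an element's first child, while text that follows a child's closing tag is stored in that child's `.tail`. A small sketch of the difference, using the stdlib `xml.etree.ElementTree` so it runs without extra installs (lxml elements expose the same `.text`, `.tail`, `iter()` and `itertext()`); the sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>one <b>two</b> three</p>")

# .text only: the " three" after </b> is stored in b.tail and is missed
text_only = [el.text for el in root.iter() if el.text]

# itertext() walks both .text and .tail in document order
everything = list(root.itertext())

print(text_only)   # ['one ', 'two']
print(everything)  # ['one ', 'two', ' three']
```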

Archived:	12 years, 5 months ago
Viewed:	9359 times
Last activity:	6 years, 4 months ago