How can I efficiently extract <![CDATA[]> content from XML with Python?

Question

new*_*hon 3 python xml lxml python-2.7 pandas

I have the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>

What is the most efficient way to parse this and extract the content between <![CDATA[ and ]]> into a list? For example:

[@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING      Ugh     YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt      @username Shout out to me????       ]

This is what I tried:

from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out

This is the output:

[<document>"@username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING        </document>]

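The problem with this output is that it still contains the <document></document> tags, and it only covers the first element. How can I remove the <document></document> tags and get the text of every element of this XML in a list?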

Answer 1

Cha*_*ffy 5

There are several issues here. (Asking which library to choose is against the rules here, so I'm ignoring that part of the question.)

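1. You need to pass in a file handle, not a file name.

   That is: y = BeautifulSoup(open(x))

2. You need to tell BeautifulSoup that it is dealing with XML.

   That is: y = BeautifulSoup(open(x), 'xml')

3. CDATA sections do not create elements. You can't search for them in the DOM, because they don't exist in the DOM; they are just syntactic sugar. Simply look at the text directly under document, and don't try to search for something named CDATA.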

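To state it again, slightly differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What is different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However, you can't tell from the parsed object tree whether your document contained a CDATA section with literal < and >, or a plain text section with &lt; and &gt;. This is by design: to the parser, a CDATA section is just another way of writing character data.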
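The equivalence can be checked directly. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup (an assumption made so that it runs without extra packages), and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same character data:
# a CDATA section, and plain text with escaped entities.
with_cdata = ET.fromstring("<doc><![CDATA[<hello>]]></doc>")
with_entities = ET.fromstring("<doc>&lt;hello&gt;</doc>")

cdata_text = with_cdata.text
entity_text = with_entities.text

# Both elements carry the same text, and neither has any child element:
# the CDATA section did not create a node of its own.
same_text = cdata_text == entity_text
child_counts = (len(with_cdata), len(with_entities))

print(cdata_text, entity_text, same_text, child_counts)
# prints: <hello> <hello> True (0, 0)
```

Once parsed, nothing in either tree records which spelling the source document used.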

Now, for some code that actually works:

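import bs4

# The XML declaration must be the very first thing in the document,
# so the string starts right after the opening quotes.
doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That came at the wrong time ????" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""

# The 'xml' feature requires lxml to be installed.
doc_el = bs4.BeautifulSoup(doc, 'xml')
# The CDATA content is simply the text of each <document> element.
print([el.text for el in doc_el.find_all('document')])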

If you want to read from a file, replace doc with open(filename, 'r').
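For the file-based case, here is a minimal sketch that collects the text of every document element into a list. It is not the code from this answer: it swaps in the standard library's xml.etree.ElementTree so that it runs without BeautifulSoup or lxml, and the helper name extract_documents and the temporary file are hypothetical, used only to keep the example self-contained.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def extract_documents(path):
    """Return the text of every <document> element in the XML file at `path`."""
    tree = ET.parse(path)
    # CDATA content shows up as ordinary element text.
    return [el.text for el in tree.getroot().iter("document")]


# A shortened copy of the XML from the question, written to a temporary file.
sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>
</author>
"""

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as fh:
    fh.write(sample)
    tmp_path = fh.name

try:
    texts = extract_documents(tmp_path)
finally:
    os.remove(tmp_path)

print(texts)
```

The trailing whitespace inside each CDATA section is preserved; call strip() on each item if it is not wanted.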

Archived:	10 years, 6 months ago
Views:	6438
Last activity:	10 years, 6 months ago