Reading CDATA from an XML file with BeautifulSoup

Vla*_*gas 3 xml beautifulsoup python-3.x

I have tweets saved in an XML file:

    <tweet>
      <tweetid>142389495503925248</tweetid>
      <user>ccifuentes</user>
      <content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
      <date>2011-12-02T00:47:55</date>
      <lang>es</lang>
      <sentiments>
       <polarity><value>NONE</value><type>AGREEMENT</type></polarity>
      </sentiments>
      <topics>
       <topic>otros</topic>
      </topics>
    </tweet>

To parse them, I created a BeautifulSoup instance:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(xml, "lxml")

where xml is the raw content of the XML file. To access a single tweet, I did this:

    tweets = soup.find_all('tweet')
    for tw in tweets:
        print(tw)
        break

This results in:

    <tweet>
    <tweetid>142389495503925248</tweetid>
    <user>ccifuentes</user>
    <content></content>
    <date>2011-12-02T00:47:55</date>
    <lang>es</lang>
    <sentiments>
    <polarity><value>NONE</value><type>AGREEMENT</type></polarity>
    </sentiments>
    <topics>
    <topic>otros</topic>
    </topics>
    </tweet>

Note that when I print the first tweet, the CDATA section is missing. Getting it is important to me. How can I do that?


宏杰李*_*宏杰李 5

    import bs4
    soup = bs4.BeautifulSoup(xml, 'xml')

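Change the parser to xml. The "lxml" parser used in the question treats the input as HTML and drops the CDATA section, whereas the "xml" parser keeps its text as the content of the element.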

Output:

    <content>Salgo de #VeoTV , que día más largoooooo...</content>
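To show how the CDATA text can then be pulled out of every tweet, here is a minimal sketch using the "xml" parser. It assumes beautifulsoup4 and lxml are installed. The sample is a trimmed copy of the tweet from the question, and the `<tweets>` root element and the `contents` dictionary are illustrative assumptions, not part of the original file.

```python
from bs4 import BeautifulSoup  # third-party: beautifulsoup4 (the "xml" parser also needs lxml)

# Hypothetical sample: a trimmed copy of the tweet from the question,
# wrapped in an assumed <tweets> root element.
xml = """<tweets>
 <tweet>
  <tweetid>142389495503925248</tweetid>
  <user>ccifuentes</user>
  <content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
  <lang>es</lang>
 </tweet>
</tweets>"""

soup = BeautifulSoup(xml, "xml")

# Map each tweet id to the text that was inside its CDATA section.
contents = {
    tw.find("tweetid").get_text(): tw.find("content").get_text()
    for tw in soup.find_all("tweet")
}

for tweet_id, text in contents.items():
    print(tweet_id, text)
```

With this parser the CDATA wrapper itself is gone, and only its text remains as ordinary element content.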

Or use html.parser:

    soup = bs4.BeautifulSoup(xml, 'html.parser')

Output:

    <content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
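If the CDATA wrapper needs to be recognized rather than just printed, html.parser should expose it as a `bs4.element.CData` node. The following is a sketch under that assumption, using a trimmed sample of the tweet from the question. The variable names `node`, `is_cdata`, and `text` are illustrative.

```python
from bs4 import BeautifulSoup  # third-party: beautifulsoup4
from bs4.element import CData

# Hypothetical sample: a trimmed copy of the tweet from the question.
xml = """<tweet>
  <tweetid>142389495503925248</tweetid>
  <content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
</tweet>"""

soup = BeautifulSoup(xml, "html.parser")

# <content> has a single child, so .string returns that child node.
node = soup.find("content").string

is_cdata = isinstance(node, CData)  # True when the parser kept the CDATA section
text = str(node)                    # the text without the <![CDATA[ ... ]]> wrapper

print(is_cdata, text)
```

The node prints with its `<![CDATA[ ... ]]>` wrapper when the tree is serialized, while `str(node)` gives only the inner text.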