Handling `&#xD;&#xA;` in Python

Question

all*_*ata 5 python xml encoding beautifulsoup

Background:

I have an XML file that I am loading into BeautifulSoup and parsing. One node looks like this:

<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>

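Note that the value contains `&#xD;` and `&#xA;` within the text.
I understand these are the XML representations of a carriage return and a line feed.

When I load the file into BeautifulSoup, the value is converted to the following: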

<DIAttribute name="ObjectDesc" value="Line1
Line2
Line3"/>

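You will notice that each `&#xD;&#xA;` has been converted to a newline.

My use case requires the value to stay exactly as it was in the original. Any idea how to make it stay that way? Or how to convert it back?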
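For context, expanding character references is standard XML parser behaviour rather than something specific to BeautifulSoup. The following is a minimal sketch in Python 3 that uses the standard library's `xml.etree.ElementTree` (not the lxml-backed parser from the question, so this is an assumption about equivalent behaviour) to show that the parsed attribute holds real carriage-return and line-feed characters:

```python
import xml.etree.ElementTree as ET

# Same node as in the question, with the character references written literally.
xml_text = '<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>'

node = ET.fromstring(xml_text)
value = node.get("value")

# repr() makes the decoded control characters visible.
print(repr(value))
```

Once the document is parsed, the original spelling of the references is gone, so keeping the "original" value in practice means re-escaping the characters when writing the value back out.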

Source code:

Python: (2.7.11)

from bs4 import BeautifulSoup #version 4.4.0
s = BeautifulSoup(open('test.xml'),'lxml-xml',from_encoding="ansi")
print s.DIAttribute

#XML file looks like 
'''
<?xml version="1.0" encoding="UTF-8" ?>
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
'''

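Notepad++ reports that the encoding of the source XML file is ANSI.

Things I have tried:

I have gone through the documentation without success.

Variations on line 3: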

print s.DIAttribute.prettify('ascii')
print s.DIAttribute.prettify('windows-1252')
print s.DIAttribute.prettify('ansi')
print s.DIAttribute.prettify('utf-8')
print s.DIAttribute['value'].replace('\r','&#xD;').replace('\n','&#xA;')  # This works, but it feels like a band-aid, and other problems will likely remain.

Does anyone have any ideas? I appreciate any comments or suggestions.
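The `replace` workaround from the last variation can be made slightly more robust by escaping the XML-special characters first and only then restoring the line-break references. This is a rough sketch in Python 3, assuming `xml.sax.saxutils.escape` from the standard library is acceptable; the helper name `escape_attribute_value` and the mapping `LINE_BREAK_REFS` are hypothetical and not part of BeautifulSoup:

```python
from xml.sax.saxutils import escape

# Hypothetical mapping: decoded control characters back to the hexadecimal
# character references used in the source file, plus the double quote.
LINE_BREAK_REFS = {"\r": "&#xD;", "\n": "&#xA;", '"': "&quot;"}


def escape_attribute_value(value):
    """Escape &, < and > first, then re-encode CR, LF and double quotes."""
    return escape(value, LINE_BREAK_REFS)


decoded = "Line1\r\nLine2\r\nLine3"
restored = escape_attribute_value(decoded)
print(restored)
```

Escaping `&` before the line breaks matters: doing it the other way round would turn the freshly inserted `&#xD;` into `&amp;#xD;`.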

Answer 1

ccp*_*zza 1

For the record, first the libraries that do not handle the entities properly:
BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES), lxml.html.soupparser.unescape, xml.sax.saxutils.unescape

And this is what works (in Python 2.x):

import sys
import HTMLParser

## accept file name as argument, or read stdin if nothing passed
data = len(sys.argv) > 1 and open(sys.argv[1]).read() or sys.stdin.read()

parser = HTMLParser.HTMLParser()
print parser.unescape(data)

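The snippet above is Python 2 only: the `HTMLParser` module was renamed in Python 3, and the `unescape` method was later removed from the parser class. A minimal sketch of the Python 3 counterpart, assuming the standard library's `html.unescape` is an acceptable substitute (the sample string is taken from the question, not from the answer):

```python
import html

# Attribute text as it appears in the raw XML file.
escaped = "Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"

# html.unescape decodes numeric and named character references.
unescaped = html.unescape(escaped)
print(repr(unescaped))
```

Note that this decodes the references into real carriage-return and line-feed characters; it does not preserve the `&#xD;&#xA;` spelling that the question asks to keep.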

Archived:	9 years, 9 months ago
Views:	1253
Last activity:	9 years, 5 months ago