Posts by s h*_*ley

Editing HTML with Python, but lxml converts nice HTML entities to a strange encoding

I'm trying to use Python (with pyquery and lxml) to alter and clean up some HTML.

Eg. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"


The lxml.html.clean function clean_html() works well, except that it replaces nice HTML entities such as

&#146;


with a Unicode string such as

\xc2\x92


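The Unicode looks strange in different browsers (Firefox and Opera, with auto-detected encoding, UTF-8, Latin-1, etc.): it shows up as an empty box. How can I stop lxml from converting entities? How can I do everything in Latin-1 encoding? It seems odd that a module built specifically for HTML would do this.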
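For context, here is a small sketch of why the empty box probably appears. It assumes Python 3 and uses only the standard library; `describe_charref` is a hypothetical helper, not part of lxml. Read literally, `&#146;` is code point 146 (U+0092), a C1 control character with no glyph, and its UTF-8 encoding is exactly the `\xc2\x92` shown above. Word most likely meant the Windows-1252 byte 0x92, the right single quotation mark, which is also how the standard library's `html.unescape()` interprets that reference.

```python
import html


def describe_charref(number):
    """Show three readings of the numeric character reference &#<number>;."""
    literal = chr(number)  # the code point taken at face value
    return {
        # UTF-8 bytes of the literal code point
        "literal_utf8": literal.encode("utf-8"),
        # the character that byte stands for in Windows-1252
        "cp1252_char": bytes([number]).decode("cp1252"),
        # what the standard library's HTML5-style unescaping produces
        "html5_char": html.unescape("&#%d;" % number),
    }


info = describe_charref(146)
for key, value in info.items():
    print(key, ascii(value))
```

If this reading is right, the entity itself is the problem rather than the conversion: browsers special-case `&#146;` as a curly apostrophe, but the literal character U+0092 has nothing to display.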

I can't be sure which characters are in there, so I can't just use

replace("\xc2\x92","&#146;").


I have tried using

clean_html(html).encode('latin-1')


but the Unicode is still there.
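One possible way to get entities back without knowing in advance which characters are present is sketched below. It assumes Python 3 and that the cleaned markup is available as a text string; `repair_c1`, `to_entities`, and the sample `cleaned` value are hypothetical stand-ins, not lxml API. The idea is to map stray C1 control characters to their Windows-1252 meaning, then encode with the `xmlcharrefreplace` error handler, which turns every character the target encoding cannot represent into a numeric character reference.

```python
def repair_c1(text):
    """Map stray C1 controls (U+0080..U+009F) to their Windows-1252 meaning."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes in this range are undefined in cp1252
        out.append(ch)
    return "".join(out)


def to_entities(text, encoding="ascii"):
    """Encode text, writing unencodable characters as &#NNNN; references."""
    return text.encode(encoding, "xmlcharrefreplace")


# Hypothetical stand-in for the output of clean_html()
cleaned = "<p>It\x92s a spicy meatball!</p>"
print(to_entities(repair_c1(cleaned)))
# b'<p>It&#8217;s a spicy meatball!</p>'
```

Note that the result is `&#8217;`, the standard reference for the right single quotation mark, rather than the original `&#146;`. Passing `encoding="latin-1"` gives the same bytes for this sample, since U+2019 is not in Latin-1 either.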

And yes, I would tell people to stop using Word to write HTML, but then I'd hear the whole

"I like it this way and you can't make me change" speech.

EDIT: a Beautiful solution:

from BeautifulSoup import BeautifulSoup, Comment  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup(str(desc[desc_type]))
# Collect every comment node, then detach each one from the tree.
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
print soup
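For readers without BeautifulSoup 3, the same idea (drop the Word comments, leave entities untouched) can be sketched with the standard library alone. This is a different technique from the snippet above, it assumes Python 3, and `CommentStripper` and `strip_comments` are hypothetical names. It re-emits each parser event as text and skips comments; it does not sanitize or repair markup the way clean_html() does.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit markup unchanged, minus comments (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False reports &#146; and &amp; as separate events
        # instead of decoding them into characters.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # raw tag text as written

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)  # keep numeric references as-is

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)  # keep named entities as-is

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)

    def handle_comment(self, data):
        pass  # drop comments entirely


def strip_comments(markup):
    parser = CommentStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


html_in = ("<div><!-- word style><bleep><omgz 1,000 tags><-->"
           "<p>It&#146;s a spicy meatball!</div>")
print(strip_comments(html_in))
```

With the sample string from the question, this should print `<div><p>It&#146;s a spicy meatball!</div>`: the comment is gone and the entity is left as written.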


python lxml character-encoding html-parsing

s h*_*ley

2011 02-03

10
votes

2
answers

8982
views