What encoding does Python's lxml module use internally?

Question

When I fetch a web page, I use UnicodeDammit to convert it to UTF-8, like this:

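import urllib2

import chardet
from lxml import html

# Fetch the raw bytes of the page (url is assumed to be defined elsewhere).
content = urllib2.urlopen(url).read()
# Guess the source encoding and re-encode as UTF-8 if it is something else.
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)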

But when I then run:

text = doc.text_content()
print type(text)

The output is <type 'lxml.etree._ElementUnicodeResult'>. Why? I expected it to be a UTF-8 string.

Answer 1

小智 7

lxml.etree._ElementUnicodeResult is a class that inherits from unicode, as pydoc shows:

$ pydoc lxml.etree._ElementUnicodeResult

lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
 |  Method resolution order:
 |      _ElementUnicodeResult
 |      __builtin__.unicode
 |      __builtin__.basestring
 |      __builtin__.object

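It is fairly common in Python for a class to extend a base type in order to add some module-specific functionality. It should be safe to treat the object as a regular Unicode string.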
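To illustrate the subclassing pattern described above, here is a minimal sketch. It is written for Python 3, where the text type is `str` rather than Python 2's `unicode`. The class name `AnnotatedText` and its `source` attribute are hypothetical and are not how lxml implements `_ElementUnicodeResult`; the point is only that a subclass of the text type still behaves like ordinary text, and that you get UTF-8 bytes by encoding it explicitly.

```python
# Hypothetical stand-in for a library-specific string subclass.
# This is NOT lxml's actual implementation, just the general pattern.
class AnnotatedText(str):
    """A text string that also remembers where it came from."""

    def __new__(cls, value, source=None):
        obj = super().__new__(cls, value)
        obj.source = source  # extra, module-specific attribute
        return obj


text = AnnotatedText("caf\u00e9", source="<p>")

# The subclass still passes isinstance checks and supports normal
# string operations, which return plain str objects.
is_plain_text = isinstance(text, str)
upper = text.upper()

# Text has no byte encoding until you ask for one explicitly.
utf8_bytes = text.encode("utf-8")
```

Under the same assumption, a value returned by `doc.text_content()` could be turned into a UTF-8 byte string with `.encode('utf-8')` if bytes are really what you need.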
