lxml.etree.XML: ValueError for Unicode strings

Pap*_*nho 12 python unicode python-3.x python-3.4

I am transforming an XML document with XSLT. Under Python 3 I get the following error, but under Python 2 I get no error at all.

-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-

import os

from lxml import etree

def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.
    It removes all the attributes and assigns them as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    # Text mode: read() returns a str that was decoded with the default encoding
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    # On Python 3 this line raises the ValueError shown in the traceback
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()

bob*_*nce 18

data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()

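This implicitly decodes the bytes in the file to Unicode text, using the default encoding. (If the XML file is not in that encoding, this may give wrong results.)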
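To make the implicit decoding concrete, here is a minimal sketch that uses only the standard library. The file name example.xslt and its Latin-1 content are made up for illustration, and UTF-8 is passed explicitly to stand in for a platform default that does not match the file.

```python
import locale
import os
import tempfile

# Hypothetical stylesheet bytes: Latin-1 encoded, with a matching declaration.
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><root>pêches</root>'.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.xslt")
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: the bytes come back exactly as they are stored on disk.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode: the bytes are decoded before the XML parser ever sees them.
    # errors="replace" is used here only so the mismatch is visible
    # instead of raising UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="replace") as f:
        as_text = f.read()

# The encoding that open() falls back to when none is given.
print(locale.getpreferredencoding(False))
print(type(as_bytes).__name__, as_bytes == raw)
print(type(as_text).__name__, "pêches" in as_text)
```

In binary mode nothing is lost, so the parser can still apply the encoding named in the declaration. In text mode with the wrong encoding the accented character is already damaged.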

xslt_root = etree.XML(xslt_content)

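XML has its own encoding handling and signalling: the <?xml encoding="..."?> prolog. If you pass the parser a Unicode string that starts with <?xml encoding="..."?>, the parser would like to reinterpret the rest of the byte string using that encoding... but it can't, because you have already decoded the byte input to a Unicode string.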
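The behaviour can be reproduced in a few lines. This is a sketch that assumes an installed lxml which behaves as in the traceback above. The sample documents are invented for illustration.

```python
from lxml import etree

declared = '<?xml version="1.0" encoding="UTF-8"?><root>text</root>'

# A str that carries an encoding declaration is rejected.
try:
    etree.XML(declared)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# A str fragment without a declaration is accepted, as the error message says.
fragment = etree.XML('<root>text</root>')
print(fragment.tag, fragment.text)
```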

Instead, you should pass the undecoded byte string to the parser:

data = open(module_path+'/data/ex-fire.xslt', 'rb')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)

Or, better, have the parser read directly from the file:

xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')


Lok*_*oki 10

I got it working by simply re-encoding with the default options:

xslt_content = data.read().encode()
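One caveat with this workaround: in Python 3, encode() with no argument produces UTF-8, so the bytes match the declaration only if the stylesheet declares UTF-8 or a compatible encoding. The short sketch below uses an invented Latin-1 declaration to show the mismatch.

```python
# Hypothetical stylesheet text whose declaration names a non-UTF-8 encoding.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><root>pêches</root>'

default_bytes = text.encode()               # no argument means UTF-8 in Python 3
declared_bytes = text.encode("iso-8859-1")  # what the declaration promises

print(default_bytes == text.encode("utf-8"))
print(default_bytes == declared_bytes)
# "ê" takes two bytes in UTF-8 but one in Latin-1.
print(len(default_bytes) - len(declared_bytes))
```

With such a file, the parser would be told to read UTF-8 bytes as Latin-1. Opening the file in binary mode avoids the round trip entirely.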


Anonymous 7

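You can also decode the UTF-8 string and encode it as ASCII before passing it to etree.XML:

# Assumes read() returns bytes (Python 2, or a file opened with 'rb');
# a Python 3 str has no decode() method.
xslt_content = data.read()
xslt_content = xslt_content.decode('utf-8').encode('ascii')
xslt_root = etree.XML(xslt_content)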

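  • Why would you encode it as ASCII when the initial declaration indicates it may be UTF-8? (4 upvotes)
  • I got it working by simply re-encoding with the default options: xslt_content = data.read().encode() (2 upvotes)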
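The concern about ASCII raised in the comments can be checked directly. In this sketch the two sample documents are invented. Encoding to ASCII is harmless when every character is ASCII, and fails as soon as one is not.

```python
ascii_only = '<?xml version="1.0" encoding="UTF-8"?><root>peches</root>'
with_accent = '<?xml version="1.0" encoding="UTF-8"?><root>pêches</root>'

# Pure ASCII text gives the same bytes in ASCII and in UTF-8.
ascii_bytes = ascii_only.encode("ascii")
print(ascii_bytes == ascii_only.encode("utf-8"))

# Any non-ASCII character makes the ASCII encoding step fail.
try:
    with_accent.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)
```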