小编Hol*_*Sir的帖子

使用python和lxml忽略xml中的unicode?

我想忽略我的xml中的unicode。我愿意以某种方式在输出处理中进行更改。

我的python:

import urllib2, os, zipfile 
from lxml import etree

doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
target = doc.xpath('//references-cited/citation/nplcit/*/text()')
#target = '-'.join(target).replace('\n-','')
print "docID:    {0}\nCitation: {1}\n".format(docID,target) 
outFile.write(str(docID) +"|"+ str(target) +"\n")
Run Code Online (Sandbox Code Playgroud)

创建以下内容的输出:

docID:    US-D0607176-S1-20100105
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www
Run Code Online (Sandbox Code Playgroud)

但是,如果我尝试重新添加,则'-'join(target).replace('\n-','')对于print和都会出现此错误outFile.write

Traceback (most recent call last):
  File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module>
    print "docID:    {0}\nCitation: {1}\n".format(docID,target)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' …
Run Code Online (Sandbox Code Playgroud)

python xml unicode lxml python-unicode

1
推荐指数
1
解决办法
1835
查看次数

标签 统计

lxml ×1

python ×1

python-unicode ×1

unicode ×1

xml ×1