And*_*rej 10 python xml xpath lxml
Suppose our XML file is structured as follows:
<?xml version="1.0" ?>
<searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
  <records xmlns:ns1="http://www.loc.gov/zing/srw/">
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>
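I need to parse:
I would like to know how to do this with lxml and XPath. My initial code is pasted below; I would appreciate it if someone could explain how to parse out the values.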
import urllib, urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
ns = {'xsi':'http://www.loc.gov/zing/srw/'}
for record in doc.xpath('//xsi:record', namespaces=ns):
    print record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns)
Zac*_*ung 17
I would be more direct in the XPath: go straight for the element you want, which in this case is datafield.
>>> for df in doc.xpath('//datafield'):
...     # Iterate over the attributes of datafield
...     for attrib_name in df.attrib:
...         print '@' + attrib_name + '=' + df.attrib[attrib_name]
...     # subfield elements are the children of datafield; iterate over them
...     # (iterating the element directly replaces the deprecated getchildren())
...     for subfield in df:
...         print 'subfield=' + subfield.text
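Also, lxml seems to let you ignore the namespace here. That is most likely not because the example uses only one namespace, but because the inner <record xmlns=""> resets the default namespace, which leaves datafield and subfield in no namespace at all.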
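To make the namespace point concrete, here is a minimal sketch that prints the fully qualified tag names of the outer and inner record elements. It is written for Python 3 (unlike the Python 2 snippets in this thread), uses a trimmed inline copy of the question's XML instead of the Dropbox URL, and falls back to the standard library's ElementTree if lxml is not installed. The prefix name srw is an arbitrary choice for illustration; any prefix mapped to the right URI would do.

```python
# Prefer lxml; fall back to the standard library so the sketch runs anywhere.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

SRW = "http://www.loc.gov/zing/srw/"

# Trimmed copy of the question's document (one record, one datafield).
XML = """\
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>
"""

root = etree.fromstring(XML)

# Namespaced tags are stored as "{namespace-uri}localname".
outer_record = root.find(".//{%s}record" % SRW)
# The inner record sits below xmlns="", so its tag carries no namespace.
inner_record = outer_record.find("{%s}recordData/record" % SRW)

print(outer_record.tag)  # {http://www.loc.gov/zing/srw/}record
print(inner_record.tag)  # record

# The same lookup with a prefix map: only recordData needs the prefix.
ns = {"srw": SRW}
datafields = root.findall(
    ".//srw:recordData/record/datafield[@tag='000']", namespaces=ns
)
for df in datafields:
    print(df.get("tag"), [sf.text for sf in df])  # 000 ['123', '456']
```

This is why the question's own expression works even though its prefix is called xsi: the prefix is just a local label for the srw URI, and only the outer elements need it.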
Try the following working code:
import urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//datafield'):
    print record.xpath("./@tag")[0]
    for x in record.xpath("./subfield/text()"):
        print "\t", x
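If the values are needed for further processing rather than printing, the same relative lookups can fill a dictionary. The following is a minimal Python 3 sketch under a few assumptions: datafields_to_dict is a hypothetical helper name, the XML is an inline copy of one inner record from the question, and each tag and subfield code is assumed to occur only once per record (a repeated one would overwrite the earlier value).

```python
# Prefer lxml; fall back to the standard library so the sketch runs anywhere.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Inline copy of one inner <record> from the question.
XML = """\
<record>
  <datafield tag="000">
    <subfield code="a">123</subfield>
    <subfield code="b">456</subfield>
  </datafield>
  <datafield tag="001">
    <subfield code="a">789</subfield>
    <subfield code="b">987</subfield>
  </datafield>
</record>
"""


def datafields_to_dict(record):
    """Map each datafield tag to a {subfield code: text} dictionary.

    Hypothetical helper for illustration; assumes unique tags and codes.
    """
    result = {}
    for df in record.findall("./datafield"):
        result[df.get("tag")] = {
            sf.get("code"): sf.text for sf in df.findall("./subfield")
        }
    return result


record = etree.fromstring(XML)
fields = datafields_to_dict(record)
print(fields["000"]["a"])  # 123
print(fields["001"]["b"])  # 987
```

With lxml, the findall calls could equally be written as xpath calls with the same relative paths.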
I would simply go with:
for df in doc.xpath('//datafield'):
    print df.attrib
    # Iterate over the element directly; getchildren() is deprecated
    for sf in df:
        print sf.text
Also, you don't need urllib; lxml can parse the XML directly from an HTTP URL:
url = "http://dl.dropbox.com/u/540963/short_test.xml" #doesn't work with https though
doc = etree.parse(url)
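For https URLs, one workaround is the pattern the question already uses: fetch the bytes with any HTTP client and hand parse() a file-like object. The sketch below, written for Python 3, stands in an in-memory buffer for the downloaded response so that it runs without network access; the data bytes are a made-up fragment in the shape of the question's records.

```python
import io

# Prefer lxml; fall back to the standard library so the sketch runs anywhere.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Stand-in for bytes downloaded over https (illustrative fragment only).
data = (
    b'<?xml version="1.0" ?>'
    b'<record><datafield tag="000">'
    b'<subfield code="a">123</subfield>'
    b'</datafield></record>'
)

# parse() accepts a file name or an open file-like object
# and returns an ElementTree.
doc = etree.parse(io.BytesIO(data))
root = doc.getroot()

tags = [df.get("tag") for df in root.iter("datafield")]
print(tags)  # ['000']
```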