I am using the XML SAX parser to parse an XML file. My code is below.
XML file:
<job>
<title>Registered Nurse-Epilepsy</title>
<job-code>881723</job-code>
<detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
</detail-url>
<job-category>Neuroscience Nursing</job-category>
<description>
<summary>
<div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
</summary>
</description>
<posted-date>2012-07-26</posted-date>
<location>
<address>7777 Forest Lane</address>
<city>Dallas</city>
<state>TX</state>
<zip>75230</zip>
<country>US</country>
</location>
<company>
<name>Medical City (Dallas, TX)</name>
<url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
</company>
</job>
Python code (partial; it goes only as far as the startElement function, which is where my doubt is):
from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser  # Python 2 module name; not used in this excerpt

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name, attrs
        self.clearFields()

    def endElement(self, name):
        pass

    def characters(self, data):
        self.buffer += data

    def clearFields(self):  # 'self' was missing in the original post
        self.fields = {}
        self.fields['title'] = None
        self.fields['job-code'] = None
        self.fields['detail-url'] = None
        self.fields['job-category'] = None
        self.fields['description'] = None
        self.fields['summary'] = None
        self.fields['posted-date'] = None
        self.fields['location'] = None
        self.fields['address'] = None
        self.fields['city'] = None
        self.fields['state'] = None
        self.fields['zip'] = None
        self.fields['country'] = None
        self.fields['company'] = None
        self.fields['name'] = None
        self.fields['url'] = None
        self.buffer = ''

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = Exact()
    parser.setContentHandler(handler)
    parser.parse(open('/path/to/xml_file.xml'))
Result: the print statement above produces the following output:
job <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
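As you can see above, the print statement gives me name and attrs, but what I actually want is the values of those tags. Right now I only get the node names, not their values. How do I read the values?
Edit:
I am really confused about how to map the data from the nodes to the keys of the dictionary shown above.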
To get the content of an element, you need to override the characters method. Add this to your handler class:
    def characters(self, data):
        print data
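But beware: the parser is not required to deliver all the data in one chunk. You should collect it in an internal buffer and read the buffer when you need it. In most of my xml/sax code I do something like this: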
class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)
...and then I call the flush method at the end of each element whose data I need.
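A minimal sketch of the buffer-and-flush idea described above, written for Python 3. The class name TitleHandler, its titles attribute, and the choice to split the input into two pieces are illustrative assumptions, not part of the original answer. Feeding the document in pieces only makes it possible for the text to arrive through more than one characters() call; the handler gives the same result either way.

```python
import xml.sax
import xml.sax.handler


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that collects the text of every <title> element."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self.titles = []

    def _flushCharBuffer(self):
        # Join everything collected so far and reset the buffer.
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def startElement(self, name, attrs):
        # Throw away text (usually whitespace) seen before this element opened.
        self._flushCharBuffer()

    def characters(self, data):
        # May be called several times for one text node, so only append here.
        self._charBuffer.append(data)

    def endElement(self, name):
        # Always flush, but keep the text only for the element we care about.
        text = self._flushCharBuffer()
        if name == 'title':
            self.titles.append(text.strip())


handler = TitleHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
# Feed the document in two pieces, splitting the title text in the middle.
for piece in (b'<job><title>Registered Nur', b'se-Epilepsy</title></job>'):
    parser.feed(piece)
parser.close()
print(handler.titles)
```

Because the text is only read in endElement, it does not matter how the parser chooses to split it up.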
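For your whole use case, assuming you have a file containing multiple job descriptions and want a list of jobs in which each job is a dictionary of its fields, do something like this: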
import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip() #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        # each <job> starts a new, empty dictionary
        if name == 'job': self._result.append({})

    def endElement(self, name):
        # store the collected text under the element name
        if name != 'job': self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml") #a list of all jobs
If you only need to parse one job at a time, you can simplify the list part and drop the startElement method: just make _result a dict and assign to it directly in endElement.
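The single-job simplification could look roughly like the following sketch, written for Python 3. The names SingleJobHandler, parseString, and SAMPLE are made up for illustration, and the sample document is a shortened version of the XML from the question. It assumes the input contains exactly one job element.

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that maps one <job> document to a flat dict."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self._result = {}  # a dict instead of a list of dicts

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        return data

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        # No startElement needed: every closing tag except </job> stores
        # the text collected since the previous closing tag.
        if name != 'job':
            self._result[name] = self._getCharacterData()

    def parseString(self, text):
        xml.sax.parseString(text, self)
        return self._result


# Shortened sample based on the XML in the question (assumption: one job).
SAMPLE = b"""<job>
<title>Registered Nurse-Epilepsy</title>
<job-code>881723</job-code>
<location>
<city>Dallas</city>
<state>TX</state>
</location>
</job>"""

job = SingleJobHandler().parseString(SAMPLE)
print(job['title'])
print(job['city'])
```

Note that the dictionary is flat: city and state become top-level keys, and a container element such as location is stored with an empty string because it has no text of its own.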
To get the text content of a node, you need to implement a characters method. For example:
class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name, attrs

    def endElement(self, name):
        print 'end ' + name

    def characters(self, content):
        print content
This outputs:
job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>
title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title
job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
end detail-url
(snip)