PAN*_*MAR 8 python xml-parsing
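I want to access the information in the child nodes. Is this a problem with the structure of the file?
I tried extracting just the author child node from the file and running the Python code against that on its own, and it worked fine.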

    import urllib.request
    import xml.etree.ElementTree as ET

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    print('Retrieving', url)
    document = urllib.request.urlopen(url).read()
    print('Retrieved', len(document), 'characters.')
    print(document[:50])
    tree = ET.fromstring(document)
    # looking for the <title> elements
    lst = tree.findall('title')
    print(lst[:100])

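You cannot find the title element because of the namespace: the elements belong to the namespace urn:hl7-org:v3, so a bare 'title' does not match them.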
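
To illustrate the point, here is a minimal sketch that uses a small inline document instead of the real DailyMed file. The sample XML and the `hl7` prefix are made up for the example; only the namespace URI `urn:hl7-org:v3` is taken from the answer. ElementTree stores each tag as `{namespace}localname`, which is why the bare name matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real document; only the namespace URI is real.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Example label</title>
  <component><section><title>Section one</title></section></component>
</document>"""

root = ET.fromstring(SAMPLE)

# The default namespace becomes part of every tag name.
print(root.tag)                  # {urn:hl7-org:v3}document

# A bare name therefore matches nothing.
print(root.findall('title'))     # []

# Option 1: spell out the namespace in braces.
direct = root.findall('{urn:hl7-org:v3}title')

# Option 2: map a prefix of your choosing to the URI and pass the mapping in.
ns = {'hl7': 'urn:hl7-org:v3'}   # 'hl7' is an arbitrary label
nested = root.findall('.//hl7:title', ns)

print([t.text for t in direct])  # ['Example label']
print([t.text for t in nested])  # ['Example label', 'Section one']
```

The prefix map keeps long paths readable, while the brace form needs no extra argument; both resolve to the same qualified tag names.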

Here is some sample code:

    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)

    # Titles that are direct children of the root element
    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    # Titles at any depth in the document
    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)

Update

If you need to search for XML nodes, you should use XPath expressions (ElementTree supports a limited subset of XPath).

Example:

    NS = '{urn:hl7-org:v3}'
    ID = '829076996'  # ID TO BE FOUND

    # XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
    xPathAuthorById = ''.join([
        ".//",
        NS, "author/",
        NS, "assignedEntity/",
        NS, "representedOrganization/",
        NS, "id[@extension='", ID,
        "']/../../.."
        ])

    # XPATH TO FIND AUTHOR NAME ELEMENT
    xPathAuthorName = ''.join([
        "./",
        NS, "assignedEntity/",
        NS, "representedOrganization/",
        NS, "name"
        ])

    # FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
    for author in tree.findall(xPathAuthorById):
        name = author.find(xPathAuthorName)
        print(name.text)

This example prints the name of the author with ID 829076996.
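
The same lookup can also be written with a prefix map, which may read more easily than joining `NS` into every step. The sketch below runs against a small inline document. The sample XML, the organization names, the second ID, and the helper name `author_name_by_id` are all made up for illustration; only the namespace URI and the ID 829076996 come from the example above.

```python
import xml.etree.ElementTree as ET

NS_MAP = {'hl7': 'urn:hl7-org:v3'}  # 'hl7' is an arbitrary prefix

# Hypothetical sample data shaped like the author block discussed above.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="829076996"/>
        <name>Example Org A</name>
      </representedOrganization>
    </assignedEntity>
  </author>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="000000000"/>
        <name>Example Org B</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

ORG_PATH = 'hl7:assignedEntity/hl7:representedOrganization'

def author_name_by_id(root, org_id):
    """Return the organization name of the first author with this id, else None."""
    # Match <id> by its attribute, then climb three levels back up to <author>.
    path = ".//hl7:author/" + ORG_PATH + "/hl7:id[@extension='" + org_id + "']/../../.."
    author = root.find(path, NS_MAP)
    if author is None:
        return None
    name = author.find(ORG_PATH + '/hl7:name', NS_MAP)
    return name.text if name is not None else None

root = ET.fromstring(SAMPLE)
print(author_name_by_id(root, '829076996'))   # Example Org A
print(author_name_by_id(root, 'no-such-id'))  # None
```

Returning `None` for a missing ID avoids the `AttributeError` that `name.text` would raise if nothing matched.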

Update 2

You can easily process all the assignedEntity tags with findall.
Each tag can have multiple products, so a second findall is needed (see the example below).

    xPathAssignedEntities = ''.join([
        ".//",
        NS, "author/",
        NS, "assignedEntity/",
        NS, "representedOrganization/",
        NS, "assignedEntity/",
        NS, "assignedOrganization/",
        NS, "assignedEntity"
        ])

    xPathProdCode = ''.join([
        NS, "actDefinition/",
        NS, "product/",
        NS, "manufacturedProduct/",
        NS, "manufacturedMaterialKind/",
        NS, "code"
        ])


    # GET ALL assignedEntity TAGS
    for assignedEntity in tree.findall(xPathAssignedEntities):

        # GET ID AND NAME OF assignedEntity
        entityId = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
        name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

        # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
        for performance in assignedEntity.findall(NS + 'performance'):
            actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
            prodCode = performance.find(xPathProdCode).get('code')
            print(entityId, '\t', name, '\t', actCode, '\t', prodCode)

Here is the result:

    829084545  Pfizer Pharmaceuticals LLC        ANALYSIS         0049-0050
    829084545  Pfizer Pharmaceuticals LLC        ANALYSIS         0049-4900
    829084545  Pfizer Pharmaceuticals LLC        ANALYSIS         0049-4910
    829084545  Pfizer Pharmaceuticals LLC        ANALYSIS         0049-4940
    829084545  Pfizer Pharmaceuticals LLC        ANALYSIS         0049-4960
    829084545  Pfizer Pharmaceuticals LLC        API MANUFACTURE  0049-0050
    829084545  Pfizer Pharmaceuticals LLC        API MANUFACTURE  0049-4900
    829084545  Pfizer Pharmaceuticals LLC        API MANUFACTURE  0049-4910
    829084545  Pfizer Pharmaceuticals LLC        API MANUFACTURE  0049-4940
    829084545  Pfizer Pharmaceuticals LLC        API MANUFACTURE  0049-4960
    829084545  Pfizer Pharmaceuticals LLC        MANUFACTURE      0049-4900
    829084545  Pfizer Pharmaceuticals LLC        MANUFACTURE      0049-4910
    829084545  Pfizer Pharmaceuticals LLC        MANUFACTURE      0049-4960
    829084545  Pfizer Pharmaceuticals LLC        PACK             0049-4900
    829084545  Pfizer Pharmaceuticals LLC        PACK             0049-4910
    829084545  Pfizer Pharmaceuticals LLC        PACK             0049-4960
    618054084  Pharmacia and Upjohn Company LLC  ANALYSIS         0049-0050
    618054084  Pharmacia and Upjohn Company LLC  ANALYSIS         0049-4940
    829084552  Pfizer Pharmaceuticals LLC        PACK             0049-4900
    829084552  Pfizer Pharmaceuticals LLC        PACK             0049-4910
    829084552  Pfizer Pharmaceuticals LLC        PACK             0049-4960
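
One caveat with the nested loops above: `find` returns `None` when an element is missing, so chaining `.get(...)` or `.text` directly raises `AttributeError` on a document that lacks one of the nodes. A defensive variant might look like the sketch below. It runs against a small inline document; the sample XML, its values, and the helper names `qualify`, `attr_of` and `collect_rows` are assumptions for illustration, not part of the DailyMed feed or of ElementTree.

```python
import xml.etree.ElementTree as ET

NS = '{urn:hl7-org:v3}'

# Hypothetical sample: the second <performance> has no product on purpose.
SAMPLE = """<assignedEntity xmlns="urn:hl7-org:v3">
  <assignedOrganization>
    <id extension="123456789"/>
    <name>Example Org</name>
  </assignedOrganization>
  <performance>
    <actDefinition>
      <code displayName="PACK"/>
      <product><manufacturedProduct><manufacturedMaterialKind>
        <code code="0000-0001"/>
      </manufacturedMaterialKind></manufacturedProduct></product>
    </actDefinition>
  </performance>
  <performance>
    <actDefinition>
      <code displayName="ANALYSIS"/>
    </actDefinition>
  </performance>
</assignedEntity>"""

def qualify(*names):
    """Build a namespace-qualified child path from local tag names."""
    return '/'.join(NS + n for n in names)

def attr_of(parent, path, attr):
    """Return an attribute of the first element at path, or None if absent."""
    node = parent.find(path)
    return node.get(attr) if node is not None else None

PROD_PATH = qualify('actDefinition', 'product', 'manufacturedProduct',
                    'manufacturedMaterialKind', 'code')

def collect_rows(entity):
    """Yield one (id, name, act, product) tuple per <performance> element."""
    org_id = attr_of(entity, qualify('assignedOrganization', 'id'), 'extension')
    name = entity.findtext(qualify('assignedOrganization', 'name'))
    for perf in entity.findall(qualify('performance')):
        act = attr_of(perf, qualify('actDefinition', 'code'), 'displayName')
        prod = attr_of(perf, PROD_PATH, 'code')
        yield (org_id, name, act, prod)

entity = ET.fromstring(SAMPLE)
rows = list(collect_rows(entity))
for row in rows:
    print('\t'.join(str(value) for value in row))
```

Collecting tuples instead of printing inside the loop also makes it straightforward to pass the rows to `csv.writer` or a similar consumer later.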