How do I read an XML file from a URL in Python?

PAN*_*MAR 8 python xml-parsing

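I want to access the information stored in the child nodes, but the code below finds nothing. Is this because of the structure of the file?

I tried extracting just the author child-node information into a separate file and running the Python code on that. It worked fine.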

import urllib
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

print 'Retrieving', url

document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'

print document[:50]

tree = ET.fromstring(document)

lst = tree.findall('title')
print lst[:100]

man*_*l_b 5

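You cannot find the title element because of the namespace: the elements in this file belong to the urn:hl7-org:v3 namespace, so a bare 'title' does not match them.

The sample code below prints:

  • the title of the "document" tag
  • the titles of the inner "component" tags
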
    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)

    # Title of the root "document" element (direct children only)
    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    # Titles at any depth, including those inside "component" elements
    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)
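As an alternative to spelling out the `{urn:hl7-org:v3}` prefix in every path, `findall` also accepts a prefix-to-URI mapping. The following is a minimal sketch under assumptions: the inline `SAMPLE` document is a made-up stand-in for the real file (not its actual content), and the prefix `v3` is an arbitrary name chosen here.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; only the namespace URI mirrors the real file.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Document title</title>
  <component>
    <section>
      <title>Section title</title>
    </section>
  </component>
</document>"""

# The prefix "v3" is arbitrary; only the URI has to match the document.
NAMESPACES = {"v3": "urn:hl7-org:v3"}

root = ET.fromstring(SAMPLE)

# Unqualified name: matches nothing, because the real tag is {urn:hl7-org:v3}title.
unqualified = root.findall("title")

# Direct children of the root only.
doc_titles = [t.text for t in root.findall("v3:title", NAMESPACES)]

# Every title at any depth below the root, in document order.
all_titles = [t.text for t in root.findall(".//v3:title", NAMESPACES)]

print(unqualified)
print(doc_titles)
print(all_titles)
```

The mapping keeps the paths readable, and the same dictionary can be passed to `find` and `findtext` as well.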

Update

If you need to search for XML nodes, you should use XPath expressions.

Example:

NS = '{urn:hl7-org:v3}'
ID = '829076996'    # ID TO BE FOUND

# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
    ])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
    ])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)

This example prints the name of the author with ID 829076996.
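The lookup relies on two features of ElementTree's limited XPath subset: the attribute predicate `[@extension='...']` and the parent step `..`. A small sketch on a made-up document shows them in isolation. The organization names, the second ID, the `v3` prefix, and the helper `author_names_by_id` are all invented for illustration and are not part of the real file or of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; names and the second ID are invented for illustration.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="829076996"/>
        <name>Example Org A</name>
      </representedOrganization>
    </assignedEntity>
  </author>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="000000000"/>
        <name>Example Org B</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

NS = {"v3": "urn:hl7-org:v3"}
root = ET.fromstring(SAMPLE)


def author_names_by_id(root, org_id):
    """Return the organization names of all authors whose id matches org_id."""
    # Match the <id> by attribute, then climb three levels back up to <author>.
    path = (
        ".//v3:author/v3:assignedEntity/v3:representedOrganization/"
        f"v3:id[@extension='{org_id}']/../../.."
    )
    name_path = "v3:assignedEntity/v3:representedOrganization/v3:name"
    names = []
    for author in root.findall(path, NS):
        # findtext returns None instead of raising when the name is missing.
        names.append(author.findtext(name_path, namespaces=NS))
    return names


print(author_names_by_id(root, "829076996"))
```

An ID that matches no `<id>` element simply yields an empty list, so the caller does not need a separate existence check.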

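Update 2

You can easily process all assignedEntity tags with findall. Each of these tags can contain several performance tags with their own products, so a second findall call is needed (see the example below).
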
xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/",
    NS, "assignedOrganization/",
    NS, "assignedEntity"
    ])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
    ])


# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):

    # GET ID AND NAME OF assignedEntity
    id = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(id, '\t', name, '\t', actCode, '\t', prodCode)

This is the result:

829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-0050
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4900
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4910
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4940
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4960
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-0050
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4900
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4910
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4940
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4960
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4900
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4910
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4960
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4900
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4910
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4960
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-0050
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-4940
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4900
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4910
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4960
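One caveat with the nested loop above: `find` returns `None` when a path does not match, so a chained `.get(...)` or `.text` raises `AttributeError` on an entry that lacks one of the expected child nodes. The sketch below is one possible way to guard against that, not the answer's own method. The sample XML, its values, the `v3` prefix, and the helpers `attr_or_none` and `rows_for` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical assignedEntity; the second <performance> has no product on purpose.
SAMPLE = """<assignedEntity xmlns="urn:hl7-org:v3">
  <assignedOrganization>
    <id extension="111111111"/>
    <name>Example Org</name>
  </assignedOrganization>
  <performance>
    <actDefinition>
      <code displayName="PACK"/>
      <product>
        <manufacturedProduct>
          <manufacturedMaterialKind>
            <code code="0000-0001"/>
          </manufacturedMaterialKind>
        </manufacturedProduct>
      </product>
    </actDefinition>
  </performance>
  <performance>
    <actDefinition>
      <code displayName="ANALYSIS"/>
    </actDefinition>
  </performance>
</assignedEntity>"""

NS = {"v3": "urn:hl7-org:v3"}

PROD_CODE_PATH = (
    "v3:actDefinition/v3:product/v3:manufacturedProduct/"
    "v3:manufacturedMaterialKind/v3:code"
)


def attr_or_none(parent, path, attr):
    """Return an attribute of the first match, or None if nothing matches."""
    node = parent.find(path, NS)
    return node.get(attr) if node is not None else None


def rows_for(entity):
    """Build one (id, name, act code, product code) tuple per <performance>."""
    org_id = attr_or_none(entity, "v3:assignedOrganization/v3:id", "extension")
    name = entity.findtext("v3:assignedOrganization/v3:name", namespaces=NS)
    rows = []
    for performance in entity.findall("v3:performance", NS):
        act = attr_or_none(performance, "v3:actDefinition/v3:code", "displayName")
        prod = attr_or_none(performance, PROD_CODE_PATH, "code")
        rows.append((org_id, name, act, prod))
    return rows


entity = ET.fromstring(SAMPLE)
for row in rows_for(entity):
    print("\t".join(str(value) for value in row))
```

Missing values come back as `None` rather than stopping the loop, which makes incomplete entries visible in the output instead of hiding the rows that follow them.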