I want to parse the data in the Component elements of an XML file:

    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
      <UnderlyingSymbol>特锐德</UnderlyingSymbol>
      <ComponentShare>300.00</ComponentShare>
      <SubstituteFlag>1</SubstituteFlag>
      <PremiumRatio>0.25000</PremiumRatio>
      <CreationCashSubstitute>0.0000</CreationCashSubstitute>
      <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
      <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
      <ComponentShare>600.00</ComponentShare>
      <SubstituteFlag>1</SubstituteFlag>
      <PremiumRatio>0.25000</PremiumRatio>
      <CreationCashSubstitute>0.0000</CreationCashSubstitute>
      <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
    </Component>

I have installed the latest versions of lxml and pandas and tried the following code, without success.

    Python 3.9.4 (tags/v3.9.4:1f2e308, Apr 6 2021, 13:40:21) [MSC v.1928 64 bit (AMD64)]
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import pandas as pd

    In [2]: pd.__version__
    Out[2]: '1.3.0'

    In [3]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-3-67d228028cc9> in <module>
    ----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')

    ...
        501         if elems == []:
    --> 502             raise ValueError(msg)
        503
        504         if elems != [] and attrs == [] and children == []:

    ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

    In [4]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-4-52fbe542dadb> in <module>
    ----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})

    ...
        501         if elems == []:
    --> 502             raise ValueError(msg)
        503
        504         if elems != [] and attrs == [] and children == []:

    ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

I also tried lxml directly, and that seems to work:

    In [5]: from lxml import etree
    In [6]: import requests
    In [7]: content = requests.get('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml').content

    In [8]: html = etree.HTML(content)
    In [9]: html.xpath('//component')
    Out[9]:
    [<Element component at 0x1d493cb23c0>,
     <Element component at 0x1d493cb2340>,
     <Element component at 0x1d493cb2240>,
     <Element component at 0x1d493cb22c0>,
     <Element component at 0x1d493cb2140>,
     <Element component at 0x1d493cb2040>,
     <Element component at 0x1d493cb2c40>,
     <Element component at 0x1d493cb61c0>,
     <Element component at 0x1d493cb63c0>,
     <Element component at 0x1d493cb2200>,
     ...

I don't know why read_xml doesn't work. Any help would be appreciated!
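In short, the solution here is to identify the node you want, which in this case is Component (element names in XPath are case-sensitive), and set xpath to that name prefixed with //, as follows: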
pd.read_xml(your_xml_file, xpath='//Component')
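
To illustrate why the casing matters, here is a minimal sketch that runs against a small inline document instead of the real download. The SAMPLE string and its Fund/Components wrapper elements are assumptions made for illustration (the real file's root layout is not shown in the question), and the sketch assumes pandas 1.3 or later with lxml installed. It also shows the likely reason etree.HTML appeared to work: lxml's HTML parser lowercases tag names, while an XML parser keeps them as written.

```python
from io import StringIO

import pandas as pd
from lxml import etree

# Hypothetical stand-in for the downloaded file; the <Fund> and
# <Components> wrappers are assumptions, not taken from the real document.
SAMPLE = """<Fund>
  <Components>
    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <ComponentShare>300.00</ComponentShare>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <ComponentShare>600.00</ComponentShare>
    </Component>
  </Components>
</Fund>"""

# Matching the element name exactly as written in the XML returns one row
# per <Component>, with its child elements as columns.
df = pd.read_xml(StringIO(SAMPLE), xpath="//Component")
print(df)

# The lowercase name matches nothing, so pandas raises ValueError.
lowercase_failed = False
try:
    pd.read_xml(StringIO(SAMPLE), xpath="//component")
except ValueError as exc:
    lowercase_failed = True
    print("lowercase xpath failed:", exc)

# lxml's HTML parser lowercases tag names, so '//component' matches there,
# whereas the XML parser preserves the original casing.
html_root = etree.HTML(SAMPLE)
xml_root = etree.fromstring(SAMPLE)
html_lower_hits = len(html_root.xpath("//component"))
xml_lower_hits = len(xml_root.xpath("//component"))
xml_exact_hits = len(xml_root.xpath("//Component"))
print(html_lower_hits, xml_lower_hits, xml_exact_hits)
```

If the real file declares a default namespace through xmlns, the unprefixed name may still match nothing; in that case, as the error message suggests, a prefix has to be mapped in namespaces and used in the xpath itself (for example a hypothetical '//com:Component').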