Trying to parse an XLS (XML) file with Python

tre*_*ron 1 python xml excel netsuite pandas

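I have an "XLS" file downloaded from the Netsuite ERP. The file extension says ".XLS", but it is actually an XML file. I have a pandas script that combines multiple XLS or XLSX files, but pandas does not seem to be able to handle this odd XLS/XML file type, so I have another script that tries to parse the XML data and save it to XLS or XLSX. However, the script below does not seem to work, as its result is "None". Can anyone point me in the right direction, whether with my sample code, new code, or a new approach to this odd XLS/XML parsing problem?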


Thanks in advance!


Sample XML:

<?xml version="1.0" encoding="utf-16"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
  <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
    <Author>NetSuite Reports</Author>
    <LastAuthor>NetSuite Reports</LastAuthor>
    <Company>NetSuite</Company>
  </DocumentProperties>
  <Styles>
    <Style ss:ID="company">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="12" ss:Bold="1" />
    </Style>
    <Style ss:ID="subcompany">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="14" ss:Bold="1" />
    </Style>
    <Style ss:ID="error">
      <Alignment ss:Horizontal="Center" />
      <Interior ss:Color="#f0d0d0" ss:Pattern="Solid" />
      <Font ss:Bold="1" />
    </Style>
    <Style ss:ID="header_l">
      <Alignment ss:Horizontal="Left" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="header_r">
      <Alignment ss:Horizontal="Right" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="header_c">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="scheckbox">
      <Alignment ss:Vertical="Center" ss:Horizontal="Center" />
    </Style>
    <Style ss:ID="Default" ss:Name="Normal">
      <Alignment ss:Vertical="Bottom" />
      <Borders />
      <Font ss:FontName="Arial" ss:Size="8" />
      <Interior />
      <NumberFormat />
      <Protection />
    </Style>
    <Style ss:ID="s53">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s52">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s51">
      <Alignment ss:Vertical="Center" ss:Horizontal="Right" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <NumberFormat ss:Format="&quot;€&quot;#,##0.00" />
      <Borders />
    </Style>
    <Style ss:ID="s50">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s58">
      <Alignment ss:Horizontal="Left" ss:Indent="2" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s54">
      <Alignment ss:Vertical="Center" ss:Horizontal="Right" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <NumberFormat ss:Format="&quot;€&quot;#,##0.00" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s59">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s56">
      <Alignment ss:Horizontal="Left" ss:Indent="2" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s57">
      <Alignment ss:Horizontal="Left" ss:Indent="3" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s55">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s60">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
  </Styles>
  <Worksheet ss:Name="TrialBalance">
    <Table>
      <Row>
        <Cell ss:StyleID="company" ss:MergeAcross="1">
          <Data ss:Type="String">Parent Company</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="company" ss:MergeAcross="1">
          <Data ss:Type="String">Company Holdings Inc. : Company A  B.V.</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String">Trial Balance</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String">End of Feb 2020</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="header_l">
          <Data ss:Type="String">Account</Data>
        </Cell>
        <Cell ss:StyleID="header_r" ss:MergeDown="0" ss:Index="2">
          <Data ss:Type="String">Total</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="s50">
          <Data ss:Type="String">10000 - CASH &amp; CASH EQUIVALENTS</Data>
        </Cell>
        <Cell ss:StyleID="s51" />
      </Row>
      <Row>
        <Cell ss:StyleID="s52">
          <Data ss:Type="String">10101 - Bank - 9999 - Company A - EUR</Data>
        </Cell>
        <Cell ss:StyleID="s51">
          <Data ss:Type="Number">1234567.01</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="s53">
          <Data ss:Type="String">Total - 10000 - CASH &amp; CASH EQUIVALENTS</Data>
        </Cell>
        <Cell ss:Formula="SUM(R[-1]C)" ss:StyleID="s54">
          <Data ss:Type="Number">1234567.01</Data>
        </Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>

Python code to parse the XML into XLS:

import pandas as pd
import xml.etree.cElementTree as ET

tree = ET.parse(r"C:\Users\NAME\Documents\rootfolder\examplefile.xls")
root = tree.getroot()

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None


def main():
    """ main """
    parsed_xml = tree
    dfcols = ['account', 'total']
    df_xml = pd.DataFrame(columns=dfcols)


for node in parsed_xml.getroot():
    account = node.attrib.get('Type="String"')
    total = node.find('Type="Number"')

    df_xml = df_xml.append(
        pd.Series([account, getvalueofnode(total)], index=dfcols),
        ignore_index=True)

print(df_xml)


main()

Result of parsing the XML file with Python:

  account total
0    None  None

Par*_*ait 7

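Avoid building a data frame by appending objects such as Series or even DataFrames. Instead, build a list of dictionaries and pass it to the DataFrame constructor. Also, because your XML has a default namespace, you must assign a prefix in order to parse any element under that namespace: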

import pandas as pd
import xml.etree.ElementTree as ET   # cElementTree was removed in Python 3.9

# Prefix map for the workbook's default namespace
ns = {"doc": "urn:schemas-microsoft-com:office:spreadsheet"}

tree = ET.parse(r"C:\Path\To\Input.xml")
root = tree.getroot()

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None


def main():
    """ main """
    data = []
    for i, node in enumerate(root.findall('.//doc:Row', ns)):
        # Skip the six title rows and the Account/Total header row
        if i > 6:
            data.append({'account': getvalueofnode(node.find('doc:Cell[1]/doc:Data', ns)),
                         'total': getvalueofnode(node.find('doc:Cell[2]/doc:Data', ns))})

    return pd.DataFrame(data)

output_df = main()

print(output_df)
#                                    account       total
# 0          10000 - CASH & CASH EQUIVALENTS        None
# 1    10101 - Bank - 9999 - Company A - EUR  1234567.01
# 2  Total - 10000 - CASH & CASH EQUIVALENTS  1234567.01
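To see the namespace point in isolation, here is a minimal sketch using only the standard library. The `SAMPLE` string is a trimmed-down, hypothetical stand-in for the NetSuite export (same default namespace, one data row), and the `doc` prefix is an arbitrary name of my choosing. It shows that an unprefixed search finds nothing, while the prefixed search finds the row and yields a list of dictionaries that could be handed to `pd.DataFrame`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample with the same default namespace as the export
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="TrialBalance">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">10101 - Bank</Data></Cell>
        <Cell><Data ss:Type="Number">1234567.01</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

root = ET.fromstring(SAMPLE)

# "doc" is an arbitrary prefix; only the URI has to match the document
ns = {"doc": "urn:schemas-microsoft-com:office:spreadsheet"}

# Without a prefix, the search looks for un-namespaced <Row> elements and finds none
unqualified = root.findall(".//Row")

# With the prefix map, the same search matches the namespaced elements
qualified = root.findall(".//doc:Row", ns)

# One dictionary per row, ready to pass to pd.DataFrame(records)
records = [
    {
        "account": row.findtext("doc:Cell[1]/doc:Data", namespaces=ns),
        "total": row.findtext("doc:Cell[2]/doc:Data", namespaces=ns),
    }
    for row in qualified
]

print(len(unqualified), len(qualified))  # 0 1
print(records)  # [{'account': '10101 - Bank', 'total': '1234567.01'}]
```

The same applies to every step of a path: each element name in `doc:Cell[1]/doc:Data` needs the prefix, not just the first one.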

Alternatively, use the Workbook.SaveAs method via win32com (Windows users only) to save the Excel-style XML as xlsx, then read it in with pandas.read_excel, skipping the appropriate rows:

import win32com.client
import pandas as pd

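# SAVE EXCEL FILE
xlApp = None; xlWbk = None
try:
    xlApp = win32com.client.Dispatch("Excel.Application")
    xlWbk = xlApp.Workbooks.Open(r"C:\Path\To\Input.xml")
    # 51 = xlOpenXMLWorkbook, i.e. the .xlsx file format
    xlWbk.SaveAs(r"C:\Path\To\Output.xlsx", 51)

except Exception as e:
    print(e)

finally:
    # Close the workbook and quit Excel even if an error occurred above
    if xlWbk is not None:
        xlWbk.Close(True)
    if xlApp is not None:
        xlApp.Quit()
    xlWbk = None; xlApp = None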

# READ EXCEL FILE
output_df = pd.read_excel(r"C:\Path\To\Output.xlsx", skiprows = 6)

print(output_df)    
#                                    Account       Total
# 0          10000 - CASH & CASH EQUIVALENTS         NaN
# 1    10101 - Bank - 9999 - Company A - EUR  1234567.01
# 2  Total - 10000 - CASH & CASH EQUIVALENTS  1234567.01
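One difference between the two outputs: the ElementTree route returns cell text as strings (or None), whereas read_excel gives a numeric Total column. If you stay with pure Python, the `ss:Type` attribute can be used to convert values. The sketch below is only an illustration under assumptions: `SAMPLE` is a hypothetical, trimmed-down stand-in for the export and `cell_value` is a helper name I made up. Note that `attrib.get` takes an attribute name rather than a `name="value"` expression, and that ElementTree keys namespaced attributes as `{namespace-uri}name`.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
ns = {"doc": SS}

# Hypothetical, trimmed-down sample: one row with an empty total, one with a number
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="TrialBalance"><Table>
    <Row>
      <Cell><Data ss:Type="String">10000 - CASH</Data></Cell>
      <Cell />
    </Row>
    <Row>
      <Cell><Data ss:Type="String">10101 - Bank</Data></Cell>
      <Cell><Data ss:Type="Number">1234567.01</Data></Cell>
    </Row>
  </Table></Worksheet>
</Workbook>"""


def cell_value(data):
    """Return a float for Number cells, the raw text otherwise, None if missing."""
    if data is None or data.text is None:
        return None
    # Namespaced attributes are keyed as "{namespace-uri}name" in ElementTree
    if data.get(f"{{{SS}}}Type") == "Number":
        return float(data.text)
    return data.text


root = ET.fromstring(SAMPLE)
rows = [
    {
        "account": cell_value(row.find("doc:Cell[1]/doc:Data", ns)),
        "total": cell_value(row.find("doc:Cell[2]/doc:Data", ns)),
    }
    for row in root.findall(".//doc:Row", ns)
]

print(rows)
# [{'account': '10000 - CASH', 'total': None},
#  {'account': '10101 - Bank', 'total': 1234567.01}]
```

Passing `rows` to `pd.DataFrame` would then give a float column for the totals, with the missing value shown as NaN.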