sit*_*ara 5 python xml csv nested dataframe
我有一个像这样的 XML 文件:
<?xml version="1.0"?>
<PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
</Object_Arrt>
<Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
</PropertySet>
Run Code Online (Sandbox Code Playgroud)
我希望使用 Python 从中提取Name、Object_Num和Orig_Id标签Attr_Name并将它们转换为 .csv 格式。
我希望看到的 .csv 格式很简单:
<?xml version="1.0"?>
<PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
</Object_Arrt>
<Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
</PropertySet>
Run Code Online (Sandbox Code Playgroud)
其实xml标签中有这样的关系:
如果产品有属性,那么就有标签
<Object_Def Ancestor_Num="1023i".. >
等于标签中的
Ancestor_Num,Object_NumType="CLASS_DEF"..
我已经尝试过这个:
from lxml import etree
import pandas
import HTMLParser
inFile = "./newm.xml"
outFile = "./new.csv"
ctx1 = etree.iterparse(inFile, tag=("ImpExp", "ListOfObject_Def", "ListOfObject_Arrt",))
hp = HTMLParser.HTMLParser()
csvData = []
csvData1 = []
csvData2 = []
csvData3 = []
csvData4 = []
csvData5 = []
for event, elem in ctx1:
value1 = elem.get("Type")
value2 = elem.get("Name")
value3 = elem.get("Object_Num")
value4 = elem.get("Ancestor_Num")
value5 = elem.get("Orig_Id")
value6 = elem.get("Attr_Name")
if value1 == "PROD_DEF":
csvData.append(value2)
csvData1.append(value3)
for event, elem in ctx1:
if value4 is not None:
csvData2.append(value4)
elem.clear()
df = pandas.DataFrame({'Product':csvData, 'ProductId':csvData1, 'AncestorId':csvData2})
for event, elem in ctx1:
if value1 == "Class Def":
csvData3.append(value3)
csvData4.append(value5)
csvData5.append(value6)
elem.clear()
df1 = pandas.DataFrame({'AncestorId':csvData3, 'AttribId':csvData4, 'AttribName':csvData5})
dff = pandas.merge(df, df1, on="AncestorId")
dff.to_csv(outFile, index = False)
Run Code Online (Sandbox Code Playgroud)
考虑一下XSLT,它是一种专门用于转换 XML 文件的专用语言,可以直接将 XML 转换为 CSV(即文本文件),而无需 pandas 数据帧中介。Python 的第三方模块lxml(您已经在使用)可以运行 XSLT 1.0 脚本,并且不需要for循环或if逻辑。然而,由于产品和属性的复杂对齐,一些较长的 XPath 搜索与 XSLT 一起使用。
XSLT (另存为 .xsl 文件,一种特殊的 .xml 文件)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="no" method="text"/>
<xsl:strip-space elements="*"/>
<xsl:param name="delimiter">,</xsl:param>
<xsl:template match="/PropertySet">
<xsl:text>ProductId,Product,AttributeId,Attribute
</xsl:text>
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="PropertySet|Message|ListOf_Class_Def|ListOf_Prod_Def|ImpExp">
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="ListOfObject_Arrt">
<xsl:apply-templates select="Object_Arrt"/>
<xsl:if test="name(*) != 'Object_Arrt' and preceding-sibling::ListOfObject_Def/Object_Def/@Ancestor_Name = ''">
<xsl:value-of select="concat(ancestor::ImpExp/@Name, $delimiter,
ancestor::ImpExp/@Object_Num, $delimiter,
'', $delimiter,
'')"/><xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
<xsl:template match="Object_Arrt">
<xsl:variable name="attrName" select="ancestor::ImpExp/@Name"/>
<xsl:value-of select="concat(/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Name, $delimiter,
/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Object_Num, $delimiter,
@Orig_Id, $delimiter,
@Attr_Name)"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Run Code Online (Sandbox Code Playgroud)
Python
import lxml.etree as et
# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# RUN TRANSFORMATION
transform = et.XSLT(xsl)
result = transform(xml)
# OUTPUT TO FILE
with open('Output.csv', 'wb') as f:
f.write(result)
Run Code Online (Sandbox Code Playgroud)
输出
ProductId,Product,AttributeId,Attribute
Laptop,2008a,6666p,LP_Portable
Mouse,2987d,7010p,O_Portable
Mouse,2987d,7012j,O_wireless
Speaker,5463g,,
Run Code Online (Sandbox Code Playgroud)