Mixed-content XML parsing with a DataFrame

Eri*_*mas 5 scala dataframe xml-parsing apache-spark

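I have a mixed-content XML document that I am parsing into a DataFrame with a custom schema. The problem is that the schema only picks up the text of the "Measure" element.

The XML looks like this: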

<QData>
    <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
    </Measure>
    <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
    </Measure>
</QData>

My schema is as follows:

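import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Custom schema: QData > Measure > (Answer, Question).
// The top-level StructField is wrapped in a StructType so the declared return type matches.
def getCustomSchema(): StructType = StructType(Array(
  StructField("QData",
    StructType(Array(
      StructField("Measure",
        StructType(Array(
          StructField("Answer", StringType, true),
          StructField("Question", StringType, true)
        )), true)
    )), true)
))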

When I try to access the data in Measure, I only get "some text here", and when I try to read the value of Answer it fails. I also only get a single Measure.
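
To make the mixed-content shape concrete, here is a minimal sketch that uses only Python's standard-library `xml.etree.ElementTree`, not Spark. It shows that each `Measure` element carries both its own leading text and the `Answer` and `Question` children, which is the structure a Spark schema would need to account for. The helper name `parse_measures` and the `text` key are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the XML in the question, with both Measure elements closed.
XML = """<QData>
    <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
    </Measure>
    <Measure> some text here
        <Answer>Answer1</Answer>
        <Question>Question1</Question>
    </Measure>
</QData>"""


def parse_measures(xml_text):
    """Return one dict per <Measure>: its own text plus its child values."""
    root = ET.fromstring(xml_text)
    rows = []
    for measure in root.findall("Measure"):
        rows.append({
            # Element.text holds the text that precedes the first child element.
            "text": (measure.text or "").strip(),
            "Answer": measure.findtext("Answer"),
            "Question": measure.findtext("Question"),
        })
    return rows


rows = parse_measures(XML)
for row in rows:
    print(row)
```

Both `Measure` elements come back, each with its free text and its two child values, so the data itself is well-formed once the closing tag is correct.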

Edit: this is how I am trying to access the data:

val result = sc.read.format("com.databricks.spark.xml").option("attributePrefix", "attr_").schema(getCustomSchema)
    .load(filename.toString)

val qDfTemp = result.mapPartitions { partition =>
  val mapper = new QDMapper()
  partition.map(row => mapper(row)).flatMap(list => list)
}.toDF()

case class QDMapper(){
    def apply(row: Row):List[QData]={
        val qDList = new ListBuffer[QData]()
        val qualData = row.getAs[Row]("QData") //When I print as list I get the first Measure text and that is it
        val measure = qualData.getAs[Row]("Measure") //This fails
}
}

小智 0

You can use a row tag as the root tag and then access the other elements:

df_schema = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='<xml_tag_name>').load(schema_path)

See https://github.com/harshaltaware/Pyspark/blob/main/Spark-data-parsing/xmlparsing.py for a short code sample.
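
Applied to the XML in the question, the row-tag suggestion might look like the sketch below. It assumes `spark` is a SparkSession (or SQLContext) with the spark-xml package on the classpath; the function name `read_measures` and the path argument are hypothetical. With `rowTag` set to `Measure`, each `<Measure>` element should become one row, so `Answer` and `Question` are top-level columns. Whether the free text inside `Measure` is also exposed (for example under the `valueTag` column, `_VALUE` by default) may depend on the spark-xml version and has not been verified here.

```python
def read_measures(spark, path):
    """Load each <Measure> element as one DataFrame row.

    Assumptions: `spark` is a SparkSession or SQLContext with spark-xml
    available, and `path` points to an XML file like the one in the question.
    """
    df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "Measure")  # one row per <Measure> element
        .load(path)
    )
    # Keep only the child elements the question is trying to read.
    return df.select("Answer", "Question")
```

A call such as `read_measures(spark, "qdata.xml").show()` would then list one row per `Measure`, which also addresses the "only one Measure" symptom if the row-per-element assumption holds.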