Accessing nested columns in a PySpark DataFrame

run*_*i74 5 dataframe apache-spark pyspark

I have an XML document that looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Position>
    <Search>
        <Location>
            <Region>OH</Region>
            <Country>us</Country>
            <Longitude>-816071</Longitude>
            <Latitude>415051</Latitude>
        </Location>
    </Search>
</Position>

I read it into a DataFrame:

df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='Position').load('1.xml')

I can see one column:

df.columns
['Search']

print(df.select("Search"))
DataFrame[Search: struct<Location:struct<Country:string,Latitude:bigint,Longitude:bigint,Region:string>>]

How do I access a nested column, e.g. Location.Region?

Pra*_*ode 9

You can do the following:

df.select("Search.Location.*").show()

Output:

+-------+--------+---------+------+
|Country|Latitude|Longitude|Region|
+-------+--------+---------+------+
|     us|  415051|  -816071|    OH|
+-------+--------+---------+------+
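
The `.*` form expands every field under `Location`. To pick a single leaf, the dotted path can be passed to `select` directly, e.g. `df.select("Search.Location.Region")`. For deeper schemas it can help to list all the dotted leaf paths first. Below is a minimal sketch of that idea; `leaf_paths` is a hypothetical helper (not part of PySpark), and the `schema` dict is a hand-written stand-in that assumes the shape returned by `df.schema.jsonValue()`. It also assumes field names contain no dots.

```python
def leaf_paths(schema, prefix=""):
    """Return the dotted path of every non-struct field in a schema dict."""
    paths = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        # Nested structs appear as dicts with "type": "struct"; recurse into them.
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            paths.extend(leaf_paths(ftype, name + "."))
        else:
            # Atomic types (and arrays/maps) are treated as leaves here.
            paths.append(name)
    return paths


def field(name, ftype):
    """Build one field entry in the assumed schema-dict shape."""
    return {"name": name, "type": ftype, "nullable": True, "metadata": {}}


# Hand-written stand-in for df.schema.jsonValue() on the DataFrame above.
location = {
    "type": "struct",
    "fields": [
        field("Country", "string"),
        field("Latitude", "long"),
        field("Longitude", "long"),
        field("Region", "string"),
    ],
}
search = {"type": "struct", "fields": [field("Location", location)]}
schema = {"type": "struct", "fields": [field("Search", search)]}

paths = leaf_paths(schema)
print(paths)
```

With a live DataFrame this would be called as `leaf_paths(df.schema.jsonValue())`, and the resulting list can be passed on as `df.select(*paths)`.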