MultiIndex Pandas DataFrame to Spark DataFrame and the missing index

Kev*_*osi 2 multi-index pandas apache-spark apache-spark-sql pyspark

I have a MultiIndex Pandas DataFrame. How can I convert it to a Spark DataFrame without losing the index? This is easy to test with a toy example:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# NB: index=arrays builds an *unnamed* MultiIndex; the named `index`
# object above is never used, so the levels end up without names.
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df_spark = sqlContext.createDataFrame(df)
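Printing the schema of the result shows only the value columns:

df_spark.printSchema()

root
 |-- 0: double (nullable = true)
 |-- 1: double (nullable = true)
 |-- 2: double (nullable = true)
 |-- 3: double (nullable = true)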

All of the index levels are lost. Is there anything else I need to do to preserve the index?

use*_*411 6

Spark SQL has no notion of an index, so if you want to preserve it, it has to be reset or assigned to a column first:

# reset_index moves every index level into a regular column
df_spark = sqlContext.createDataFrame(df.reset_index(drop=False))

This creates a DataFrame with an additional column for each field in the index:

df_spark.printSchema()
root
 |-- level_0: string (nullable = true)
 |-- level_1: string (nullable = true)
 |-- 0: double (nullable = true)
 |-- 1: double (nullable = true)
 |-- 2: double (nullable = true)
 |-- 3: double (nullable = true)
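As a side note, if you later collect the result back to pandas, you can rebuild the MultiIndex from those columns (names taken from the schema above):

pdf = df_spark.toPandas().set_index(['level_0', 'level_1'])  # restore both index levels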

You can also reset the index inplace to avoid the extra memory overhead of the copy:

# inplace=True mutates df instead of building a reset copy first
df.reset_index(drop=False, inplace=True)
df_spark = sqlContext.createDataFrame(df)
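Note that if the index levels are named (for example by building the frame with the `index` object from the question instead of the raw arrays), reset_index uses those names for the new columns:

df_named = pd.DataFrame(np.random.randn(8, 4), index=index)  # named MultiIndex
sqlContext.createDataFrame(df_named.reset_index()).printSchema()

root
 |-- first: string (nullable = true)
 |-- second: string (nullable = true)
 ... (the four double columns follow as before)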