How to handle null values when writing to Parquet from Spark

Question

Until recently Parquet did not support null values, which is a questionable premise. In fact, a recent version did add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However, it will be a long time before Spark supports that new Parquet feature, if it ever does. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

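So what are people doing today about null column values when writing a DataFrame out to Parquet? I can only think of very ugly hacks such as writing empty strings. For numerical values I have no idea how to indicate null, short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).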

Answer 1

hi-*_*zir 12

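You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

The problem is that null on its own carries no type information at all: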

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per the comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
 |-- comments: double (nullable = true)

and the result can be safely written to Parquet.
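For DataFrames that already exist, the same cast can be applied per column before writing. The following is only a rough PySpark sketch: the helper names `cast_exprs` and `write_with_typed_nulls`, the column names, and the output path are hypothetical, and `df` is assumed to be a `pyspark.sql.DataFrame` whose untyped null columns you already know by name.

```python
def cast_exprs(all_columns, target_types):
    """Build Spark SQL expression strings, casting selected columns.

    all_columns:  column names in their original order (e.g. df.columns)
    target_types: dict mapping column name -> SQL type name, e.g. {"comments": "DOUBLE"}
    """
    exprs = []
    for name in all_columns:
        if name in target_types:
            # Give the column an explicit type, as in CAST(null AS DOUBLE)
            exprs.append(f"CAST(`{name}` AS {target_types[name]}) AS `{name}`")
        else:
            # Leave every other column untouched
            exprs.append(f"`{name}`")
    return exprs


def write_with_typed_nulls(df, target_types, path):
    """Cast the given columns, then write the result as Parquet.

    Assumes df is a pyspark.sql.DataFrame; overwrites anything at `path`.
    """
    typed = df.selectExpr(*cast_exprs(df.columns, target_types))
    typed.write.mode("overwrite").parquet(path)
    return typed


# Hypothetical usage, given an active SparkSession named `spark`:
#   df = spark.sql("SELECT 1 AS id, null AS comments")
#   write_with_typed_nulls(df, {"comments": "DOUBLE"}, "/tmp/typed_nulls.parquet")
```

Building the expressions as plain strings keeps the column order intact and makes it easy to inspect what will be sent to `selectExpr` before anything is written.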

Answer 2

Dan*_*iel 5

I wrote a PySpark solution for this (df is a dataframe with columns of NullType):

from pyspark.sql.types import NullType

# get dataframe schema as a list of StructField objects
my_schema = list(df.schema)

null_cols = []

# iterate over schema list to filter for NullType columns
# (isinstance also works where str(st.dataType) is 'NullType()' rather than 'NullType')
for st in my_schema:
    if isinstance(st.dataType, NullType):
        null_cols.append(st)

# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = ncol.name
    df = df.withColumn(mycolname, df[mycolname].cast('string'))

This solution can be extended to handle nested NullType columns by simply changing the `if` line to `if 'NullType' in str(st.dataType):` (3 upvotes)
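As an alternative to matching on the string form of the type, nested NullType fields can also be found by walking the schema's JSON form, which `df.schema.jsonValue()` returns as plain dicts and strings. This is only a sketch under assumptions: the function names are hypothetical, and the set of names that denote NullType (`"void"` or `"null"`, depending on the Spark version) is an assumption you should verify against your own schema output.

```python
# Assumption: NullType appears as "void" in newer Spark versions and "null" in older ones.
NULL_TYPE_NAMES = {"void", "null"}


def contains_null_type(dtype):
    """Return True if a JSON-form data type is, or contains, a NullType."""
    if isinstance(dtype, str):
        # Atomic types are plain strings such as "long", "string", "void"
        return dtype in NULL_TYPE_NAMES
    kind = dtype.get("type")
    if kind == "struct":
        return any(contains_null_type(f["type"]) for f in dtype["fields"])
    if kind == "array":
        return contains_null_type(dtype["elementType"])
    if kind == "map":
        return (contains_null_type(dtype["keyType"])
                or contains_null_type(dtype["valueType"]))
    # Anything else (e.g. user-defined types) is treated as not containing NullType
    return False


def columns_with_null_type(schema_json):
    """List top-level column names whose type contains a NullType anywhere."""
    return [f["name"] for f in schema_json["fields"] if contains_null_type(f["type"])]


# Hand-written schema in the same shape as df.schema.jsonValue(), for illustration only
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "comments", "type": "void", "nullable": True, "metadata": {}},
        {"name": "nested", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [{"name": "x", "type": "null", "nullable": True, "metadata": {}}],
        }},
        {"name": "tags", "nullable": True, "metadata": {}, "type": {
            "type": "array", "elementType": "string", "containsNull": True,
        }},
    ],
}
print(columns_with_null_type(sample_schema))
```

With a real DataFrame the call would be `columns_with_null_type(df.schema.jsonValue())`. How to cast the nested columns it reports is a separate decision; casting the whole column to string, as in the answer above, discards the nested structure.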
