Posts by UrV*_*Val

Parquet error when joining two dataframes. UnsupportedOperationException: ...PlainValuesDictionary$PlainLongDictionary

I have 2 dataframes with the following schemas:

root
 |-- pid: long (nullable = true)
 |-- lv: timestamp (nullable = true)
root
 |-- m_pid: long (nullable = true)
 |-- vp: double (nullable = true)
 |-- created: timestamp (nullable = true)


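If I call show() on either of these dataframes on its own, everything works and the first 20 rows are displayed. The problem appears when I join the 2 dataframes and show the result (the join transformation itself raises no error; the error occurs only when the show action is called):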

var joined = df1.join(df2, df2("pid") === df1("m_pid")).drop("m_pid")
joined.show()


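I get an error that I don't understand. It has something to do with Parquet. One of the dataframes is read from Parquet (the other is read from text), but if this were a problem with reading the data, why does it occur only when the dataframes are joined and not when each one is shown separately?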

The error is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 403.0 failed 4 times, most recent failure: Lost task 51.3 in stage 403.0 : java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

...

Caused by: …

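One hedged way to narrow this down: show() only reads as many partitions as it needs to produce 20 rows, while a join has to scan both inputs completely, so a file that cannot be decoded may simply never be touched by show(). The sketch below forces a full scan of a single dataframe without any join. `ForceFullScan` and `forceFullScan` are hypothetical names used for illustration, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame

object ForceFullScan {
  // Hypothetical helper: converts every row of every partition to a Row,
  // which decodes all columns, and returns the number of rows seen.
  // Going through df.rdd avoids the column pruning that df.count() may apply.
  def forceFullScan(df: DataFrame): Long = df.rdd.count()
}

// Usage (df1 and df2 are the dataframes from the question):
//   ForceFullScan.forceFullScan(df1)
//   ForceFullScan.forceFullScan(df2)
```

If one of these calls fails with the same UnsupportedOperationException, the problem is probably in reading that dataframe (for example, a mismatch between the declared schema and what is stored in some of the Parquet files) rather than in the join itself.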

scala spark-dataframe

UrV*_*Val

2016-12-19

5
votes

0
answers

2592
views

Spark Scala: fill NA with today's timestamp

How can I replace all the null values in a column of type timestamp?

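I expected this to be easier, but I can't seem to get the types right. I think one solution would be to convert the column to String, fill in today's date as a string, and then convert back to timestamp, but is there a more elegant solution?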

val today = java.time.LocalDate.now()
var todayStamp = java.sql.Timestamp.valueOf(today.atStartOfDay());
df = df.na.fill(Map("expiration" -> todayStamp))


This results in:

java.lang.IllegalArgumentException: Unsupported value type java.sql.Timestamp


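Using today instead doesn't work either, and unix_timestamp(string).cast("timestamp") expects a Column rather than a String. I suppose I could use it in the "ugly" approach mentioned above.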

Later edit: I forgot to mention that using an Int or a String with the df.na.fill method on the timestamp column also results in an error:

org.apache.spark.sql.AnalysisException: cannot resolve 'coalesce(expiration, 0)' due to data type mismatch: input to function coalesce should all be the same type, but it's [timestamp, int];

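A possible alternative to the string round trip, sketched under the assumption that na.fill in this Spark version does not accept java.sql.Timestamp values (as the first error suggests): call coalesce directly with a timestamp literal built by lit, so both inputs have the same type. `FillNullTimestamp` and `fillNullTimestamp` are hypothetical names used for illustration.

```scala
import java.sql.Timestamp
import java.time.LocalDate

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

object FillNullTimestamp {
  // Hypothetical helper: replaces nulls in a timestamp column with a fixed value.
  // lit(value) yields a timestamp-typed literal, so coalesce sees
  // [timestamp, timestamp] instead of [timestamp, int].
  def fillNullTimestamp(df: DataFrame, column: String, value: Timestamp): DataFrame =
    df.withColumn(column, coalesce(col(column), lit(value)))
}

// Usage (df and the "expiration" column are taken from the question):
//   val todayStamp = Timestamp.valueOf(LocalDate.now().atStartOfDay())
//   val filled = FillNullTimestamp.fillNullTimestamp(df, "expiration", todayStamp)
```

Since withColumn returns a new dataframe, the result can be assigned to a val, and df does not need to be declared as a var.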

apache-spark apache-spark-sql

UrV*_*Val

2019-01-14

0
votes

1
answer

2830
views