过滤 Spark 分区表在 Pyspark 中不起作用

Question

过滤 Spark 分区表在 Pyspark 中不起作用

我正在使用 Spark 2.3，并使用 pyspark 中的数据帧编写器类方法编写了一个数据帧来创建 Hive 分区表。

newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')

Run Code Online (Sandbox Code Playgroud)

这是我的表结构和分区信息。

hive> desc emp.partition_Load_table;
OK
veh_code                varchar(17)
veh_flag                varchar(1)
veh_model               smallint
veh_country             varchar(3)

# Partition Information
# col_name              data_type               comment

veh_country              varchar(3)

hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS

Run Code Online (Sandbox Code Playgroud)

现在我正在数据框中的 pyspark 中读回该表。

    df2_data = spark.sql("""
    SELECT * 
    from udb.partition_Load_table
    """);

df2_data.show() --> is working

Run Code Online (Sandbox Code Playgroud)

但我无法使用分区键列过滤它

from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')

Run Code Online (Sandbox Code Playgroud)

我收到以下错误消息：

: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. 
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, 
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)

Run Code Online (Sandbox Code Playgroud)

而当我通过指定表的 hdfs 绝对路径创建数据框时。过滤器和 where 子句按预期工作。

newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")

Run Code Online (Sandbox Code Playgroud)

下面正在工作

newdataframe.where(col("veh_country")=='CHN').show()

Run Code Online (Sandbox Code Playgroud)

我的问题是为什么它无法首先过滤数据帧。以及为什么它抛出错误消息“仅在字符串类型的分区键上支持过滤”，即使我的 veh_country 被定义为字符串或 varchar 数据类型。

Answer 1

小智 5

我也偶然发现了这个问题。对我有帮助的是执行以下操作：

spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")

Run Code Online (Sandbox Code Playgroud)

然后使用spark.sql(query)而不是使用数据框。

我不知道幕后发生了什么，但这解决了我的问题。

虽然这对你来说可能为时已晚（因为这个问题是 8 个月前提出的），但这可能对其他人有帮助。

归档时间：	7 年，1 月前
查看次数：	13254 次
最近记录：	3 年，2 月前