Post by Gun*_*wal

Parquet column cannot be converted in file, Expected: bigint, Found: INT32

I have a Glue table with a tlc column whose data type is bigint. I am trying to do the following with PySpark:

  1. Read the Glue table into a DataFrame
  2. Join it with another table
  3. Write the resulting DataFrame to an S3 path

My code looks like this:

df = spark.sql('select tlc from monthly_table')
df.createOrReplaceTempView('sdc')

df_a = spark.sql('select tlc from monthly_table_2')
df_a.createOrReplaceTempView('abc')

df_moves = spark.sql('select * from abc a left join sdc s on a.tlc = s.tlc')
df_moves.write.parquet('<s3_path>', mode='overwrite')

This fails with the error below:

Parquet column cannot be converted in file s3://<s3_path>. Column: [tlc], Expected: bigint, Found: INT32

Full stack trace:

py4j.protocol.Py4JJavaError: An error occurred while calling o419.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) …
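The error suggests the catalog schema declares tlc as bigint while at least one underlying Parquet file stores it as INT32. A minimal sketch for confirming that mismatch (assuming the table's storage location is readable directly; '<underlying_s3_path>' is a placeholder, not a path from my job):

# Schema Spark resolves through the Glue catalog
spark.table('monthly_table').printSchema()

# Physical schema read straight from the underlying Parquet files
spark.read.parquet('<underlying_s3_path>').printSchema()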

amazon-emr apache-spark parquet pyspark aws-glue

6 recommendations · 1 solution · 1517 views
