PySpark write fails with StackOverflowError

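I am planning to convert a fixed-width file to Parquet in AWS Glue. My data has about 1,600 columns and about 3,000 rows. When I try to write the Spark DataFrame (as Parquet), I run into a StackOverflowError.
The problem also occurs when I call count(), show(), and so on. I tried calling cache() and repartition(), but I still see the error.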

If I reduce the number of columns to 500, the code works.

Please help.

Below is my code:

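    import pandas as pd

    # Each fixed-width record is read as one string in the "value" column.
    data_df = spark.read.text(input_path)

    # The schema file lists name, start and length for every field.
    schema_df = pd.read_json(schema_path)
    df = data_df

    # Add one column per field (about 1,600 withColumn calls in total).
    for r in schema_df.itertuples():
        df = df.withColumn(
            str(r.name), df.value.substr(int(r.start), int(r.length))
        )
    df = df.drop("value")

    df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)  # FAILING HERE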

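For context, the PySpark documentation for `withColumn` warns that calling it repeatedly in a loop adds a projection each time and can produce very large plans, up to a StackOverflowException, and it suggests a single `select()` with all columns instead. That would be consistent with the code working at 500 columns and failing at 1,600, though it is not confirmed for this job. Below is a minimal sketch of that alternative, not a verified fix. The names `FieldSpec`, `load_specs` and `slice_fixed_width` are hypothetical helpers introduced for illustration, and the schema rows are assumed to carry `name`, `start` and `length` keys as in the loop above.

```python
from typing import Iterable, List, NamedTuple


class FieldSpec(NamedTuple):
    """One fixed-width field (hypothetical helper type)."""
    name: str
    start: int   # 1-based start position, as Column.substr expects
    length: int


def load_specs(rows: Iterable[dict]) -> List[FieldSpec]:
    """Normalize schema rows with the assumed keys name, start, length."""
    return [
        FieldSpec(str(r["name"]), int(r["start"]), int(r["length"]))
        for r in rows
    ]


def slice_fixed_width(df, specs: List[FieldSpec]):
    """Project every field in one select() instead of chained withColumn()."""
    # Imported lazily so the pure-Python helpers above work without pyspark.
    from pyspark.sql import functions as F

    columns = [
        F.col("value").substr(s.start, s.length).alias(s.name)
        for s in specs
    ]
    # A single projection keeps the logical plan shallow. The raw "value"
    # column is simply not selected, so no drop() is needed afterwards.
    return df.select(*columns)
```

With the variables from the question, this might be used as `df = slice_fixed_width(spark.read.text(input_path), load_specs(schema_df.to_dict("records")))`, followed by the same `df.write` call.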

The stack trace is below.

> 
2021-11-10 05:00:13,542 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/conv_fw_2_pq.py", line 148, in <module>
    partition_ts=parsed_args.partition_timestamp,
  File "/tmp/conv_fw_2_pq.py", line 125, in process_file
    df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", …


fixed-width apache-spark parquet pyspark

San*_*ale

lucky-day

2 votes

1 answer

3898 views