csv apache-spark pyspark
Suppose I have a Spark DataFrame which I want to save as a CSV file. After Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the path provided.
How would I save the DF as a single file?
One way to handle it is to coalesce the DF and then save the file:
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the drawback of collecting the data on the master machine, and therefore requires a master with enough memory.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the code above?
I just solved this myself using pyspark with dbutils to get the .csv and rename it to the desired filename.
save_location= "s3a://landing-bucket-test/export/"+year
csv_location = save_location+"temp.folder"
file_location = save_location+'export.csv'
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer could be improved by not using [-1], but the .csv always seems to be the last file in the folder; a sketch of that improvement follows below. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
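For instance, a minimal sketch of that improvement (my addition, reusing the variables defined above): select the part file by its .csv suffix instead of relying on listing order.
# Pick the single part file by suffix rather than position in the listing.
file = [f.path for f in dbutils.fs.ls(csv_location) if f.path.endswith(".csv")][0]
dbutils.fs.cp(file, file_location)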
Alternatively, convert the DataFrame to pandas and write the CSV locally:
df.toPandas().to_csv("sample_file.csv", header=True)
Note that toPandas() collects the entire DataFrame into the driver's memory, so this only works when the data fits on a single machine. For details, see the documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas
Another option is to coalesce and pass the write options directly (note that "inferSchema" is a read-time option and has no effect on write, so it is omitted here):
df.coalesce(1).write.csv("/newFolder", header=True, dateFormat="yyyy-MM-dd HH:mm:ss")
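The write above still produces a directory ("/newFolder") containing a single part-*.csv file. A hedged sketch of one way to give that part file a stable name (my addition, not from the original answer), using the Hadoop FileSystem API reached through PySpark's internal JVM gateway; spark._jvm and spark._jsc are undocumented internals, and "export.csv" is an arbitrary target name:
hadoop_fs = spark._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Find the single part file inside the output directory and rename it in place.
part = [s.getPath() for s in fs.listStatus(hadoop_fs.Path("/newFolder"))
        if s.getPath().getName().startswith("part-")][0]
fs.rename(part, hadoop_fs.Path("/newFolder/export.csv"))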
This solution is based on a shell script and is not parallelized, but it is still very fast, especially on SSDs. It uses cat and output redirection on Unix systems. Suppose that the CSV directory containing the partitions is located at /my/csv/dir and that the output file is /my/csv/output.csv:
#!/bin/bash
echo "col1,col2,col3" > /my/csv/output.csv
for i in /my/csv/dir/*.csv ; do
    echo "Processing $i"
    cat "$i" >> /my/csv/output.csv
    rm "$i"
done
echo "Done"
It removes each partition after appending it to the final CSV in order to free space.
"col1,col2,col3" is the CSV header (here we have three columns named col1, col2 and col3). You must tell Spark not to put the header in each partition (this is accomplished with .option("header", "false")), because the shell script will add it.