python hadoop dataframe apache-spark pyspark
This is the Spark DataFrame I want to save as CSV.
type(MyDataFrame)
--Output: <class 'pyspark.sql.dataframe.DataFrame'>
To save it as CSV, I have the following code:
MyDataFrame.write.csv(csv_path, mode = 'overwrite', header = 'true')
When I save it, the file name looks like this:
part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv
Is there a way to give it a custom name when saving, like "MyDataFrame.csv"?
I had the same need. You can write to a directory first and then rename the part file inside it. Here is my solution.
def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    Write a DataFrame to HDFS as a single CSV file with a custom name.

    :param df: DataFrame you want to save
    :param spark: SparkSession
    :param hdfs_path: target directory (should not already exist)
    :param file_name: desired CSV file name
    :return: True if the rename succeeded
    """
    sc = spark.sparkContext
    # Access the Hadoop FileSystem API through the Py4J gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    # Coalesce to one partition so Spark writes a single part file
    df.coalesce(1).write.option("header", True).option("delimiter", "|").option("compression", "none").csv(hdfs_path)
    # Use the active Hadoop configuration so HDFS settings are picked up
    fs = FileSystem.get(sc._jsc.hadoopConfiguration())
    # Find the generated part file and rename it to the desired name
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result
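The same write-then-rename idea can be sketched on a local filesystem without Spark or HDFS, using only Python's standard library. This is a minimal illustration of the mechanics (the directory, the fake part file, and its contents below are made up for the example); Spark writes its output as `part-*` files inside a directory, so you locate the single part file with a glob and rename it to the name you want:

```python
import glob
import os
import tempfile

# Simulate a Spark output directory containing one part file.
# In real use, df.coalesce(1).write.csv(out_dir) would create this.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000-abc-c000.csv"), "w") as f:
    f.write("col1|col2\n1|2\n")

# Locate the part file, then rename it to the desired name.
part_file = glob.glob(os.path.join(out_dir, "part*"))[0]
target = os.path.join(out_dir, "MyDataFrame.csv")
os.rename(part_file, target)
```

After the rename, the directory contains only `MyDataFrame.csv`; the data itself is untouched, since a rename only changes the directory entry.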
Viewed: 10501 times