Databricks: writing a Spark DataFrame directly to Excel

myt*_*abi 3 databricks

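Is there any way to write a Spark DataFrame directly to xls/xlsx format?

Most examples on the web show how to do this with a pandas DataFrame.

But I would like to work with my data as a Spark DataFrame. Any ideas?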

San*_*tte 8

You can generate an Excel file directly from PySpark, without converting to pandas first:

df_spark.write.format("com.crealytics.spark.excel")\
  .option("header", "true")\
  .mode("overwrite")\
  .save(path)

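To be able to run the code above, you need to install the com.crealytics:spark-excel_2.12:0.13.5 library (or, of course, a newer version). In Azure Databricks, for example, you can do this by adding it as a new Maven library in the cluster's library list (reachable from one of the buttons on the left sidebar of the Databricks UI).

For more information, see https://github.com/crealytics/spark-excel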
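Outside the Databricks cluster UI, the same Maven coordinate can be handed to Spark through the `spark.jars.packages` setting. The following is only a sketch: both helper names are hypothetical, and it assumes you start the Spark session yourself (on Databricks the session already exists, so the cluster library list is the place to add it). The `_2.12` suffix in the artifact name is the Scala binary version, which has to match the Scala version your Spark build uses.

```python
def spark_excel_coordinate(scala_version: str = "2.12",
                           library_version: str = "0.13.5") -> str:
    """Build the Maven coordinate 'group:artifact_scalaVersion:version'."""
    return f"com.crealytics:spark-excel_{scala_version}:{library_version}"


def build_session_with_spark_excel(coordinate: str):
    """Start a Spark session that downloads the given Maven package.

    Hypothetical helper: PySpark is imported lazily so that the
    coordinate helper above also works where PySpark is not installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("excel-export")  # illustrative application name
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# Example (needs PySpark and network access to a Maven repository):
# spark = build_session_with_spark_excel(spark_excel_coordinate())
```

The setting only takes effect if it is applied before the session is created, which is why it goes on the builder rather than on an existing session.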


小智 6

I'm assuming that, because you used the "databricks" tag, you want to create an .xlsx file in the Databricks file store and that you are running your code in Databricks notebooks. I'm also going to assume that your notebooks run Python.

Spark has no built-in way to save an Excel document from a Spark DataFrame. You can, however, convert the Spark DataFrame to a pandas DataFrame and export from there. We'll need to start by installing the xlsxwriter package. You can do this for your notebook environment using a Databricks utilities command:

dbutils.library.installPyPI('xlsxwriter')
dbutils.library.restartPython()

I was having a few permission issues saving an Excel file directly to DBFS. A quick workaround was to save to the cluster's default directory and then move the file into DBFS with sudo. Here's some example code:

# Creating dummy spark dataframe
spark_df = spark.sql('SELECT * FROM default.test_delta LIMIT 100')

# Converting spark dataframe to pandas dataframe
pandas_df = spark_df.toPandas()

# Exporting pandas dataframe to xlsx file
pandas_df.to_excel('excel_test.xlsx', engine='xlsxwriter')

Then, in a new cell, use %sh to run the move as a shell command:

%sh
sudo mv excel_test.xlsx /dbfs/mnt/data/
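The same write-locally-then-move workaround can also be sketched in Python instead of a %sh cell. This is not part of the original answer: `write_then_copy` is a hypothetical helper, and it assumes the /dbfs mount is writable from the driver without sudo, which may not hold on every cluster.

```python
import os
import shutil
import tempfile


def write_then_copy(write_fn, dest_path):
    """Create a file locally via write_fn(local_path), then copy it to dest_path.

    write_fn is any callable that writes a file at the path it is given.
    """
    suffix = os.path.splitext(dest_path)[1]
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Stage the file on local disk first.
        local_path = os.path.join(tmp_dir, "export" + suffix)
        write_fn(local_path)

        # Create the destination directory if needed, then copy the file over.
        dest_dir = os.path.dirname(dest_path)
        if dest_dir:
            os.makedirs(dest_dir, exist_ok=True)
        shutil.copyfile(local_path, dest_path)
    return dest_path


# Hypothetical usage on Databricks, with pandas_df from the code above:
# write_then_copy(
#     lambda p: pandas_df.to_excel(p, engine="xlsxwriter"),
#     "/dbfs/mnt/data/excel_test.xlsx",
# )
```

Passing the writer as a callable keeps the helper independent of pandas, so the same staging step works for any library that writes to a local path.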

  • Keep in mind that your DataFrame must fit in the driver's memory, otherwise this approach will crash your program. (2 upvotes)
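One way to act on this warning is to check the size of the data before calling toPandas(). The sketch below is an assumption-laden illustration, not a guarantee: `can_export_to_one_sheet` is a hypothetical helper and `max_cells` is an arbitrary budget you would have to tune for your driver. The row and column limits are those of a single .xlsx worksheet.

```python
XLSX_MAX_ROWS = 1_048_576  # maximum rows in one .xlsx worksheet
XLSX_MAX_COLS = 16_384     # maximum columns in one .xlsx worksheet


def can_export_to_one_sheet(n_rows: int, n_cols: int,
                            max_cells: int = 5_000_000) -> bool:
    """Return True if the data fits one sheet and stays under a cell budget.

    max_cells is a hypothetical stand-in for "fits in driver memory";
    the real limit depends on column types and driver size.
    """
    rows_with_header = n_rows + 1  # the header occupies one sheet row
    if rows_with_header > XLSX_MAX_ROWS or n_cols > XLSX_MAX_COLS:
        return False
    return n_rows * n_cols <= max_cells


# Hypothetical usage with the spark_df from the answer above:
# n_rows, n_cols = spark_df.count(), len(spark_df.columns)
# if can_export_to_one_sheet(n_rows, n_cols):
#     pandas_df = spark_df.toPandas()
# else:
#     raise ValueError("DataFrame too large to export via pandas")
```

Note that count() triggers a Spark job of its own, so the check is not free on large inputs.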

ASH*_*ASH 0

I believe you can do it like this.

sourcePropertySet.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("D:\\resultset.csv")

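I'm not sure whether you can write directly to Excel, but Excel can definitely open a CSV. This is almost certainly the easiest way to do this kind of thing, and also the cleanest. Excel has all sorts of formatting features (merged cells, for example) that can cause errors when the file is used in some systems.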
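If a single CSV file that Excel opens cleanly is the goal, one option is to write it on the driver with Python's csv module. This is a sketch under assumptions: `write_excel_friendly_csv` is a hypothetical helper and the output path in the usage comment is illustrative. The `utf-8-sig` encoding writes a byte order mark, which helps Excel recognise the file as UTF-8.

```python
import csv


def write_excel_friendly_csv(path, header, rows):
    """Write a header and an iterable of rows to a CSV file for Excel."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Hypothetical usage with a Spark DataFrame named df. toLocalIterator()
# brings rows to the driver one partition at a time, and each Row is a
# tuple, so csv.writer can consume it directly:
# write_excel_friendly_csv("/dbfs/mnt/data/resultset.csv",
#                          df.columns, df.toLocalIterator())
```

Unlike Spark's own CSV writer, which produces a directory of part files, this yields exactly one file, at the cost of funnelling every row through the driver.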