Reading an Excel file from Azure Databricks

Question

Sre*_*har 6 excel python-3.x azure-databricks azure-data-lake-gen2

I am trying to read an Excel file (.xlsx) from Azure Databricks. The file is stored in ADLS Gen 2.

Example:

srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"

Reading the Parquet file from the path works fine.

srcparquetDF = spark.read.parquet(srcPathforParquet )

Reading the Excel file from the path throws an error: No such file or directory

srcexcelDF = pd.read_excel(srcPathforExcel , keep_default_na=False, na_values=[''])

Answer 1

Jim*_* Xu 4

The pandas.read_excel method does not support accessing files through wasbs or abfss scheme URLs. For more details, see here.

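So, if you want to access the file with pandas, I suggest you create a SAS token and access the file through the https scheme with the SAS token, or download the file as a stream and then read it with pandas. Alternatively, you can mount the storage account as a file system and then access the file as @CHEEKATLAPRADEEP-MSFT described.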
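
For the mount option mentioned above, a minimal sketch might look like the following. It assumes account-key authentication against the Blob (wasbs) endpoint; the helper name build_mount_args, the mount point /mnt/raw, and the placeholders are illustrative and not taken from the original answer. The dbutils call is left as a comment because it only exists inside a Databricks notebook.

```python
def build_mount_args(container: str, account: str, account_key: str, mount_point: str) -> dict:
    """Assemble keyword arguments for dbutils.fs.mount (hypothetical helper)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": mount_point,
        # Hadoop configuration key that carries the storage account key
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


args = build_mount_args("<container>", "<account name>", "<account key>", "/mnt/raw")

# Inside a Databricks notebook (assumption: dbutils and pandas are available):
# dbutils.fs.mount(**args)
# pdf = pd.read_excel("/dbfs/mnt/raw/src.xlsx")  # local-file view of the mount
```

Once mounted, pandas sees the file through the local /dbfs/... path, so no URL scheme is involved at all.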

For example

Access with a SAS token

Create a SAS token through the Azure portal
Code

import pandas as pd
# HTTPS URL of the file, with the SAS token appended as the query string
pdf = pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)

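Download the file as a stream and read it

Install the azure-storage-file-datalake and xlrd packages in Databricks using pip (xlrd 2.0 and later reads only .xls files, so newer environments need openpyxl for .xlsx)
Code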

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate against the ADLS Gen2 (dfs) endpoint with the account key
service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')

# Point at the file inside the 'test' file system
file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
    downloader = file_client.download_file()
    b = downloader.readinto(f)  # number of bytes written to the buffer
    print(b)
    f.seek(0)  # rewind so the reader starts at the beginning of the buffer
    df = pd.read_excel(f)
    print(df)
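
The download step above can also be wrapped in a small helper that returns a rewound in-memory buffer, which keeps the pandas call separate from the Azure client. This is only a sketch: download_to_buffer is a hypothetical name, and file_client is assumed to be the object returned by get_file_client.

```python
import io


def download_to_buffer(file_client) -> io.BytesIO:
    """Download a Data Lake file into memory and return a rewound buffer."""
    buffer = io.BytesIO()
    # download_file() returns a downloader; readinto() copies its bytes into the buffer
    file_client.download_file().readinto(buffer)
    # Rewind so a reader such as pandas.read_excel starts at byte 0
    buffer.seek(0)
    return buffer
```

With that in place, the read becomes df = pd.read_excel(download_to_buffer(file_client)).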

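Besides this, we can also use PySpark to read the Excel file, but we need to add the com.crealytics:spark-excel jar to our environment. For more details, see here and here.

For example

Add the package com.crealytics:spark-excel_2.12:0.13.1 through Maven. Note that if you use Scala 2.11, add the package com.crealytics:spark-excel_2.11:0.13.1 instead.
Code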

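# Give Spark the storage account key for the ADLS Gen2 (dfs) endpoint
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net", '<account key>')

print("use spark")
# Read the workbook with the spark-excel data source, treating the first row as the header
df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx')

df.show()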
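
Both the configuration key and the abfss path above are built from the same storage account name, so a mismatch between them is an easy mistake to make. As a small illustration, the two strings could be generated from one place; the helper names account_key_conf and abfss_uri are hypothetical, and the values reuse the sample account from the answer.

```python
def account_key_conf(account: str) -> str:
    """Hadoop configuration key that holds the account key for the dfs endpoint."""
    return f"fs.azure.account.key.{account}.dfs.core.windows.net"


def abfss_uri(file_system: str, account: str, path: str) -> str:
    """abfss URI for a file inside an ADLS Gen2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


conf_key = account_key_conf("testadls05")
excel_uri = abfss_uri("test", "testadls05", "/data/sample.xlsx")
print(conf_key)
print(excel_uri)
```

In a notebook, conf_key would go to hadoopConfiguration().set(...) and excel_uri to .load(...).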

Asked: 5 years ago
Viewed: 37084 times
Last active: 4 years, 5 months ago