Read xlsx from Azure blob storage into a pandas DataFrame without creating a temporary file

use*_*330 8 python azure pandas

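I am trying to read an xlsx file from Azure blob storage into a pandas DataFrame without creating a temporary local file. I have seen many similar questions, e.g. "Issues Reading Azure Blob CSV Into Python Pandas DF", but have not managed to get the suggested solutions to work.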

The code snippet below raises the exception UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 14: invalid start byte.

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient

blob_client = BlobClient.from_blob_url(blob_url = url + container + "/" + blobname, credential = token)   
blob = blob_client.download_blob().content_as_text()   
df = pd.read_excel(StringIO(blob))

Using a temporary file, I did manage to make it work with the following snippet:

blob_service_client = BlobServiceClient(account_url = url, credential = token)
blob_client = blob_service_client.get_blob_client(container=container, blob=blobname)

with open(tmpfile, "wb") as my_blob:
    download_stream = blob_client.download_blob()
    my_blob.write(download_stream.readall())

data = pd.read_excel(tmpfile)

Roa*_*ner 9

Similar to what you have already done, we can use download_blob() to get a StorageStreamDownloader object in memory, then content_as_text() to decode the content into a string.

We can then read the CSV content from a StringIO buffer into a pandas DataFrame with pandas.read_csv():

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient
import os

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.csv")

blob = blob_client.download_blob().content_as_text()

df = pd.read_csv(StringIO(blob))

Update

If we are working with an XLSX file, use content_as_bytes() to get bytes instead of a string, then convert them to a pandas DataFrame with pandas.read_excel():

from io import BytesIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient
import os

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.xlsx")

# Download the workbook as raw bytes; an XLSX file is binary, not text
blob = blob_client.download_blob().content_as_bytes()

# Wrap the bytes in a file-like buffer; recent pandas versions deprecate passing raw bytes directly
df = pd.read_excel(BytesIO(blob))

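Since content_as_text() decodes as UTF-8 by default, and an XLSX file is binary (a ZIP archive) rather than text, this is most likely what causes the UnicodeDecodeError.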
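To see why the text decoding fails, here is a small sketch that needs neither Azure nor pandas. It builds a tiny ZIP archive in memory as a stand-in for a real workbook (the member name and payload are made up for illustration) and shows that the bytes cannot be decoded as UTF-8, while they work fine when kept as bytes and wrapped in BytesIO.

```python
import io
import zipfile

# Build a small ZIP archive in memory as a stand-in for a real .xlsx file.
# An .xlsx workbook is a ZIP container; the member name and payload here
# are invented purely for the demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", b"123456789")
raw = buffer.getvalue()

# Every ZIP-based file starts with this signature.
print(raw[:4])  # b'PK\x03\x04'

# Decoding the binary container as UTF-8 fails, which is the same kind of
# error that content_as_text() produced in the question.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"UnicodeDecodeError: {exc.reason}")

# Kept as bytes, the same data can be handed to a reader through BytesIO.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
print(names)
```

The exact byte and position in the error message depend on the file, so they will differ from the ones in the question.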

Alternatively, we can keep using content_as_text() together with pandas.read_excel() if we set the encoding to None:

blob = blob_client.download_blob().content_as_text(encoding=None)

df = pd.read_excel(blob)
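Putting the pieces together, a minimal helper could look like the sketch below. It assumes blob_client is an azure.storage.blob.BlobClient and uses readall() on the downloader, the same call the question's temporary-file version already relies on. The function name read_excel_from_blob and the reader parameter are hypothetical, not part of pandas or the Azure SDK; reader only exists so the parsing step can be swapped out, and it falls back to pandas.read_excel.

```python
from io import BytesIO


def read_excel_from_blob(blob_client, reader=None, **reader_kwargs):
    """Download a blob into memory and parse it as an Excel workbook.

    `blob_client` is expected to be an azure.storage.blob.BlobClient.
    `reader` is a hypothetical hook for this sketch; it defaults to
    pandas.read_excel. Extra keyword arguments are passed to the reader.
    """
    if reader is None:
        # Imported lazily so the helper can be loaded without pandas installed.
        import pandas as pd
        reader = pd.read_excel

    # readall() returns the whole blob as bytes; nothing is written to disk.
    data = blob_client.download_blob().readall()

    # Wrap the bytes in a file-like object, which read_excel accepts.
    return reader(BytesIO(data), **reader_kwargs)
```

With the blob_client from the snippets above, this would be called as df = read_excel_from_blob(blob_client, sheet_name=0). Note that pandas needs an Excel engine such as openpyxl installed to parse .xlsx files.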