Read CSV from Azure Blob Storage and store it in a DataFrame

Rec*_*tan 6 python python-3.x pandas azure-blob-storage

I am trying to read multiple CSV files from blob storage using Python.

The code I am using is:

blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs(folder_root)
for blob in blobs_list:
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob.name)
    stream = blob_client.download_blob().content_as_text()

I'm not sure what the correct way is to store the CSV files I read in a pandas DataFrame.

I tried using:

df = df.append(pd.read_csv(StringIO(stream)))

But this gives me an error.

Any idea how I can do this?

小智 9

import pandas as pd
data = pd.read_csv('blob_sas_url')

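The blob SAS URL can be found by right-clicking the blob file you want to import in the Azure portal and selecting "Generate SAS". Then click the "Generate SAS token and URL" button and copy the SAS URL into the code above in place of blob_sas_url.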
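A SAS URL carries a time-limited token in its query string, so a URL that worked when it was generated may be rejected later. Here is a minimal sketch, using only the standard library, that pulls the container, blob name and expiry out of such a URL before you hand it to pd.read_csv. The account, container and token values are made up, describe_sas_url is a hypothetical helper, and it assumes the usual https://<account>.blob.core.windows.net/<container>/<blob>?<token> layout with the expiry stored in the se parameter as a full UTC timestamp.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def describe_sas_url(sas_url):
    """Return container, blob name and expiry parsed from a blob SAS URL."""
    parts = urlsplit(sas_url)
    # The path looks like /<container>/<blob path>; split off the container.
    container, _, blob_name = parts.path.lstrip("/").partition("/")
    query = parse_qs(parts.query)
    expiry = None
    if "se" in query:
        # Assumption: 'se' is formatted like 2030-01-01T00:00:00Z (UTC, full seconds).
        expiry = datetime.strptime(query["se"][0], "%Y-%m-%dT%H:%M:%SZ")
        expiry = expiry.replace(tzinfo=timezone.utc)
    return {
        "container": container,
        "blob_name": blob_name,
        "has_signature": "sig" in query,
        "expiry": expiry,
    }


# Hypothetical URL; a real one comes from the portal's "Generate SAS" dialog.
example_url = (
    "https://myaccount.blob.core.windows.net/my-container/data/file.csv"
    "?sv=2022-11-02&se=2030-01-01T00:00:00Z&sr=b&sp=r&sig=abc123"
)
info = describe_sas_url(example_url)
is_expired = info["expiry"] is not None and info["expiry"] < datetime.now(timezone.utc)
```

If is_expired is true, or has_signature is false, generating a fresh SAS URL in the portal is probably the first thing to try.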


unk*_*own 6

You can download the file from blob storage and then read the data from the downloaded file into a pandas DataFrame.

# Requires the legacy SDK (azure-storage-blob 2.x or earlier); BlockBlobService was removed in v12.
import time

import pandas as pd
from azure.storage.blob import BlockBlobService

# Replace the placeholders with your own values.
STORAGEACCOUNTNAME = "<storage_account_name>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"

# Download the blob to a local file and time the transfer
t1 = time.time()
blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
t2 = time.time()
print("It takes %s seconds to download %s" % (t2 - t1, BLOBNAME))

# LOCALFILENAME is the path of the downloaded file
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

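See here for more details.


If you want to convert directly, the following code will help. You need to get the content from the blob object, and get_blob_to_text does not need a local file name.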

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))


mat*_*91t 4

Based on @sahaj-raj-malla's answer, here are 2 code snippets for loading (or saving) a file from blob storage:

  1. Shorter, loading with pandas only [requires pip install adlfs fsspec]
import pandas as pd

account_name = "my_account_stage_name"
account_key = "loooooooooooooooooooooong_acccccooooooooount_keeeeeeeeeeeeeeeeey$$$$***$$$$$$$$$$$$$$22222222"
connection_string = f"DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net"

pd.read_csv("abfs:///my_container_name/path/to/my/file/on/blob/file.csv", storage_options={"account_name": account_name, "connection_string": connection_string})
  2. Loading with pandas and azure [requires pip install azure-storage-blob]
from azure.storage.blob import BlobServiceClient
import pandas as pd

account_name = "my_account_stage_name"
account_key = "loooooooooooooooooooooong_acccccooooooooount_keeeeeeeeeeeeeeeeey$$$$***$$$$$$$$$$$$$$22222222"
connection_string = f"DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net"

# load file from blob
container_name = "my_container_name"
blob_name = "path/to/my/file/on/blob/file.csv"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob_name)

# load into memory, e.g. in a Jupyter notebook
pd.read_csv(blob_client.download_blob())

# save the file to disk, e.g. as a local file
local_file_name = "path/to/my/file/on/disk/file.csv"
with open(local_file_name, "wb") as my_blob_locally:
    download_stream = blob_client.download_blob()
    my_blob_locally.write(download_stream.readall())
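For the original question (several CSV blobs under one folder), the per-blob download can be wrapped in a loop and the pieces combined once with pd.concat; DataFrame.append was removed in pandas 2.0, so appending inside the loop no longer works there. This is a sketch under assumptions, not something run against a live account: read_csv_blobs is a hypothetical helper, and it only relies on the client offering list_blobs(name_starts_with=...) and download_blob(name).readall(), as ContainerClient does in azure-storage-blob v12.

```python
import io

import pandas as pd


def read_csv_blobs(container_client, prefix=""):
    """Download every .csv blob under `prefix` and return one DataFrame."""
    frames = []
    for blob in container_client.list_blobs(name_starts_with=prefix):
        if not blob.name.endswith(".csv"):
            continue  # skip anything that is not a CSV file
        raw = container_client.download_blob(blob.name).readall()  # bytes
        frames.append(pd.read_csv(io.BytesIO(raw)))
    # pd.concat raises ValueError on an empty list, so handle that case explicitly.
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With a real account, container_client would be the object returned by blob_service_client.get_container_client(container_name) in the snippet above, and the call would look like read_csv_blobs(container_client, "path/to/my/folder/").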

How to get the connection string

  • Go to Storage account -> Access keys -> Show, then copy the connection string
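The copied connection string is a semicolon-separated list of Key=Value pairs, the same shape as the f-string built in the snippets above. Here is a small sketch of splitting it back into its parts, for instance to pass the account name on its own. parse_connection_string is a hypothetical helper and the key below is fake. It splits each segment on the first = only, because account keys are base64 and usually end in = padding.

```python
def parse_connection_string(connection_string):
    """Split 'Key=Value;Key=Value' into a dict, keeping '=' inside values."""
    settings = {}
    for part in connection_string.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {part!r}")
        settings[key.strip()] = value
    return settings


# Fake credentials, for illustration only.
example = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=ZmFrZWtleQ==;EndpointSuffix=core.windows.net"
)
settings = parse_connection_string(example)
print(settings["AccountName"])  # myaccount
```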