Reading a csv from Google Cloud Storage into a pandas DataFrame

use*_*940 26 python csv pandas google-cloud-storage google-cloud-platform

I am trying to read a csv file from a Google Cloud Storage bucket into a pandas DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It fails with the following error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I have been unable to find any solution that does not involve Google Datalab.

Luk*_*ski 51

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

For the sake of completeness, I also mention three other options:

  • Hand-rolled code
  • gcsfs
  • Dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions for reading from Google Storage. To make them more readable, I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

It works equally well on public and private datasets, assuming you are authorised. With this approach you do not need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
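As the tip in the docstrings suggests, the credentials path does not have to be hard-coded; a minimal sketch of wiring it up through an environment variable (the variable name is only an example, taken from the docstring tip above):

import os
import pandas as pd

# Read the key-file path from an environment variable rather than
# hard-coding it (the variable name is only an example).
creds_path = os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path',
                           service_account_credentials_path=creds_path)
df = pd.read_csv(fileobj)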

gcsfs

gcsfs is a "Pythonic file system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
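If the default credential lookup is not enough, gcsfs also accepts an explicit token; a sketch, assuming a local service-account key file named credentials.json:

import pandas as pd
import gcsfs

# token may be a path to a service-account JSON key file,
# 'cloud' (machine metadata) or 'anon' (public buckets).
fs = gcsfs.GCSFileSystem(project='my-project', token='credentials.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)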

Dask

Dask "为分析提供高级并行性,为您喜爱的工具提供大规模性能".当您需要在Python中处理大量数据时,它非常棒.Dask尝试模仿大部分pandasAPI,使其易于用于新手.

It has its own read_csv.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
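Since Dask delegates gs:// access to gcsfs, credentials can be passed through storage_options; a minimal sketch, assuming a local key file:

import dask.dataframe as dd

# storage_options is forwarded to gcsfs; token may be a path to a
# service-account JSON key file.
df = dd.read_csv('gs://bucket/data.csv',
                 storage_options={'token': 'credentials.json'})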


Lak*_*Lak 16

Another option is to use TensorFlow, which can do a streaming read from Google Cloud Storage:

from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

Using TensorFlow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSVs into pandas

The following code reads all CSVs matching a particular pattern (e.g. gs://bucket/some/dir/train-*) into a pandas DataFrame:

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
  filenames = tf.gfile.Glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage

import os

DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))


Mar*_*hal 8

As of pandas 1.2, loading files from Google Storage into a DataFrame is very easy.

If you are working on your local machine, it looks like this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})

You pass the credentials.json file you obtained from Google as the token.

If you are working on Google Cloud, do the following instead:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})
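The same storage_options mechanism works for writing as well; a sketch of saving a DataFrame back to a bucket (the path is only an example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv accepts the same storage_options as read_csv (pandas >= 1.2).
df.to_csv('gcs://your-bucket/path/data.csv',
          storage_options={"token": "credentials.json"})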


bna*_*aul 6

As of pandas==0.24.0 this is supported natively, if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704

Until the official release, you can try it with pip install pandas==0.24.0rc1.
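In practice this just means installing gcsfs alongside pandas and passing a gs:// URL; a minimal sketch:

# pip install gcsfs
import pandas as pd

df = pd.read_csv('gs://bucket/your_path.csv')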


Lle*_*e.4 5

I was looking at this question and did not want to go through the hassle of installing another library, gcsfs, whose documentation literally says This software is beta, use at your own risk... but I found a great workaround that I wanted to post here in case it helps anyone else, using just the google.cloud storage library and some native Python libraries. Here is the function:

import pandas as pd
from google.cloud import storage
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'


def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df

Additionally, although it is beyond the scope of this question, if you would like to upload a pandas DataFrame to GCP using a similar function, here is the code to do so:

def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')

Hope this is helpful! I know I will definitely be using these functions.


Rau*_*aul 5

Using the pandas and google-cloud-storage Python packages:

First, we upload a file to the bucket in order to get a fully working example:

import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

data_df = pd.DataFrame(
    dataset.data,
    columns=dataset.feature_names)

data_df.head()
Out[1]: 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Upload the csv file to the bucket (this requires GCP credentials to be set up; read more about that here):

from io import StringIO
from google.cloud import storage

bucket_name = 'my-bucket-name' # Replace it with your own bucket name.
data_path = 'somepath/data.csv'

# Get Google Cloud client
client = storage.Client()

# Get bucket object
bucket = client.get_bucket(bucket_name)

# Get blob object (this is pointing to the data_path)
data_blob = bucket.blob(data_path)

# Upload a csv to google cloud storage
data_blob.upload_from_string(
    data_df.to_csv(), 'text/csv')

Now that we have a csv on the bucket, we can use pd.read_csv by passing it the file content.

# Read from bucket
data_str = data_blob.download_as_text()

# Instantiate the dataframe
data_downloaded_df = pd.read_csv(StringIO(data_str))

data_downloaded_df.head()
Out[2]: 
   Unnamed: 0  sepal length (cm)  ...  petal length (cm)  petal width (cm)
0           0                5.1  ...                1.4               0.2
1           1                4.9  ...                1.4               0.2
2           2                4.7  ...                1.3               0.2
3           3                4.6  ...                1.5               0.2
4           4                5.0  ...                1.4               0.2

[5 rows x 5 columns]

Comparing this approach with pd.read_csv('gs://my-bucket/file.csv'), I found that the approach described here makes it more explicit that client = storage.Client() is what takes care of authentication (which can be very handy when working with multiple credentials). Also, storage.Client comes fully installed if you run this code on a resource of Google Cloud Platform, whereas for pd.read_csv('gs://my-bucket/file.csv') to work you need to have installed the gcsfs package, which allows pandas to access Google Storage.
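To make the multiple-credentials point concrete, here is a sketch of instantiating the client with an explicit service-account key instead of the ambient default (the file name is just a placeholder):

from google.cloud import storage
from google.oauth2 import service_account

# Explicit credentials instead of the environment's default ones;
# handy when switching between several service accounts.
credentials = service_account.Credentials.from_service_account_file(
    'some-service-account.json')
client = storage.Client(project='my-project', credentials=credentials)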


Bur*_*lid 3

read_csv does not support gs://

From the documentation:

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file, or fetch it as a string, in order to manipulate it.
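A sketch of that download-then-parse approach with the google.cloud storage client (bucket and file names taken from the question):

from io import BytesIO
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')

# Fetch the blob contents as bytes and hand them to pandas.
data = blob.download_as_string()
df = pd.read_csv(BytesIO(data))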

  • The new version is 0.24.2 (3 upvotes)