Reading n lines from a csv in Google Cloud Storage for use with the Python csv module

Question

2 csv python-3.x google-cloud-storage google-cloud-platform

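I have a variety of very large (~4GB each) csv files in different formats. They come from data loggers made by more than 10 different manufacturers, and I am trying to consolidate all of them into BigQuery. To load the files daily, I want to first upload them to Cloud Storage, determine the schema, and then load them into BigQuery. Because some files have extra header information (anywhere from 2 to ~30 lines), I wrote my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files into BQ.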

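This works fine when I process files from local storage straight to BQ, because I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent way to use a context manager directly on a file in Storage. I do not want to bypass Cloud Storage, in case any of these files is interrupted while loading into BQ.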

This is what I have working for local files:

import csv

# csv_file, encoding and chunk_size are initialised elsewhere
sample_rows = []
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    datafile.seek(0)  # rewind so the reader starts at the first line of the file
    reader = csv.reader(datafile, dialect)
    for row_num, row in enumerate(reader):
        sample_rows.append(row)
        if row_num >= 100:  # stop once roughly 100 rows have been sampled
            break
# Carry out schema and header investigation...
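
The header-detection function itself is not shown in the question. As a rough illustration only, one hypothetical heuristic is to treat the first sampled row whose field count matches the most common field count as the header. The name `guess_header_index` and the sample data are assumptions made for this sketch, not the asker's code.

```python
from collections import Counter


def guess_header_index(sample_rows):
    """Return the index of the first row whose width is the most common width.

    Hypothetical heuristic, not the asker's function: preamble lines in
    logger files tend to have fewer fields than the header and data rows.
    """
    widths = Counter(len(row) for row in sample_rows if row)
    if not widths:
        return None
    common_width, _ = widths.most_common(1)[0]
    for index, row in enumerate(sample_rows):
        if len(row) == common_width:
            return index
    return None


# Made-up sample: two preamble lines, then a header and two data rows.
sample = [
    ["Logger model X"],
    ["Exported", "2020-01-01"],
    ["time", "temp", "status"],
    ["1", "20.5", "ok"],
    ["2", "21.0", "ok"],
]
header_index = guess_header_index(sample)
```

An index found this way could then inform how many leading rows to skip when configuring the load job. Real logger files may well need a stricter check, such as comparing field types between the candidate header and the rows below it.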

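With Google Cloud Storage, I have tried download_as_string and download_to_file, which give a binary object representation of the data, but I cannot get the csv module to work with either. I tried .decode('utf-8'), which returns one very long string containing \r\n. I then used splitlines() to get a list of lines, but the csv functions still produce a dialect and a reader that split the data into single characters, one per entry.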
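
One way this single-character symptom can arise is passing a plain str to csv.reader: the reader iterates over whatever it is given, and iterating over a string yields one character at a time. The snippet below is a small sketch with made-up data, comparing that with a file-like wrapper. It may not be the exact cause in the asker's code.

```python
import csv
import io

text = "time,value\r\n1,2\r\n"

# Iterating over a str yields single characters, so each one is parsed as a "line".
rows_from_string = list(csv.reader(text))

# A file-like object yields whole lines; newline="" leaves \r\n handling to the csv module.
rows_from_stream = list(csv.reader(io.StringIO(text, newline="")))
```

Here the first entry of `rows_from_string` holds only the first character of the header, while `rows_from_stream` holds one list per csv row.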

Has anyone managed to use the csv module with a file stored in Cloud Storage without downloading the whole file?

Answer 1

小智 7

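After looking at the csv source code on GitHub, I managed to solve this using Python's io module together with the csv module. io.BytesIO and io.TextIOWrapper are the two key classes. It is probably not a common use case, but I thought I would post the answer here to save some time for anyone who needs it.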

import csv
import io

# Set up the storage client and create a blob object for the csv file you want to read from GCS
# (blob, encoding and newline are assumed to be defined already).
# Read only a chunk of bytes, large enough to cover all the header lines and some recorded data.
# Newer versions of the client library also offer download_as_bytes for the same purpose.
content = blob.download_as_string(start=0, end=10240)
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding=encoding, newline=newline)
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)  # rewind so the reader starts from the first line
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object
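
A byte range usually ends partway through a row, so the last row from the reader may be incomplete. The sketch below shows one way to handle that, using in-memory bytes in place of the ranged download so it runs without GCS. The helper name `sample_rows_from_chunk` and the sample data are hypothetical, and trimming at the last newline byte assumes an ASCII-compatible encoding such as UTF-8.

```python
import csv
import io


def sample_rows_from_chunk(content, encoding="utf-8", max_rows=100):
    """Parse up to max_rows csv rows from the first bytes of a file.

    Hypothetical helper: `content` stands in for the bytes returned by the
    ranged blob download. Assumes an ASCII-compatible encoding such as UTF-8.
    """
    # Keep only complete lines, since the chunk may be cut off mid-row.
    last_newline = content.rfind(b"\n")
    if last_newline != -1:
        content = content[: last_newline + 1]

    text = io.TextIOWrapper(io.BytesIO(content), encoding=encoding, newline="")
    dialect = csv.Sniffer().sniff(text.read())
    text.seek(0)  # rewind after sniffing

    rows = []
    for row in csv.reader(text, dialect):
        rows.append(row)
        if len(rows) >= max_rows:
            break
    return rows


# Fake "download": semicolon-delimited data cut off in the middle of the last row.
chunk = b"time;temp;status\r\n1;20.5;ok\r\n2;21.0;ok\r\n3;21"
rows = sample_rows_from_chunk(chunk)
```

Dropping the partial tail before decoding also avoids a decode error when the range happens to end inside a multi-byte character.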

Archived:	6 years, 4 months ago
Viewed:	1145 times
Last activity:	6 years, 4 months ago