Exporting data from BigQuery to a Jupyter Notebook takes too long

Question

Max*_*fnv 5 python dataframe google-bigquery jupyter jupyter-notebook

In a Jupyter Notebook, I am trying to import data from BigQuery by running a SQL-like query on the BigQuery server. I then store the data in a dataframe:

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
from google.cloud import bigquery

sql = """
SELECT * FROM dataset.table
"""
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()

The data has shape (6000000, 8) and uses about 350MB of memory once it is stored in the dataframe.
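
As a rough plausibility check of that figure, the sketch below assumes every one of the 8 columns holds an 8-byte value (for example int64 or float64). That is an assumption, since the question does not state the column types.

```python
# Rough estimate of the DataFrame's memory footprint.
ROWS = 6_000_000
COLS = 8
BYTES_PER_VALUE = 8  # assumption: int64/float64 columns; not stated in the question

total_bytes = ROWS * COLS * BYTES_PER_VALUE
total_mib = total_bytes / 2**20

print(f"{total_bytes:,} bytes = {total_mib:.0f} MiB")
```

Under that assumption the estimate lands in the same range as the reported size, so the dataset itself is modest.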

If the sql query is executed directly in BQ, it takes about 2 seconds.

However, the code above usually takes about 30-40 minutes to run, and more often than not it fails with the following error:

ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))

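In summary, there are three possible causes of the problem:

The BigQuery server takes a long time to execute the query
Transferring the data takes a long time (I don't understand why a 350MB file would take 30 minutes to send over the network. I tried using a LAN connection to rule out connection drops and maximize throughput, but that didn't help)
Building the dataframe from the BigQuery data takes a long time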
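
To put the second hypothesis into numbers, here is a small sketch that computes the average transfer rate implied by the figures in the question. It assumes the 350MB in-memory size is close to the amount of data sent over the network, which may not hold, since the wire format can be larger. The function name `effective_throughput_mbit_s` is made up for this illustration.

```python
def effective_throughput_mbit_s(size_mb: float, minutes: float) -> float:
    """Return the average transfer rate in megabits per second."""
    seconds = minutes * 60
    return size_mb * 8 / seconds  # 8 bits per byte

# Figures taken from the question: about 350 MB in 30 to 40 minutes.
for minutes in (30, 40):
    rate = effective_throughput_mbit_s(350, minutes)
    print(f"{minutes} min -> {rate:.2f} Mbit/s")
```

A rate in the low single digits of Mbit/s is far below what a LAN link normally provides, so under these assumptions raw bandwidth alone is unlikely to explain the delay.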

I would appreciate any insight into the problem. Thanks in advance!

Answer 1

San*_*ord 16

Use the BigQuery Storage API to get large query results from BigQuery into a pandas dataframe quickly.

Working code snippet:

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage

# uncomment this part if you are working locally
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="your_json_key.json"

# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id,)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)

# define your query
your_query = """select * from your_big_query_table"""

# Pass your bqstorage_client as an argument to the to_dataframe() method.
# I've also added the tqdm progress bar here so you get better insight
# into how much longer it is going to take.
dataframe = (
    bqclient.query(your_query)
            .result()
            .to_dataframe(
                bqstorage_client=bqstorageclient,
                progress_bar_type='tqdm_notebook',)
)

You can find more information on how to use BigQuery Storage here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas

Answer 2

小智 0

The WSAETIMEDOUT error means that the connected party did not properly respond after a period of time. You need to check your firewall.

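Regarding your three points, in order:

According to your test, the query takes 2 seconds
Check your firewall
Since your data has shape (6000000, 8), this will take time, depending on the compute resources you are using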

That said, you are probably hitting the connection timeout because building the multidimensional array takes too long.

You can separate the query from the dataframe step and print the time around each, to get a better idea of what is happening.

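    import datetime

    print(datetime.datetime.now())
    result = client.query(sql).result()  # wait here until the query has finished
    print(datetime.datetime.now())
    df_bq = result.to_dataframe()  # download the rows and build the dataframe
    print(datetime.datetime.now())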
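
The same measurement can be wrapped in a small helper that reports elapsed seconds per step. This is only a sketch: `timed` is a hypothetical helper name, not part of the BigQuery client library, and the lambdas below are stand-in workloads so the example runs without a BigQuery connection.

```python
import time

def timed(label, fn):
    """Call fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed

# Stand-in workloads; replace them with the real calls when connected.
rows, query_seconds = timed("query", lambda: list(range(1000)))
total, build_seconds = timed("build", lambda: sum(rows))
```

With a real client, the two steps would be something like `timed("query", lambda: client.query(sql).result())` followed by `timed("dataframe", lambda: rows.to_dataframe())`, which shows directly which phase dominates.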

Archived:	7 years, 2 months ago
Views:	387
Last recorded:	6 years, 4 months ago