How do I schedule an export from a BigQuery table to Cloud Storage?

bra*_*les 3 google-cloud-storage google-bigquery google-cloud-platform

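I have successfully scheduled a query in BigQuery, with the results saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything about scheduling an export from a BigQuery table to Cloud Storage.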

Is it possible to schedule an export of a BigQuery table to Cloud Storage, so that I can then schedule having it sent to me over SFTP via the Google BigQuery Data Transfer Service?

Max*_*xim 5

There is currently no managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.

The Cloud Function would contain the code that exports the BigQuery table to Cloud Storage. You can write it in any of several languages, such as Python, Node.js, or Go.

Cloud Scheduler would periodically send an HTTP request to the Cloud Function, on a schedule written in cron format. Each request triggers the function, which in turn runs the export programmatically.

As an example and more specifically, you can follow these steps:

  1. Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:

        # Imports the BigQuery client library
        from google.cloud import bigquery
    
        def hello_world(request):
            # Replace these values according to your project
            project_name = "YOUR_PROJECT_ID" 
            bucket_name = "YOUR_BUCKET" 
            dataset_name = "YOUR_DATASET" 
            table_name = "YOUR_TABLE" 
            destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")
    
            bq_client = bigquery.Client(project=project_name)
    
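            # Build the table reference directly; Client.dataset() is deprecated
            # in recent releases of google-cloud-bigquery.
            dataset = bigquery.DatasetReference(project_name, dataset_name)
            table_to_export = dataset.table(table_name)
    
            # Compress the exported CSV with gzip, matching the .csv.gz file name
            job_config = bigquery.ExtractJobConfig()
            job_config.compression = bigquery.Compression.GZIP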
    
            extract_job = bq_client.extract_table(
                table_to_export,
                destination_uri,
                # Location must match that of the source table.
                location="US",
                job_config=job_config,
            )  
            return "Job with ID {} started exporting data from {}.{} to {}".format(extract_job.job_id, dataset_name, table_name, destination_uri)
    

    Specify the client library dependency in the requirements.txt file by adding this line:

    google-cloud-bigquery
    
  2. Create a Cloud Scheduler job. Set the Frequency at which the job should run. For instance, 0 1 * * 0 runs the job once a week, at 1 AM every Sunday. The crontab tool is useful for experimenting with cron schedules.

    Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.

    Once the job is created, you can test how the export behaves by pressing the RUN NOW button. Before doing so, however, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role; otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.

    If you wish to export different tables, datasets and buckets on each execution while still using the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing those parameters as data, which is passed on to the Cloud Function. That would, however, require some small changes to its code.
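
As a rough sketch of the POST variant mentioned at the end of step 2, the function below reads the export parameters from a JSON body and falls back to defaults for anything omitted. The field names (project, dataset, table, bucket, location), the DEFAULTS dictionary, the helper parse_export_params and the "<table>.csv.gz" object name are all assumptions made for illustration; they are not part of the original answer or of any Google API.

```python
# Hypothetical defaults, used when the POST body omits a field.
DEFAULTS = {
    "project": "YOUR_PROJECT_ID",
    "dataset": "YOUR_DATASET",
    "table": "YOUR_TABLE",
    "bucket": "YOUR_BUCKET",
    "location": "US",
}


def parse_export_params(payload):
    """Merge a decoded JSON body (a dict, or None if absent) with DEFAULTS."""
    if payload is None:
        payload = {}
    if not isinstance(payload, dict):
        raise ValueError("Body must be a JSON object")
    unknown = set(payload) - set(DEFAULTS)
    if unknown:
        raise ValueError("Unexpected fields: {}".format(sorted(unknown)))
    params = dict(DEFAULTS)
    params.update(payload)
    return params


def export_table(request):
    """HTTP entry point; `request` is the Flask request Cloud Functions passes in."""
    # Imported lazily so the parsing helper can be exercised without the library.
    from google.cloud import bigquery

    try:
        params = parse_export_params(request.get_json(silent=True))
    except ValueError as exc:
        return str(exc), 400

    # extract_table() also accepts the source as a "project.dataset.table" string.
    source = "{project}.{dataset}.{table}".format(**params)
    # Assumed naming scheme: one gzip-compressed CSV object named after the table.
    destination_uri = "gs://{}/{}.csv.gz".format(params["bucket"], params["table"])

    job_config = bigquery.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    client = bigquery.Client(project=params["project"])
    extract_job = client.extract_table(
        source,
        destination_uri,
        # Location must match that of the source table.
        location=params["location"],
        job_config=job_config,
    )
    return "Job with ID {} started exporting {} to {}".format(
        extract_job.job_id, source, destination_uri
    )
```

With this in place, a Cloud Scheduler job using POST and a Body such as {"table": "orders", "bucket": "my-exports"} would export that table, while an empty body would fall back to the defaults. Rejecting unknown fields with a 400 response is a design choice meant to surface typos in the scheduler configuration early.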

Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
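
If you would rather check the job from Python than from the bq CLI, something along these lines should work. It is only a sketch: export_status is a hypothetical helper name, and it assumes you pass in a bigquery.Client plus the same location that was used for the export.

```python
def export_status(client, job_id, location="US"):
    """Return a short, human-readable status line for a BigQuery job."""
    # Client.get_job() fetches the job resource by ID; the location has to
    # match the one the job was started in.
    job = client.get_job(job_id, location=location)
    if job.state != "DONE":
        # Jobs report PENDING or RUNNING until they finish.
        return "{} is still {}".format(job_id, job.state)
    if job.error_result:
        # error_result is a dict describing the failure, or None on success.
        return "{} failed: {}".format(job_id, job.error_result.get("message"))
    return "{} finished successfully".format(job_id)
```

For example, after from google.cloud import bigquery, calling export_status(bigquery.Client(), "<job_id>") reports roughly what you would look for in the output of bq show -j <job_id>.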