How to list the sizes of all tables in a project

永川圭*_*川圭介 3 google-bigquery

Is there a way to list the sizes of all tables in BigQuery?

I know about a query like this:

select 
  table_id,
  sum(size_bytes)/pow(10,9) as size
from
  certain_dataset.__TABLES__
group by 
  1

But I want to see every table in every dataset.

Thanks

Ste*_*t_R 8

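The introduction of dynamic SQL to BigQuery scripting in 2020 made this somewhat easier. We can now build a query dynamically and run it with EXECUTE IMMEDIATE.

For the common case where all datasets are located in region-us, something like this will do: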

DECLARE dataset_names ARRAY<STRING>;

SET dataset_names = (
    SELECT ARRAY_AGG(SCHEMA_NAME) FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

EXECUTE IMMEDIATE (
    SELECT STRING_AGG(
        (SELECT """
            SELECT project_id, dataset_id, table_id, row_count, size_bytes 
            FROM `""" || s || 
            """.__TABLES__`"""), 
            " UNION ALL ")
    FROM UNNEST(dataset_names) AS s);
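For anyone who would rather assemble the same query on the client side, here is a minimal Python sketch of the string building that the STRING_AGG call above performs. The helper name build_tables_query is hypothetical, and the name check assumes dataset IDs contain only letters, digits, and underscores; it only builds the text and does not talk to BigQuery.

```python
import re

# Assumption: dataset IDs consist only of letters, digits, and underscores.
_DATASET_ID = re.compile(r"[A-Za-z0-9_]+")


def build_tables_query(dataset_names):
    """Return one UNION ALL query over each dataset's __TABLES__ view.

    Hypothetical helper for illustration; not part of any BigQuery library.
    """
    if not dataset_names:
        raise ValueError("need at least one dataset name")
    parts = []
    for name in dataset_names:
        # Reject anything that could break out of the backtick-quoted name.
        if not _DATASET_ID.fullmatch(name):
            raise ValueError(f"unexpected dataset name: {name!r}")
        parts.append(
            "SELECT project_id, dataset_id, table_id, row_count, size_bytes "
            f"FROM `{name}.__TABLES__`"
        )
    return "\nUNION ALL\n".join(parts)


# Made-up dataset names, purely for demonstration.
print(build_tables_query(["sales", "logs"]))
```

The resulting string could then be passed to whatever runs your queries, for example client.query(...) from the Python client shown further down.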

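If there are a large number of datasets, reading all of the metadata at once may return a rate limit error.

If that happens, we can fall back on a "batched" approach. It is a bit harder to read and slower/less efficient, but it still gets the job done: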

DECLARE dataset_names ARRAY<STRING>;
DECLARE batch ARRAY<STRING>;
DECLARE batch_size INT64 DEFAULT 25;

CREATE TEMP TABLE results (
    project_id STRING,
    dataset_id STRING,
    table_id STRING,
    row_count INT64,
    size_bytes INT64
);

SET dataset_names = (
        SELECT ARRAY_AGG(SCHEMA_NAME) 
        FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
    );

LOOP
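    -- ARRAY_AGG over zero rows yields NULL, and ARRAY_LENGTH(NULL) is NULL,
    -- so treat a NULL array as empty to make sure the loop terminates.
    IF IFNULL(ARRAY_LENGTH(dataset_names), 0) < 1 THEN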
        LEAVE;
    END IF;

    SET batch = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d WITH OFFSET i 
        WHERE i < batch_size);

    EXECUTE IMMEDIATE (
        SELECT """INSERT INTO results """ 
            || STRING_AGG(
                    (SELECT """
                        SELECT project_id, dataset_id, table_id, row_count, size_bytes 
                        FROM `""" || s || """.__TABLES__`"""), 
                " UNION ALL ")
        FROM UNNEST(batch) AS s);

    SET dataset_names = (
        SELECT ARRAY_AGG(d) 
        FROM UNNEST(dataset_names) AS d
        WHERE d NOT IN (SELECT * FROM UNNEST(batch)));
        
END LOOP; 

SELECT * FROM results;
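The batching idea itself is independent of SQL. As a rough illustration, the sketch below splits a list of dataset names into groups of at most 25, the same batch size used in the script above. The helper name batches and the dataset names are made up; each batch would still need to be turned into a query and executed separately.

```python
def batches(items, batch_size=25):
    """Yield consecutive slices of at most batch_size items.

    Hypothetical helper for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up dataset names, purely for demonstration.
dataset_names = [f"dataset_{i}" for i in range(60)]

for batch in batches(dataset_names):
    # Each batch is what one INSERT ... UNION ALL statement would cover.
    print(len(batch), batch[0], batch[-1])
```

With 60 names this produces three batches of 25, 25, and 10. Slicing by index is used here instead of the NOT IN filter from the SQL version, simply because it is the more natural way to walk a Python list.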


Ale*_*lme 6

There is currently no way to do this in a single query, but you can do it with a script. Here is my Python script that prints the list:

from google.cloud import bigquery

client = bigquery.Client()

datasets = list(client.list_datasets())
project = client.project

if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))

        query_job = client.query("select table_id, sum(size_bytes)/pow(10,9) as size from `"+dataset.dataset_id+"`.__TABLES__ group by 1")

        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))

else:
    print('{} project does not contain any datasets.'.format(project))

  • Oh, I see how it works. You create a .json file and reference it like this: client = bigquery.Client.from_service_account_json('C:/your_path_here.123456789.json') (2 upvotes)
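The script in this answer prints sizes one dataset at a time. If a single list ranked across all datasets is wanted, the collected rows could be sorted once at the end. This is only a sketch over plain tuples: rank_by_size is a hypothetical helper, the sample rows are invented, and the GB figure uses the same 10^9 divisor as the query in the question.

```python
def rank_by_size(rows):
    """Sort (dataset_id, table_id, size_bytes) tuples, largest first.

    Returns tuples extended with the size in GB (bytes / 10**9).
    Hypothetical helper for illustration only.
    """
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    return [(d, t, b, b / 10**9) for d, t, b in ranked]


# Invented sample rows standing in for values collected from __TABLES__.
sample = [
    ("sales", "orders", 2_500_000_000),
    ("logs", "raw", 40_000_000_000),
    ("sales", "tiny", 1_000),
]

for dataset_id, table_id, size_bytes, size_gb in rank_by_size(sample):
    print(f"{dataset_id}.{table_id}: {size_gb:.3f} GB")
```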