如何编写查询以从 BigQuery 中的 python 字典插入数组值？

Question

如何编写查询以从 BigQuery 中的 python 字典插入数组值？

我有一个 python 字典，如下所示：

{
'id': 123, 
'categories': [
    {'category': 'fruit', 'values': ['apple', 'banana']}, 
    {'category': 'animal', 'values': ['cat']},
    {'category': 'plant', 'values': []}
  ]
}

Run Code Online (Sandbox Code Playgroud)

我正在尝试使用 python 通过 API 将这些值插入到大查询中的表中，我只需要将上述内容格式化为“INSERT table VALUES”查询。该表需要具有以下字段：id, categories.category, categories.values。

我需要类别基本上是一个包含类别和每个类别对应值的数组。该表最终应该看起来像这样 - 除了我需要它每个 id 只有一行，嵌套相应的类别字段并具有正确的字段名称：

SELECT 123 as id, (["fruit"], ["apple", "banana"]) as category
UNION ALL (SELECT 123 as id, (["animal"], ["cat"]) as category)
UNION ALL (SELECT 123 as id, (["plant"], ["tree", "bush", "rose"]) as category)

Run Code Online (Sandbox Code Playgroud)

我不太确定如何格式化“INSERT”查询以获得所需的结果，任何人都可以帮忙吗？

Answer 1

Ale*_*aes 11

如果您想使用 Python 将字典加载到 BigQuery，则必须首先准备数据。我选择将 Python 词典转换为.json文件，然后使用 Python API 将其加载到 BigQuery。但是，根据文档，BigQuery 在加载 .json 嵌套数据方面有一些限制，其中：

您的 .json 必须是新行分隔，这意味着每个对象必须位于文件中的新行中
BigQuery 不支持 Json 格式的地图或字典。因此，为了做到这一点，您必须将整个数据包装在 [] 中，正如您在此处看到的那样。

因此，应该在文件中进行一些修改，以便您可以将创建的 .json 文件加载到 BiguQuery。我创建了两个脚本，其中：第一个将 Python 字典转换为 JSON 文件，第二个将 JSON 文件格式化为换行符分隔的 json，然后加载到 BigQuery 中。

将 python 字典转换为 .json 文件。请注意，您必须将整个数据包装在 [] 之间：

import json
from google.cloud import bigquery


py_dict =[{
'id': 123, 
'categories': [
    {'category': 'fruit', 'values': ['apple', 'banana']}, 
    {'category': 'animal', 'values': ['cat']},
    {'category': 'plant', 'values': []}
  ]
}]

json_data = json.dumps(py_dict, sort_keys=True)
out_file =  open('json_data.json','w+')
json.dump(py_dict,out_file)

Run Code Online (Sandbox Code Playgroud)

其次，将 json 转换为新行分隔的 json 并加载到 BigQuery：

import json
from google.cloud import bigquery

with open("json_data.json", "r") as read_file:
    data = json.load(read_file)
result = [json.dumps(record) for record in data]
with open('nd-proceesed.json', 'w') as obj:
    for i in result:
        obj.write(i+'\n')
       

client = bigquery.Client()
filename = '/path/to/file.csv'
dataset_id = 'sample'
table_id = 'json_mytable'

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open("nd-proceesed.json", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for table load to complete.

print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))

Run Code Online (Sandbox Code Playgroud)

然后，在 BigQuery UI 中，您可以按如下方式查询表：

SELECT id, categories
FROM `test-proj-261014.sample.json_mytable4` , unnest(categories) as categories

Run Code Online (Sandbox Code Playgroud)

和输出：

归档时间：	5 年，6 月前
查看次数：	7325 次
最近记录：	3 年，6 月前