BigQuery: flatten JSON text into an array of structs/records

xia*_*ong 6 sql json user-defined-functions google-bigquery

I have a raw table with two columns:

id (INT64)  |  content (STRING)
------------|--------------------
1           | {"photos": [{"location": {"lat": 111, "lon": 222}, "ts": "2019-12-16", "uri": "aaa"}, {"location": {"lat": 333, "lon": 444}, "ts": "2019-12-17", "uri": "bbb"}]}
------------|--------------------
2           | ....

The first column is an integer id, and the second is a JSON-formatted string. A sample JSON value looks like this:

{
  "photos": [
    {
      "location": {
        "lat": 111, 
        "lon": 222
      }, 
      "ts": "2019-12-16", 
      "uri": "aaa"
    }, 
    {
      "location": {
        "lat": 333, 
        "lon": 444
      }, 
      "ts": "2019-12-17", 
      "uri": "bbb"
    }
  ]
}


Question

How can I format the photos in the raw table as an array of structs/records, i.e. produce a result like the following?

id     |  photos.ts    | photos.uri  |  photos.location.lat  | photos.location.lon
-------|---------------|-------------|-----------------------|--------------------
1      |  2019-12-16   | aaa         |                   111 |                222
       |  2019-12-17   | bbb         |                   333 |                444
-------|---------------|-------------|-----------------------|--------------------
2      | ...           | ...         |                   ... |                ...

Thoughts

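  1. JSON_EXTRACT(content, "$.photos") seems like a good start, since it gives me the JSON array of objects (as a string); I would then need some JS UDF to format the result into the BQ STRUCT/RECORD type. I'm not sure exactly how to do that, though. Any help is appreciated!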
  2. I'm not sure whether this "cleaning" into STRUCT/RECORD is really necessary or worthwhile. It seems I could instead format photos as an array of STRING:

id (INT64)  |  photos (STRING)
------------|--------------------
1           | {"location": {"lat": 111, "lon": 222}, "ts": "2019-12-16", "uri": "aaa"}
            | {"location": {"lat": 333, "lon": 444}, "ts": "2019-12-17", "uri": "bbb"}
------------|--------------------
2           | ....

and then use JSON_EXTRACT/JSON_EXTRACT_SCALAR in my analysis queries. How much of a performance sacrifice should I expect?

Thanks!

Mik*_*ant 12

The example below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
"""; 
WITH `project.dataset.table` AS (
  SELECT 1 id, '{"photos": [{"location": {"lat": 111, "lon": 222}, "ts": "2019-12-16", "uri": "aaa"}, {"location": {"lat": 333, "lon": 444}, "ts": "2019-12-17", "uri": "bbb"}]}' content
)
SELECT id, json2array(JSON_EXTRACT(content, "$.photos")) AS photos
FROM `project.dataset.table`

with output

Row id  photos   
1   1   {"location":{"lat":111,"lon":222},"ts":"2019-12-16","uri":"aaa"}     
        {"location":{"lat":333,"lon":444},"ts":"2019-12-17","uri":"bbb"}     

Or ... you can take it a step further, as below

#standardSQL
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
"""; 
WITH `project.dataset.table` AS (
  SELECT 1 id, '{"photos": [{"location": {"lat": 111, "lon": 222}, "ts": "2019-12-16", "uri": "aaa"}, {"location": {"lat": 333, "lon": 444}, "ts": "2019-12-17", "uri": "bbb"}]}' content
)
SELECT id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(photo, "$.ts") AS ts,
      JSON_EXTRACT_SCALAR(photo, "$.uri") AS uri,
      STRUCT(JSON_EXTRACT(photo, "$.location.lat") AS lat, JSON_EXTRACT(photo, "$.location.lon") AS lon) AS location
    FROM UNNEST(json2array(JSON_EXTRACT(content, "$.photos"))) AS photo
  ) AS photos
FROM `project.dataset.table`

which returns

Row id  photos.ts       photos.uri  photos.location.lat photos.location.lon  
1   1   2019-12-16      aaa         111                 222  
        2019-12-17      bbb         333                 444  
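
As a possible variation on the queries above, here is a minimal sketch that skips the JS UDF by using the built-in JSON_EXTRACT_ARRAY function, assuming it is available in your BigQuery environment. It also casts lat and lon to FLOAT64 so the struct fields are numeric instead of STRING. The choice of FLOAT64 is an assumption about the real data (the sample only shows integers), and the CTE simply reuses the sample row from the question.

```sql
#standardSQL
-- Sketch only: same output shape as above, without the json2array UDF.
WITH `project.dataset.table` AS (
  SELECT 1 id, '{"photos": [{"location": {"lat": 111, "lon": 222}, "ts": "2019-12-16", "uri": "aaa"}, {"location": {"lat": 333, "lon": 444}, "ts": "2019-12-17", "uri": "bbb"}]}' content
)
SELECT id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(photo, "$.ts") AS ts,
      JSON_EXTRACT_SCALAR(photo, "$.uri") AS uri,
      STRUCT(
        -- Assumption: coordinates should be FLOAT64; adjust the type to your data.
        CAST(JSON_EXTRACT_SCALAR(photo, "$.location.lat") AS FLOAT64) AS lat,
        CAST(JSON_EXTRACT_SCALAR(photo, "$.location.lon") AS FLOAT64) AS lon
      ) AS location
    -- JSON_EXTRACT_ARRAY returns ARRAY<STRING>, one JSON string per photo.
    FROM UNNEST(JSON_EXTRACT_ARRAY(content, "$.photos")) AS photo
  ) AS photos
FROM `project.dataset.table`
```

JSON_EXTRACT and JSON_EXTRACT_SCALAR both return STRING values, so without the CAST the lat and lon fields stay strings. If some rows might hold non-numeric values, SAFE_CAST could be used in place of CAST to get NULL instead of an error.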