Ell*_*ard 1 google-bigquery bigquery-standard-sql
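Suppose I want to use a JavaScript UDF to do some processing on a table with a nested structure (for example, the sample GitHub commits). Since I may want to change which fields the UDF looks at as I iterate on the implementation, I decided to just pass the entire row from the table to it. My UDF ends up looking like this: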
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(
input STRUCT<commit STRING, tree STRING, parent ARRAY<STRING>,
author STRUCT<name STRING, email STRING, ...>>)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
[UDF content here]
""";
I then call the function with a query such as:
SELECT GetCommitStats(t).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;
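The most cumbersome part of the UDF declaration is the input struct, because I have to include all of the nested fields and their types. Is there a better way to do this?
You can use TO_JSON_STRING to convert arbitrary structs and arrays to JSON, then parse the JSON into an object inside the UDF for further processing. For example,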
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";
SELECT GetCommitStats(TO_JSON_STRING(t)).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;
If you want to reduce the number of columns scanned, you can pass a struct of just the relevant columns to TO_JSON_STRING:
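Because the UDF now takes a plain JSON string, its body can be tried outside BigQuery while iterating. The following is a minimal sketch, assuming Node.js or any other JavaScript runtime; the wrapper name `getCommitStats` and the sample row are hypothetical, hand-written to resemble what TO_JSON_STRING(t) might produce, and are not real data from the table.

```javascript
// Body of the UDF from the answer, wrapped in a plain function
// so that it can run outside BigQuery.
function getCommitStats(json_str) {
  var row = JSON.parse(json_str);
  var result = new Object();
  result['parent'] = row.parent;
  result['author_name'] = row.author.name;
  result['diff_count'] = row.difference.length;
  return result;
}

// Hypothetical sample row (made-up values), serialized the way a
// row passed through TO_JSON_STRING would arrive: as one string.
var sampleJson = JSON.stringify({
  commit: 'abc123',
  parent: ['def456'],
  author: { name: 'Jane Doe', email: 'jane@example.com' },
  difference: [
    { old_path: 'a.txt', new_path: 'a.txt' },
    { old_path: 'b.txt', new_path: 'b.txt' }
  ]
});

var stats = getCommitStats(sampleJson);
console.log(stats);
```

Once the body behaves as expected locally, it can be pasted back between the triple quotes of the CREATE TEMP FUNCTION statement unchanged.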
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";
SELECT
GetCommitStats(TO_JSON_STRING(
STRUCT(parent, author, difference)
)).*
FROM `bigquery-public-data.github_repos.sample_commits`;
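One thing to watch for with the JSON approach is that the UDF body dereferences `row.author.name` and `row.difference.length` directly, so a null or missing field in the parsed object would throw. Whether that can actually happen for this table is an assumption I have not checked. A more defensive body might look like the sketch below; the function name `getCommitStatsSafe` is hypothetical, and the fallbacks (empty array, null name, zero count) are one possible choice rather than the only reasonable one.

```javascript
// Defensive variant of the UDF body (a sketch, not the original answer's code).
// It tolerates null or missing parent, author, and difference fields.
function getCommitStatsSafe(json_str) {
  // JSON.parse('null') returns null, so fall back to an empty object.
  var row = JSON.parse(json_str) || {};
  var author = row.author || {};
  var difference = Array.isArray(row.difference) ? row.difference : [];
  return {
    parent: Array.isArray(row.parent) ? row.parent : [],
    // == null matches both null and undefined.
    author_name: author.name == null ? null : author.name,
    diff_count: difference.length
  };
}

// Hypothetical input in which every nested field is null.
var emptyStats = getCommitStatsSafe(
  '{"parent":null,"author":null,"difference":null}'
);
console.log(emptyStats);
```

The statements inside the function could be used as the UDF body in place of the original ones, since the returned object has the same three fields as the declared RETURNS struct.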