How can I pass a table row to a UDF without specifying the full type?

Ell*_*ard 1 google-bigquery bigquery-standard-sql

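Suppose I want to use a JavaScript UDF to do some processing over a table with a nested structure (for example, the sample GitHub commits table). While iterating on the implementation, I may want to change which fields the UDF looks at, so I decide to pass the entire row from the table to it. My UDF ends up looking like this: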

#standardSQL
CREATE TEMP FUNCTION GetCommitStats(
  input STRUCT<commit STRING, tree STRING, parent ARRAY<STRING>,
               author STRUCT<name STRING, email STRING, ...>>)
  RETURNS STRUCT<
    parent ARRAY<STRING>,
    author_name STRING,
    diff_count INT64>
  LANGUAGE js AS """
[UDF content here]
""";

I then call the function with a query such as:

SELECT GetCommitStats(t).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;

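The most cumbersome part of the UDF declaration is the input struct, since I have to include every nested field and its type. Is there a better way?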

Ell*_*ard 5

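You can use TO_JSON_STRING to convert arbitrary structs and arrays to JSON, and then parse the string into an object inside the UDF for further processing. For example: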

#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
  RETURNS STRUCT<
    parent ARRAY<STRING>,
    author_name STRING,
    diff_count INT64>
  LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";

SELECT GetCommitStats(TO_JSON_STRING(t)).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;
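
Because the UDF now takes a plain JSON string, its body can also be tried outside BigQuery while you iterate. The following is a minimal sketch, assuming a local JavaScript runtime such as Node.js. The function name `getCommitStats`, the sample row, and the null checks are illustrative additions and not part of the query above.

```javascript
// Standalone version of the UDF body, so the logic can be exercised locally.
// The null checks below are an added assumption for rows with missing fields.
function getCommitStats(jsonStr) {
  // The JSON object's keys are the column names of the serialized row.
  const row = JSON.parse(jsonStr);
  const author = row.author || {};          // a NULL struct is serialized as null
  const difference = row.difference || [];  // guard against a missing array
  return {
    parent: row.parent || [],
    author_name: author.name === undefined ? null : author.name,
    diff_count: difference.length,
  };
}

// Hypothetical sample row containing only the fields the function reads.
const sample = JSON.stringify({
  parent: ["abc123"],
  author: { name: "Jane Doe", email: "jane@example.com" },
  difference: [{ old_path: "a.txt" }, { old_path: "b.txt" }],
});

console.log(getCommitStats(sample));
```

Once the logic behaves as expected, the body of the function can be pasted back into the `LANGUAGE js` block, with `json_str` as the argument name.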

If you want to reduce the number of columns scanned, you can pass a struct containing only the relevant columns to TO_JSON_STRING:

#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
  RETURNS STRUCT<
    parent ARRAY<STRING>,
    author_name STRING,
    diff_count INT64>
  LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";

SELECT
  GetCommitStats(TO_JSON_STRING(
    STRUCT(parent, author, difference)
  )).*
FROM `bigquery-public-data.github_repos.sample_commits`;