Splitting a field in BigQuery

Hir*_*ess 3 google-bigquery

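I've searched around and can't find much on this topic (possibly bad search terms :). I have a table with a protoPayload.resource field that receives Apache log information. The field I'm interested in therefore holds multiple values that I need to search against. It is formatted like a PHP URL query string, e.g.: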

/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN

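As a result, every search ends up needing a very long regular expression to pull out the data, followed by join statements to combine the results.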

An example search that combines mac/win statistics:

SELECT
  t1.date, t1.wincount, COALESCE(t2.maccount, 0) AS maccount
FROM (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS wincount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'ver=(11|12)'))
    AND protoPayload.resource CONTAINS 'os=win' GROUP BY date ) t1
LEFT JOIN (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS maccount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'cv=[pm](14|15|16|17)'))
    AND protoPayload.resource CONTAINS 'os=mac' GROUP BY date ) t2
ON
  t1.date = t2.date
ORDER BY t1.date

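What I'm considering is to use similar regex searches, create a new table, and save the data into it with relational fields. Then I would fix future logging so that it writes to the table properly.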

My question: is this a workable solution, or is there an easier way to do this in Google BigQuery? Is there a better way to transform the data? Thanks again for any input!

Ell*_*ard 5

You can use a SQL function to parse the key-value pairs into an array, which is generally faster than using JavaScript. For example:

#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
   )
);
SELECT ParseKeys('/?foo=bar&baz=2');
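
One thing to watch with ParseKeys: entry[OFFSET(1)] raises an out-of-bounds error when a pair has no "=value" part (for example a bare flag such as &debug). A minimal sketch of a more forgiving variant, assuming that NULL is an acceptable value for such keys; the name ParseKeysSafe and the sample string are hypothetical, not part of the original answer:

```sql
#standardSQL
-- Hypothetical variant of ParseKeys: SAFE_OFFSET returns NULL instead of
-- raising an error when the requested array position does not exist.
CREATE TEMP FUNCTION ParseKeysSafe(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[SAFE_OFFSET(0)] AS key,
       entry[SAFE_OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);
-- Flatten the array so each key/value pair becomes its own row.
-- The made-up 'debug' pair has no value, so its value column is NULL.
SELECT kv.key, kv.value
FROM UNNEST(ParseKeysSafe('/?foo=bar&debug&baz=2')) AS kv;
```

Values that themselves contain "=" are still truncated at the second "=", since SPLIT breaks on every occurrence.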

Now you can build on it with a function that turns the keys into struct fields:

#standardSQL
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);

Putting everything together, you can try the GetAttributes function with some sample input:

#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
   )
);
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);
SELECT url, GetAttributes(url).*
FROM UNNEST(['/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN',
             '/?id=2343645745&ver=15&os_bits=32&os_type=linux&lng=FR']) AS url;
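
Once the attributes are real columns, the mac/win counts from the question no longer need two subqueries and a join. The following is only a sketch of how that could look: the logs rows are made-up sample data standing in for the real log table, GetVerAndOs is a hypothetical trimmed-down version of GetAttributes, and SAFE_CAST is swapped in so that a non-numeric ver yields NULL rather than an error.

```sql
#standardSQL
-- ParseKeys is repeated so this statement runs on its own.
CREATE TEMP FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);
-- Hypothetical helper: only the two attributes this report needs.
CREATE TEMP FUNCTION GetVerAndOs(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'ver', SAFE_CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type
   FROM UNNEST(ParseKeys(queryString)))
);
WITH logs AS (
  -- Made-up sample rows; in practice this would be the real log table.
  SELECT TIMESTAMP '2017-03-01 08:00:00' AS ts,
         '/?id=1&ver=12&os_bits=64&os_type=mac&lng=EN' AS resource UNION ALL
  SELECT TIMESTAMP '2017-03-01 09:00:00', '/?id=2&ver=11&os_bits=64&os_type=win&lng=EN' UNION ALL
  SELECT TIMESTAMP '2017-03-01 10:00:00', '/?id=3&ver=12&os_bits=32&os_type=win&lng=FR' UNION ALL
  SELECT TIMESTAMP '2017-03-02 08:00:00', '/?id=4&ver=12&os_bits=64&os_type=mac&lng=EN' UNION ALL
  SELECT TIMESTAMP '2017-03-02 09:00:00', '/?id=5&ver=15&os_bits=32&os_type=linux&lng=FR'
)
SELECT
  DATE(ts) AS log_date,
  -- COUNTIF counts both platforms in a single pass, replacing the LEFT JOIN.
  COUNTIF(attrs.os_type = 'win') AS wincount,
  COUNTIF(attrs.os_type = 'mac') AS maccount
FROM (SELECT ts, GetVerAndOs(resource) AS attrs FROM logs)
WHERE attrs.ver IN (11, 12)
GROUP BY log_date
ORDER BY log_date;
```

With the sample rows above this returns two rows: 2017-03-01 with wincount 2 and maccount 1, and 2017-03-02 with wincount 0 and maccount 1. A day with no win rows still appears with a count of 0, which is what the COALESCE in the original join was emulating.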