SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)

Onk*_*tem 6 sql json group-by aggregate-functions jq

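Similar questions have been asked here before:

  • Counting items for a single key: jq count the number of items in json by a specific key
  • Calculating the sum of object values: How do I sum the values in an array of maps in jq?

How can the COUNT aggregate function be emulated so that it behaves like its SQL original? Let's extend the question further to include other regular SQL functions: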

  • COUNT
  • SUM/MAX/MIN/AVG
  • ARRAY_AGG

The last one is not a standard SQL function (it comes from PostgreSQL), but it is very useful.

The input is a stream of valid JSON objects. For demonstration, let's pick a simple story about owners and their pets.

Model and data

Base relation: Owner

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

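Source

From the above we get a derived relation Owner_Pet (the result of an SQL JOIN of the two base relations), presented as JSON, which serves as the source data for our jq queries: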

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Requests

Here are sample requests and their expected output:

  • Count the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
  • Sum up the litter values per owner and get their maximum (or minimum/average):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
  • ARRAY_AGG the pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }

Cor*_*mer 9

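Here is an alternative that uses only basic jq, without any custom functions. (I took the liberty of dropping the superfluous parts of the question.)

Count

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, pets_count: map(.pet) | length})'
Out> [{"owner_id": 1, "pets_count": 2}, ...]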

Sum

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": 1, "sum": 6}, ...]

Max

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": 1, "max": 4}, ...]

Aggregate

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet)})'
Out> [{"owner_id": 1, "agg": ["Bella", "Lucy"]}, ...]

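Of course, these may not be the most efficient implementations, but they show nicely how to implement custom aggregates yourself. All that changes between the different functions is the body of the last map and the function after the pipe | (length, add, max).

The first map iterates over the groups and takes the key from the first item of each group; the inner map then iterates over the items of that same group. Not as pretty as SQL, but not much more complicated either.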
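
The pattern described above (group, take the key from the first item, map over the items of the group) also covers the remaining aggregates from the question, MIN and AVG included. The following is only a minimal sketch, assuming jq 1.5 or later; the data is the question's, trimmed to the fields used, and the output key names (pets_count, litter_avg and so on) are picked here for illustration:

```shell
# Sample rows from the question, trimmed to the fields used below.
data='{"owner_id":1,"pet":"Bella","litter":4}
{"owner_id":1,"pet":"Lucy","litter":2}
{"owner_id":2,"pet":"Daisy","litter":3}
{"owner_id":3,"pet":"Molly","litter":4}
{"owner_id":4,"pet":"Lola","litter":2}
{"owner_id":4,"pet":"Sadie","litter":4}
{"owner_id":4,"pet":"Luna","litter":3}'

# -s slurps the stream into one array; -c prints one compact object per line.
result=$(printf '%s\n' "$data" | jq -sc '
  group_by(.owner_id)                            # one sub-array per owner
  | map({
      owner_id:   .[0].owner_id,                 # grouping key, from the first row
      pets_count: length,                        # COUNT(*)
      litter_sum: (map(.litter) | add),          # SUM
      litter_min: (map(.litter) | min),          # MIN
      litter_max: (map(.litter) | max),          # MAX
      litter_avg: (map(.litter) | add / length)  # AVG
    })
  | .[]                                          # back to a stream of objects
')
printf '%s\n' "$result"
```

For owner 4 this should print {"owner_id":4,"pets_count":3,"litter_sum":9,"litter_min":2,"litter_max":4,"litter_avg":3}. Note that length counts every row of the group, like COUNT(*); to mimic COUNT(pet), which skips NULLs, the rows could first be filtered with map(select(.pet != null)).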

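I only started learning jq today and already managed to do this, so it should be encouraging for anyone getting started. jq is like neither sed nor SQL, but it is not very hard either.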


Rom*_*est 3

Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
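
The three functions above differ only in their aggregate step, and they rely on del() to strip the pet columns. As a possible variation (a sketch, not part of the answer above), the owner columns can be picked explicitly and the aggregate passed in as a filter argument. The helper name per_owner and the choice of kept columns are assumptions made for this illustration:

```shell
# Full source data from the question.
data='{"owner_id":1,"owner":"Adams","age":25,"pet_id":10,"pet":"Bella","litter":4}
{"owner_id":1,"owner":"Adams","age":25,"pet_id":20,"pet":"Lucy","litter":2}
{"owner_id":2,"owner":"Baker","age":55,"pet_id":30,"pet":"Daisy","litter":3}
{"owner_id":3,"owner":"Clark","age":40,"pet_id":40,"pet":"Molly","litter":4}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":50,"pet":"Lola","litter":2}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":60,"pet":"Sadie","litter":4}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":70,"pet":"Luna","litter":3}'

result=$(printf '%s\n' "$data" | jq -sc '
  # per_owner is a hypothetical helper name; agg is any filter that turns
  # a group (an array of rows) into an object of aggregate columns.
  def per_owner(agg):
    group_by(.owner_id)[]
    | (.[0] | {owner_id, owner, age})   # keep only the GROUP BY columns
      + agg;                            # merge in the aggregate columns
  per_owner({pets: map(.pet), litter_total: (map(.litter) | add)})
')
printf '%s\n' "$result"
```

Selecting the columns to keep, rather than deleting the ones to drop, means an extra field in the input does not leak into the result, which is closer to how SQL treats the GROUP BY column list.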