汇总和减少嵌套文档和数组

Question

汇总和减少嵌套文档和数组

Kr0*_*r0e 1 mongodb mongodb-query aggregation-framework

编辑：我们的用例：我们从服务器获取有关访客的继续报告。将这些“报告”插入MongoDB之后，我们会在服务器上预聚合几秒钟的数据。

在我们的信息中心中，我们想根据时间范围查询不同的浏览器，操作系统，地理位置（国家等）。

就像这样：在过去7天内，有1000位使用Chrome的访客，来自德国的500位访客，来自英国的200位，依此类推。

我非常需要仪表板所需的MongoDB查询。

我们有以下报告条目：

{
    "_id" : ObjectId("59b9d08e402025326e1a0f30"),
    "channel_perm_id" : "c361049fb4144b0e81b71c0b6cfdc296",
    "source_id" : "insomnia",
    "start_timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "end_timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "resource_uri" : "b755d62a-8c0a-4e8a-945f-41782c13535b",
    "sources_info" : {
        "browsers" : [
            {
                "name" : "Chrome",
                "count" : NumberLong(2)
            }
        ],
        "operating_systems" : [
            {
                "name" : "Mac OS X",
                "count" : NumberLong(2)
            }
        ],
        "continent_ids" : [
            {
                "name" : "EU",
                "count" : NumberLong(1)
            }
        ],
        "country_ids" : [
            {
                "name" : "DE",
                "count" : NumberLong(1)
            }
        ],
        "city_ids" : [
            {
                "name" : "Solingen",
                "count" : NumberLong(1)
            }
        ]
    },
    "unique_sources" : NumberLong(1),
    "requests" : NumberLong(1),
    "cache_hits" : NumberLong(0),
    "cache_misses" : NumberLong(1),
    "cache_hit_size" : NumberLong(0),
    "cache_refill_size" : NumberLong("170000000000")
}

Run Code Online (Sandbox Code Playgroud)

现在，我们需要基于时间戳汇总这些报告。到目前为止，如此简单：

db.channel_report.aggregate([{
  $group: {
    _id: {
      $dateToString: {
        format: "%Y",
        date: "$timestamp"
      }
    },
    sources_info: {
      $push: "$sources_info"
    }
  },
}];

Run Code Online (Sandbox Code Playgroud)

但是现在对我来说变得困难了。您可能已经注意到，sources_info对象就是问题所在。

不仅仅是将所有源信息“推送”到每个组的数组中，我们还需要实际积累它。

所以，如果我们有这样的事情：

{
  sources_info: [
    {
      browsers: [
        {
          name: "Chrome, 
          count: 1
        }
      ]
    },
    {
      browsers: [
        {
          name: "Chrome, 
          count: 1
        }
      ]
    }
  ]
}

Run Code Online (Sandbox Code Playgroud)

数组应简化为：

{
  sources_info:
    {
      browsers: [
        {
          name: "Chrome, 
          count: 2
        }
      ]
    }
}

Run Code Online (Sandbox Code Playgroud)

我们从MySQL迁移到MongoDB进行分析，但是我不知道如何在Mongo中对该行为进行建模。关于文档，我几乎认为这是不可能的，至少当前的数据结构是不可能的。

有一个好的解决方案吗？甚至是另一种数据结构？

干杯，克里斯，来自StriveCDN

Answer 1

Nei*_*unn 7

您遇到的基本问题是，您正在使用“命名键”，实际上您实际上应该使用对一致的属性路径使用值。这意味着在每个条目上都"browsers"应该简单地代替诸如这样的键"type": "browser"。

在汇总数据的一般方法上，其原因应显而易见。一般而言，它也确实有助于查询。但是这些方法基本上涉及将您的初始数据格式强制为这种结构，以便首先进行聚合。

在最新版本（MongoDB 3.4.4及更高版本）中，我们可以通过$objectToArray并按以下方式使用您的命名键：

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"    
                }
              }
            }
          }     
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }  
  }}
])

Run Code Online (Sandbox Code Playgroud)

让我们回想起MongoDB 3.4（目前在大多数托管服务上默认为默认），您可以手动声明每个密钥名称：

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$concatArrays": [
        { "$map": {
          "input": "$sources_info.browsers",
          "in": {
            "type": "browsers",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.operating_systems",
          "in": {
            "type": "operating_systems",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.continent_ids",
          "in": {
            "type": "continent_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.country_ids",
          "in": {
            "type": "country_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.city_ids",
          "in": {
            "type": "city_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }}
      ]  
    }  
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  { "$project": {
    "sources_info": {
      "browsers": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "browsers" ] }
        ]    
      },
      "operating_systems": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "operating_systems" ] }
        ]    
      },
      "continent_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "continent_ids" ] }
        ]    
      },
      "country_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "country_ids" ] }
        ]    
      },
      "city_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "city_ids" ] }
        ]    
      }
    }    
  }}
])

Run Code Online (Sandbox Code Playgroud)

我们甚至可以使用$map和$filter代替$indexOfArray，将其返回到MongoDB 3.2 ，但主要的方法是解释。

串联数组

需要发生的主要事情是使用命名键从许多不同的数组中获取数据，并创建一个“单个数组”，其"type"属性代表每个键名。可以说，这首先应该是如何存储数据的，而两种方法的第一个聚合阶段都是这样的：

/* 1 */
{
    "_id" : ObjectId("59b9d08e402025326e1a0f30"),
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "sources" : [ 
        {
            "type" : "browsers",
            "name" : "Chrome",
            "count" : NumberLong(2)
        }, 
        {
            "type" : "operating_systems",
            "name" : "Mac OS X",
            "count" : NumberLong(2)
        }, 
        {
            "type" : "continent_ids",
            "name" : "EU",
            "count" : NumberLong(1)
        }, 
        {
            "type" : "country_ids",
            "name" : "DE",
            "count" : NumberLong(1)
        }, 
        {
            "type" : "city_ids",
            "name" : "Solingen",
            "count" : NumberLong(1)
        }
    ]
}

Run Code Online (Sandbox Code Playgroud)

放松和分组

您要累积的部分数据实际上包括那些"type"和"name"数组“内部”的属性。每当您需要从“数组内”跨文档累积时，使用的过程就是$unwind为了能够将这些值作为分组键的一部分来访问。

这意味着，使用后$unwind的综合阵列上，你再要$group在这两个键和减少的"timestamp"细节，以$sum该"count"值。

由于您随后具有详细信息的“子级别”（即，浏览器中浏览器的每个名称），因此您将使用其他$group管道阶段，从而逐渐减小分组键的粒度，并$push用于将详细信息累积到数组中。

无论哪种情况，省略输出的最后阶段，累积的结构将显示为：

/* 1 */
{
    "_id" : 2017,
    "sources_info" : [ 
        {
            "k" : "continent_ids",
            "v" : [ 
                {
                    "name" : "EU",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "city_ids",
            "v" : [ 
                {
                    "name" : "Solingen",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "country_ids",
            "v" : [ 
                {
                    "name" : "DE",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "browsers",
            "v" : [ 
                {
                    "name" : "Chrome",
                    "count" : NumberLong(2)
                }
            ]
        }, 
        {
            "k" : "operating_systems",
            "v" : [ 
                {
                    "name" : "Mac OS X",
                    "count" : NumberLong(2)
                }
            ]
        }
    ]
}

Run Code Online (Sandbox Code Playgroud)

这实际上是数据的最终状态，尽管没有以与最初找到时相同的形式表示。在这一点上可以说是完整的，因为任何进一步的处理都只是为了重新输出为命名键的装饰。

输出到命名键

如图所示，各种方法要么通过匹配的键名称查找数组条目，要么通过使用$arrayToObject将数组内容转换回具有命名键的对象的方法。

另一种方法是简单地在代码中进行最后的操作，如.map()在shell中操作游标结果的示例所示：

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"    
                }
              }
            }
          }     
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  /*
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }  
  }}
  */
]).map( d => Object.assign(d,{
  "sources_info": d.sources_info.reduce((acc,curr) =>
    Object.assign(acc,{ [curr.k]: curr.v }),{})
}))

Run Code Online (Sandbox Code Playgroud)

当然，哪种方法适用于任一聚合管道方法。

当然，即使所有条目都具有和的唯一标识组合，并且甚至$concatArrays可以替换$setUnion为（和它们看起来一样），这意味着通过处理光标修改最终输出的应用，甚至可以应用该技术。追溯到MongoDB 2.6。"name""type"

最终输出

最后的输出（当然实际上是汇总的，但是问题仅对一个文档进行了采样）针对所有子键进行累积，并从最后的采样输出中进行重构，如下所示：

{
    "_id" : 2017,
    "sources_info" : {
        "continent_ids" : [ 
            {
                "name" : "EU",
                "count" : NumberLong(1)
            }
        ],
        "city_ids" : [ 
            {
                "name" : "Solingen",
                "count" : NumberLong(1)
            }
        ],
        "country_ids" : [ 
            {
                "name" : "DE",
                "count" : NumberLong(1)
            }
        ],
        "browsers" : [ 
            {
                "name" : "Chrome",
                "count" : NumberLong(2)
            }
        ],
        "operating_systems" : [ 
            {
                "name" : "Mac OS X",
                "count" : NumberLong(2)
            }
        ]
    }
}

Run Code Online (Sandbox Code Playgroud)

其中每个键项下的每个数组项sources_info都减少为共享相同项的每个其他项的累加计数"name"。

归档时间：	8 年，8 月前
查看次数：	2529 次
最近记录：	8 年，8 月前