Aggregate and reduce a nested document and array

Kr0*_*r0e 1 mongodb mongodb-query aggregation-framework

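Edit: Our use case: we receive continuous reports about visitors from our servers. We pre-aggregate the data on the server for a few seconds and then insert these "reports" into MongoDB.

In our dashboard, we want to query the different browsers, operating systems and geolocations (country, etc.) for a given time range.

For example: in the last 7 days there were 1000 visitors using Chrome, 500 visitors from Germany, 200 from the UK, and so on.

I am quite stuck on the MongoDB query needed for the dashboard.

We have report entries like the following: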

{
    "_id" : ObjectId("59b9d08e402025326e1a0f30"),
    "channel_perm_id" : "c361049fb4144b0e81b71c0b6cfdc296",
    "source_id" : "insomnia",
    "start_timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "end_timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "resource_uri" : "b755d62a-8c0a-4e8a-945f-41782c13535b",
    "sources_info" : {
        "browsers" : [
            {
                "name" : "Chrome",
                "count" : NumberLong(2)
            }
        ],
        "operating_systems" : [
            {
                "name" : "Mac OS X",
                "count" : NumberLong(2)
            }
        ],
        "continent_ids" : [
            {
                "name" : "EU",
                "count" : NumberLong(1)
            }
        ],
        "country_ids" : [
            {
                "name" : "DE",
                "count" : NumberLong(1)
            }
        ],
        "city_ids" : [
            {
                "name" : "Solingen",
                "count" : NumberLong(1)
            }
        ]
    },
    "unique_sources" : NumberLong(1),
    "requests" : NumberLong(1),
    "cache_hits" : NumberLong(0),
    "cache_misses" : NumberLong(1),
    "cache_hit_size" : NumberLong(0),
    "cache_refill_size" : NumberLong("170000000000")
}

Now we need to aggregate these reports based on the timestamp. So far, so easy:

db.channel_report.aggregate([{
  $group: {
    _id: {
      $dateToString: {
        format: "%Y",
        date: "$timestamp"
      }
    },
    sources_info: {
      $push: "$sources_info"
    }
  },
}]);

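But now it gets difficult for me. As you may have noticed, the sources_info object is the problem.

Instead of just "pushing" all the source info into an array for each group, we need to actually accumulate it.

So, if we have something like this: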

{
  sources_info: [
    {
      browsers: [
        {
          name: "Chrome",
          count: 1
        }
      ]
    },
    {
      browsers: [
        {
          name: "Chrome",
          count: 1
        }
      ]
    }
  ]
}

The array should be reduced to:

{
  sources_info:
    {
      browsers: [
        {
          name: "Chrome",
          count: 2
        }
      ]
    }
}

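We are migrating from MySQL to MongoDB for our analytics, but I don't know how to model this behaviour in Mongo. Going by the documentation, I almost think it is not possible, at least not with the current data structure.

Is there a good solution for this? Or maybe even a different data structure?

Cheers, Chris from StriveCDN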

Nei*_*unn 7

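The basic problem is that you are using "named keys" where you really should be using "values" on consistent attribute paths. That means that instead of a key such as "browsers", each array entry should simply carry "type": "browser", and so on.

The reason for this should become apparent from the general approach to aggregating the data. It also really helps with querying in general. The approaches below basically involve coercing your initial data format into this kind of structure first, in order to then aggregate it.

With the latest releases (MongoDB 3.4.4 and above) we can work with your named keys via $objectToArray, as follows: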

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"    
                }
              }
            }
          }     
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }  
  }}
])

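Taking that back a little to MongoDB 3.4 (still the default on most hosted services at the time of writing), you can instead declare each of the key names manually: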

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$concatArrays": [
        { "$map": {
          "input": "$sources_info.browsers",
          "in": {
            "type": "browsers",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.operating_systems",
          "in": {
            "type": "operating_systems",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.continent_ids",
          "in": {
            "type": "continent_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.country_ids",
          "in": {
            "type": "country_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }},
        { "$map": {
          "input": "$sources_info.city_ids",
          "in": {
            "type": "city_ids",
            "name": "$$this.name",
            "count": "$$this.count"  
          }  
        }}
      ]  
    }  
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  { "$project": {
    "sources_info": {
      "browsers": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "browsers" ] }
        ]    
      },
      "operating_systems": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "operating_systems" ] }
        ]    
      },
      "continent_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "continent_ids" ] }
        ]    
      },
      "country_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "country_ids" ] }
        ]    
      },
      "city_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "city_ids" ] }
        ]    
      }
    }    
  }}
])

We could even take that back to MongoDB 3.2 by using $map and $filter instead of $indexOfArray, but the general approach is the main thing to explain.
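
The MongoDB 3.2 variant is only mentioned in passing, so here is a rough sketch of how the final lookup might be written with $filter and $map. The helper name lookupByKey, the variable name s and the keyNames list are illustrative assumptions. The expression has not been run against a 3.2 server here, and the earlier stages of the pipeline may need their own adjustments for that version:

```javascript
// Hypothetical helper: builds an expression that returns the "v" array of the
// entry in "$sources" whose "k" equals the given key name.
// It only uses $filter, $map and $arrayElemAt.
function lookupByKey(key) {
  return {
    "$arrayElemAt": [
      { "$map": {
        "input": {
          "$filter": {
            "input": "$sources",
            "as": "s",
            "cond": { "$eq": [ "$$s.k", key ] }
          }
        },
        "as": "s",
        "in": "$$s.v"
      }},
      0
    ]
  };
}

// Candidate replacement for the last $project stage of the pipeline above.
const keyNames = [
  "browsers", "operating_systems", "continent_ids", "country_ids", "city_ids"
];
const finalProject = { "$project": { "sources_info": {} } };
keyNames.forEach(key => {
  finalProject["$project"]["sources_info"][key] = lookupByKey(key);
});

console.log(JSON.stringify(finalProject, null, 2));
```

The resulting finalProject object would take the place of the last stage in the array passed to aggregate().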

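Concatenating arrays

The main thing that needs to happen is to take the data from the many different arrays under named keys and create a "single array" whose "type" property represents each key name. This is arguably how the data should have been stored in the first place, and the first aggregation stage of either approach produces this: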

/* 1 */
{
    "_id" : ObjectId("59b9d08e402025326e1a0f30"),
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"),
    "sources" : [ 
        {
            "type" : "browsers",
            "name" : "Chrome",
            "count" : NumberLong(2)
        }, 
        {
            "type" : "operating_systems",
            "name" : "Mac OS X",
            "count" : NumberLong(2)
        }, 
        {
            "type" : "continent_ids",
            "name" : "EU",
            "count" : NumberLong(1)
        }, 
        {
            "type" : "country_ids",
            "name" : "DE",
            "count" : NumberLong(1)
        }, 
        {
            "type" : "city_ids",
            "name" : "Solingen",
            "count" : NumberLong(1)
        }
    ]
}
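
To make the reshaping easier to follow outside of the aggregation framework, the same transformation can be sketched in plain JavaScript. This is an illustration only: flattenSources is a hypothetical helper, not a MongoDB API, and the sample object is a shortened version of the document in the question.

```javascript
// Plain JavaScript counterpart of the first $project stage, for illustration.
function flattenSources(sourcesInfo) {
  // Object.entries plays the role of $objectToArray: [key, entries] pairs.
  return Object.entries(sourcesInfo)
    // The inner map tags every entry with the key it was stored under.
    .map(([type, entries]) =>
      entries.map(entry => ({ type: type, name: entry.name, count: entry.count })))
    // reduce + concat mirrors $reduce with $concatArrays.
    .reduce((acc, curr) => acc.concat(curr), []);
}

const sample = {
  browsers: [{ name: "Chrome", count: 2 }],
  operating_systems: [{ name: "Mac OS X", count: 2 }],
  country_ids: [{ name: "DE", count: 1 }]
};

const flat = flattenSources(sample);
console.log(JSON.stringify(flat, null, 2));
```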

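Unwind and group

Part of the data you want to accumulate on actually includes the "type" and "name" properties "inside" the array. Whenever you need to accumulate across documents from "within an array", the process to use is $unwind, so that those values can be accessed as part of the grouping key.

This means that after using $unwind on the combined array, you then want to $group on both of those keys and on the reduced "timestamp" detail, in order to $sum the "count" values.

Since you then have "sub-levels" of detail (i.e. each browser name within browsers), you use additional $group pipeline stages, gradually decreasing the granularity of the grouping keys and using $push to accumulate the details into arrays.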
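
As a rough mental model of what these stages do, here is a plain JavaScript sketch that accumulates already flattened entries. The helper accumulate is hypothetical, and the second sample document, including its Firefox entry, is made-up data that does not appear in the question:

```javascript
// Illustration only: mimics $unwind followed by the three $group stages.
function accumulate(docs) {
  const years = new Map(); // year -> Map(type -> Map(name -> count))
  for (const doc of docs) {
    const year = doc.timestamp.getUTCFullYear();   // like { $year: "$timestamp" }
    if (!years.has(year)) years.set(year, new Map());
    const types = years.get(year);
    for (const src of doc.sources) {               // like $unwind
      if (!types.has(src.type)) types.set(src.type, new Map());
      const names = types.get(src.type);
      // like $sum on the { year, type, name } grouping key
      names.set(src.name, (names.get(src.name) || 0) + src.count);
    }
  }
  // Rebuild nested arrays, like the two later $group stages that use $push.
  return Array.from(years, ([year, types]) => ({
    _id: year,
    sources_info: Array.from(types, ([type, names]) => ({
      k: type,
      v: Array.from(names, ([name, count]) => ({ name: name, count: count }))
    }))
  }));
}

const docs = [
  { timestamp: new Date("2017-09-14T00:42:54.510Z"),
    sources: [{ type: "browsers", name: "Chrome", count: 1 }] },
  { timestamp: new Date("2017-09-15T10:00:00.000Z"),
    sources: [{ type: "browsers", name: "Chrome", count: 1 },
              { type: "browsers", name: "Firefox", count: 3 }] }
];

const result = accumulate(docs);
console.log(JSON.stringify(result, null, 2));
```

The two Chrome entries end up as a single entry with a count of 2, which is the reduction the question asks for.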

In either case, omitting the final output stage, the accumulated structure comes out as:

/* 1 */
{
    "_id" : 2017,
    "sources_info" : [ 
        {
            "k" : "continent_ids",
            "v" : [ 
                {
                    "name" : "EU",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "city_ids",
            "v" : [ 
                {
                    "name" : "Solingen",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "country_ids",
            "v" : [ 
                {
                    "name" : "DE",
                    "count" : NumberLong(1)
                }
            ]
        }, 
        {
            "k" : "browsers",
            "v" : [ 
                {
                    "name" : "Chrome",
                    "count" : NumberLong(2)
                }
            ]
        }, 
        {
            "k" : "operating_systems",
            "v" : [ 
                {
                    "name" : "Mac OS X",
                    "count" : NumberLong(2)
                }
            ]
        }
    ]
}

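This really is the final state of the data, albeit not represented in the same form in which it was originally found. It is arguably complete at this point, since any further processing is merely cosmetic, in order to output named keys again.

Output to named keys

As shown, the different approaches either look up the array entries by the matching key name, or use $arrayToObject to convert the array content back into an object with named keys.

An alternative is to simply do that last manipulation in code, as this example using .map() on the cursor result in the shell shows: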

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"    
                }
              }
            }
          }     
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": { 
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }  
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }  
    }  
  }},
  /*
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }  
  }}
  */
]).map( d => Object.assign(d,{
  "sources_info": d.sources_info.reduce((acc,curr) =>
    Object.assign(acc,{ [curr.k]: curr.v }),{})
}))

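This of course works with either aggregation pipeline approach.

And of course $concatArrays can even be replaced with $setUnion, as long as all entries have a unique identifying combination of "name" and "type" (and it appears that they do). Together with modifying the final output by processing the cursor, this means the technique can be applied as far back as MongoDB 2.6.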
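
The $setUnion variant is not written out in the answer, so the following is only a sketch of how the first stage might look. The helper typedEntries, the variable name el and the allKeys list are assumptions made for illustration, and the stage has not been verified against a MongoDB 2.6 server:

```javascript
// Hypothetical helper: tags every entry of one named array with its key name.
function typedEntries(key) {
  return {
    "$map": {
      "input": "$sources_info." + key,
      "as": "el",   // explicit variable name instead of relying on $$this
      "in": { "type": key, "name": "$$el.name", "count": "$$el.count" }
    }
  };
}

const allKeys = [
  "browsers", "operating_systems", "continent_ids", "country_ids", "city_ids"
];

// $setUnion treats its inputs as sets: exact duplicates collapse into one
// entry and the order of the combined array is not guaranteed.
const firstStage = {
  "$project": {
    "timestamp": 1,
    "sources": { "$setUnion": allKeys.map(key => typedEntries(key)) }
  }
};

console.log(JSON.stringify(firstStage, null, 2));
```

Because set semantics drop exact duplicates, this is only safe under the uniqueness condition stated above. The $unwind and $group stages would follow unchanged, with the named keys rebuilt in client code as in the .map() example.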

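Final output

The final output (which would of course actually aggregate, although the question only samples one document) accumulates on all of the sub-keys and is reconstructed from the last sample output shown, as follows: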

{
    "_id" : 2017,
    "sources_info" : {
        "continent_ids" : [ 
            {
                "name" : "EU",
                "count" : NumberLong(1)
            }
        ],
        "city_ids" : [ 
            {
                "name" : "Solingen",
                "count" : NumberLong(1)
            }
        ],
        "country_ids" : [ 
            {
                "name" : "DE",
                "count" : NumberLong(1)
            }
        ],
        "browsers" : [ 
            {
                "name" : "Chrome",
                "count" : NumberLong(2)
            }
        ],
        "operating_systems" : [ 
            {
                "name" : "Mac OS X",
                "count" : NumberLong(2)
            }
        ]
    }
}

Here each array entry under each key of sources_info is reduced to the accumulated count of every other entry sharing the same "name".