Bla*_*lam 2 group-by mongodb aggregation-framework
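I have the following documents in my collection. Each document contains the text of a tweet and an array of entities extracted from that tweet (using AWS Comprehend):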
{
    "text" : "some tweet by John Smith in New York about Stack Overflow",
    "entities" : [
        {
            "Type" : "ORGANIZATION",
            "Text" : "stack overflow"
        },
        {
            "Type" : "LOCATION",
            "Text" : "new york"
        },
        {
            "Type" : "PERSON",
            "Text" : "john smith"
        }
    ]
},
{
    "text" : "another tweet by John Smith but this one from California and about Google",
    "entities" : [
        {
            "Type" : "ORGANIZATION",
            "Text" : "google"
        },
        {
            "Type" : "LOCATION",
            "Text" : "california"
        },
        {
            "Type" : "PERSON",
            "Text" : "john smith"
        }
    ]
}
I want to get a distinct list of entities.Text values, grouped by entities.Type, WITH the number of occurrences of each entities.Text, like this:
{ "_id" : "ORGANIZATION", "values" : [ {text:"stack overflow",count:1},{text:"google",count:1} ] }
{ "_id" : "LOCATION", "values" : [ {text:"new york",count:1},{text:"california",count:1} ] }
{ "_id" : "PERSON", "values" : [ {text:"john smith",count:2} ] }
I can group by entities.Type and push ALL entities.Text values into an array with the following query:
db.collection.aggregate([
    {
        $unwind: '$entities'
    },
    {
        $group: {
            _id: '$entities.Type',
            values: {
                $push: '$entities.Text'
            }
        }
    }
])
This produces the following output, which contains duplicate values and no counts:
{ "_id" : "ORGANIZATION", "values" : [ "stack overflow", "google" ] }
{ "_id" : "LOCATION", "values" : [ "new york", "california" ] }
{ "_id" : "PERSON", "values" : [ "john smith", "john smith" ] }
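I started down the path of using $project as the last step of the aggregation and adding a computed field, valuesMap, with a JavaScript function. But then I realized you can't write JavaScript in an aggregation pipeline.
My next step would be to post-process the MongoDB output with plain JavaScript, but I would like (for the sake of learning) to do it all in a MongoDB query.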
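For reference, the plain JavaScript fallback mentioned above might look roughly like the sketch below. It assumes the pipeline output has already been fetched into an array; `grouped`, `countValues`, and `result` are hypothetical names chosen for illustration, not part of MongoDB or any driver.

```javascript
// Output of the $unwind + $group pipeline, copied from the question.
const grouped = [
  { _id: "ORGANIZATION", values: ["stack overflow", "google"] },
  { _id: "LOCATION", values: ["new york", "california"] },
  { _id: "PERSON", values: ["john smith", "john smith"] },
];

// Hypothetical helper (not a MongoDB API): collapses the duplicates in each
// `values` array into { text, count } pairs.
function countValues(groups) {
  return groups.map(({ _id, values }) => {
    const counts = new Map(); // a Map keeps keys in first-seen order
    for (const text of values) {
      counts.set(text, (counts.get(text) || 0) + 1);
    }
    return {
      _id,
      values: Array.from(counts, ([text, count]) => ({ text, count })),
    };
  });
}

const result = countValues(grouped);
console.log(JSON.stringify(result, null, 2));
```

This works, but it moves the counting to the client, which is what the question is trying to avoid.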
Thanks!
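You can try the query below. You need an additional $group: the first $group counts each (Type, Text) pair, and the second groups by Type and pushes each text together with its count.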
db.collection.aggregate([
    {"$unwind": "$entities"},
    {"$group": {
        "_id": {"type": "$entities.Type", "text": "$entities.Text"},
        "count": {"$sum": 1}
    }},
    {"$group": {
        "_id": "$_id.type",
        "values": {"$push": {"text": "$_id.text", "count": "$count"}}
    }}
])
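To see what each stage contributes, the pipeline above can be imitated in plain JavaScript on the two sample documents. This is only an illustrative sketch that runs without a database: `docs`, `unwound`, `pairCounts`, `byType`, and `emulated` are made-up names, and the real $group stage does not guarantee any output order, whereas this emulation happens to keep first-seen order.

```javascript
// Sample documents copied from the question.
const docs = [
  {
    text: "some tweet by John Smith in New York about Stack Overflow",
    entities: [
      { Type: "ORGANIZATION", Text: "stack overflow" },
      { Type: "LOCATION", Text: "new york" },
      { Type: "PERSON", Text: "john smith" },
    ],
  },
  {
    text: "another tweet by John Smith but this one from California and about Google",
    entities: [
      { Type: "ORGANIZATION", Text: "google" },
      { Type: "LOCATION", Text: "california" },
      { Type: "PERSON", Text: "john smith" },
    ],
  },
];

// Like $unwind: one record per entity. Simplification: only the entity is
// kept, because the later stages read nothing else from the document.
const unwound = docs.flatMap((doc) => doc.entities);

// Like the first $group: count each distinct (Type, Text) pair.
const pairCounts = new Map();
for (const { Type, Text } of unwound) {
  const key = JSON.stringify([Type, Text]);
  const entry = pairCounts.get(key) || { _id: { type: Type, text: Text }, count: 0 };
  entry.count += 1;
  pairCounts.set(key, entry);
}

// Like the second $group: regroup by type and push { text, count }.
const byType = new Map();
for (const { _id, count } of pairCounts.values()) {
  if (!byType.has(_id.type)) {
    byType.set(_id.type, { _id: _id.type, values: [] });
  }
  byType.get(_id.type).values.push({ text: _id.text, count });
}

const emulated = Array.from(byType.values());
console.log(JSON.stringify(emulated, null, 2));
```

The intermediate `pairCounts` map corresponds to the output of the first $group, where duplicates such as the two "john smith" entries have already been merged into one record with count 2.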