Ale*_*lex 1 mapreduce mongodb aggregation-framework
I am building an application that can be compared to a dating app.
I have some documents structured like this:
$ db.profiles.find().pretty()
[
{
"_id": 1,
"firstName": "John",
"lastName": "Smith",
"fieldValues": [
"favouriteColour|red",
"food|pizza",
"food|chinese"
]
},
{
"_id": 2,
"firstName": "Sarah",
"lastName": "Jane",
"fieldValues": [
"favouriteColour|blue",
"food|pizza",
"food|mexican",
"pets|yes"
]
},
{
"_id": 3,
"firstName": "Rachel",
"lastName": "Jones",
"fieldValues": [
"food|pizza"
]
}
]
I am trying to identify profiles that match each other on one or more fieldValues.
So, for the example above, my ideal result would look like:
<some query>
result:
[
{
"_id": "507f1f77bcf86cd799439011",
"dateCreated": "2013-12-01",
"profiles": [
{
"_id": 1,
"firstName": "John",
"lastName": "Smith",
"fieldValues": [
"favouriteColour|red",
"food|pizza",
"food|chinese"
]
},
{
"_id": 2,
"firstName": "Sarah",
"lastName": "Jane",
"fieldValues": [
"favouriteColour|blue",
"food|pizza",
"food|mexican",
"pets|yes"
]
}
]
},
{
"_id": "356g1dgk5cf86cd737858595",
"dateCreated": "2013-12-02",
"profiles": [
{
"_id": 1,
"firstName": "John",
"lastName": "Smith",
"fieldValues": [
"favouriteColour|red",
"food|pizza",
"food|chinese"
]
},
{
"_id": 3,
"firstName": "Rachel",
"lastName": "Jones",
"fieldValues": [
"food|pizza"
]
}
]
}
]
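I have considered doing this either as a map-reduce job or with the aggregation framework.
Either way, the 'result' would be persisted to a collection (as per the 'result' above).
My question is: which of the two is better suited? And where would I start in implementing this?
Edit
In short, the model cannot easily be changed.
It is not a "profile" in the traditional sense.
What I am basically trying to do (in pseudocode) is this: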
foreach profile in db.profiles.find()
    foreach otherProfile in db.profiles.find({"_id": {$ne: profile._id}})
        if profile.fieldValues matches any otherProfile.fieldValues
            // it's a match!
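Obviously, that kind of operation is very slow!
It is also worth mentioning that this data is never displayed; it is literally just a string value used for "matching".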
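MapReduce would run JavaScript in a separate thread and use the code you provide to emit and reduce parts of your documents in order to aggregate on certain fields. You can certainly look at the exercise as aggregating over each "fieldValue". The aggregation framework can do this as well, and much faster, because the aggregation runs on the server in C++ rather than in a separate JavaScript thread. However, the aggregation framework may return more than 16MB of data, in which case you would need to partition the data set in a more complex way.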
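To make the "aggregate over each fieldValue" idea concrete, here is a minimal in-memory sketch of the emit/reduce flow in plain JavaScript. It is not MongoDB's mapReduce API: `mapProfile`, `reduceIds`, and `groupByFieldValue` are hypothetical helper names, and the sample array is a trimmed copy of the documents in the question.

```javascript
// In-memory emulation of the map/emit/reduce flow; not MongoDB's mapReduce API.
const profiles = [
  { _id: 1, fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, fieldValues: ["food|pizza"] },
];

// "map" phase: emit one (fieldValue, profileId) pair per array element.
function mapProfile(profile, emit) {
  profile.fieldValues.forEach((value) => emit(value, profile._id));
}

// "reduce" phase: collapse all ids emitted under one key into a single record.
function reduceIds(key, ids) {
  return { fieldValue: key, profileIds: ids, count: ids.length };
}

function groupByFieldValue(docs) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach((doc) => mapProfile(doc, emit));
  return Array.from(emitted, ([key, ids]) => reduceIds(key, ids));
}

// Keep only the values shared by at least two profiles.
const shared = groupByFieldValue(profiles).filter((g) => g.count > 1);
console.log(JSON.stringify(shared));
```

With the sample data only "food|pizza" survives the final filter, since it is the only value that more than one profile carries.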
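But the problem seems much simpler than that. For each profile, you just want to find the other profiles that share particular attributes with it. Without knowing the size of your data set or your performance requirements, I am going to assume that you have an index on fieldValues, so that querying on it is efficient. You can then get the results you want with this simple loop: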
> db.profiles.find().forEach( function(p) {
print("Matching profiles for "+tojson(p));
printjson(
db.profiles.find(
{"fieldValues": {"$in" : p.fieldValues},
"_id" : {$gt:p._id}}
).toArray()
);
} );
Output:
Matching profiles for {
"_id" : 1,
"firstName" : "John",
"lastName" : "Smith",
"fieldValues" : [
"favouriteColour|red",
"food|pizza",
"food|chinese"
]
}
[
{
"_id" : 2,
"firstName" : "Sarah",
"lastName" : "Jane",
"fieldValues" : [
"favouriteColour|blue",
"food|pizza",
"food|mexican",
"pets|yes"
]
},
{
"_id" : 3,
"firstName" : "Rachel",
"lastName" : "Jones",
"fieldValues" : [
"food|pizza"
]
}
]
Matching profiles for {
"_id" : 2,
"firstName" : "Sarah",
"lastName" : "Jane",
"fieldValues" : [
"favouriteColour|blue",
"food|pizza",
"food|mexican",
"pets|yes"
]
}
[
{
"_id" : 3,
"firstName" : "Rachel",
"lastName" : "Jones",
"fieldValues" : [
"food|pizza"
]
}
]
Matching profiles for {
"_id" : 3,
"firstName" : "Rachel",
"lastName" : "Jones",
"fieldValues" : [
"food|pizza"
]
}
[ ]
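Obviously, you can tweak the query so that it no longer excludes already-matched profiles (by changing {$gt: p._id} to {$ne: p._id}), along with other tweaks. But I am not sure what additional value you would get from using the aggregation framework or mapreduce, since this is not really aggregating a single collection on one of its fields (judging by the output format you show). If your output format requirements are flexible, you could certainly use one of the built-in aggregation options as well.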
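The difference between the `$gt` and `$ne` variants can be sketched in plain JavaScript without a database. The helper names `matchesFor` and `pairs` are hypothetical, and the filters only imitate what the `$in` and `_id` conditions of the shell query do. The sample array is a trimmed copy of the documents in the question.

```javascript
// Plain-JavaScript stand-in for the shell loop; helper names are hypothetical.
const sampleProfiles = [
  { _id: 1, fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, fieldValues: ["food|pizza"] },
];

// Imitates find({fieldValues: {$in: p.fieldValues}, _id: {<op>: p._id}}).
function matchesFor(all, p, op) {
  return all
    .filter((other) => (op === "$gt" ? other._id > p._id : other._id !== p._id))
    .filter((other) => other.fieldValues.some((v) => p.fieldValues.includes(v)))
    .map((other) => other._id);
}

// Collects [profileId, matchedId] for every profile in the collection.
function pairs(all, op) {
  return all.flatMap((p) => matchesFor(all, p, op).map((id) => [p._id, id]));
}

console.log(JSON.stringify(pairs(sampleProfiles, "$gt"))); // [[1,2],[1,3],[2,3]]
console.log(JSON.stringify(pairs(sampleProfiles, "$ne"))); // [[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]]
```

With `$gt` each pair is reported once. With `$ne` every pair shows up in both directions, which may or may not be what the persisted result collection needs.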
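I did check what it would look like if the aggregation were done around the individual fieldValues, and it is not bad either. If your output can match this, it may help you: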
> db.profiles.aggregate({$unwind:"$fieldValues"},
{$group:{_id:"$fieldValues",
matchedProfiles : {$push:
{ id:"$_id",
name:{$concat:["$firstName"," ", "$lastName"]}}},
num:{$sum:1}
}},
{$match:{num:{$gt:1}}});
{
"result" : [
{
"_id" : "food|pizza",
"matchedProfiles" : [
{
"id" : 1,
"name" : "John Smith"
},
{
"id" : 2,
"name" : "Sarah Jane"
},
{
"id" : 3,
"name" : "Rachel Jones"
}
],
"num" : 3
}
],
"ok" : 1
}
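This basically says: "for each fieldValue ($unwind), group by fieldValue an array of the matching profile _ids and names, counting how many matches each fieldValue accumulates ($group), and then exclude the ones that have only one profile matching them ($match)."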
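If the persisted result has to be pairs of profiles, as in the question, rather than one group per fieldValue, the grouped output can be expanded afterwards. The following is only a sketch under that assumption: `toPairs` is a hypothetical helper, and `aggregationResult` is a hand-copied version of the "result" array shown above.

```javascript
// Hypothetical post-processing of the aggregation output shown above.
const aggregationResult = [
  {
    _id: "food|pizza",
    matchedProfiles: [
      { id: 1, name: "John Smith" },
      { id: 2, name: "Sarah Jane" },
      { id: 3, name: "Rachel Jones" },
    ],
    num: 3,
  },
];

// Expand each group into unordered pairs and record which values each pair shares.
function toPairs(groups) {
  const byPair = new Map();
  for (const group of groups) {
    const ids = group.matchedProfiles.map((p) => p.id).sort((a, b) => a - b);
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = ids[i] + ":" + ids[j];
        if (!byPair.has(key)) {
          byPair.set(key, { profiles: [ids[i], ids[j]], sharedValues: [] });
        }
        byPair.get(key).sharedValues.push(group._id);
      }
    }
  }
  return Array.from(byPair.values());
}

console.log(JSON.stringify(toPairs(aggregationResult)));
```

Keying the map on the sorted id pair means two profiles that share several fieldValues still end up as a single entry, with every shared value listed under it.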