Jor*_*rdi 5 text-search mongodb aggregation-framework mongodb-indexes
We are building a simplified version of a search engine on top of MongoDB.
Sample dataset
{ "_id" : 1, "dept" : "tech", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 2, "dept" : "tech", "updDate": ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" }
{ "_id" : 3, "dept" : "kitchen", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" }
{ "_id" : 4, "dept" : "kitchen", "updDate": ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" }
{ "_id" : 5, "dept" : "food", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" }
{ "_id" : 6, "dept" : "food", "updDate": ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" }
{ "_id" : 7, "dept" : "food", "updDate": ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 8, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 9, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
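We want to avoid paginating results with "offset-limit". To do so, we use the "seek method": we modify the "where/match" clause of the query so that the required results can be fetched through an index instead of by iterating over the collection. For more on the seek method, I strongly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way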
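To make the idea concrete, here is a minimal in-memory sketch of seek pagination over two sort keys (updDate and _id, both descending). The helper names compareDesc and nextPage are hypothetical and exist only for this illustration. A real database would answer the same predicate from an index rather than by filtering an array, so this only shows the semantics of "everything that sorts strictly after the last row of the previous page".

```javascript
// Sort order: updDate descending, then _id descending as a unique tie-breaker.
function compareDesc(a, b) {
  return (b.updDate - a.updDate) || (b._id - a._id);
}

// Hypothetical helper: returns the page that follows `last` (the final
// document of the previous page), or the first page when `last` is null.
function nextPage(docs, last, pageSize) {
  const sorted = docs.slice().sort(compareDesc);
  const rest = last === null ? sorted : sorted.filter(function (d) {
    return compareDesc(last, d) < 0; // d sorts strictly after last
  });
  return rest.slice(0, pageSize);
}

// The five sample documents that match "green" in the food/kitchen departments.
const docs = [
  { _id: 3, updDate: new Date("2014-04-27T09:45:35Z") },
  { _id: 5, updDate: new Date("2014-04-27T09:45:35Z") },
  { _id: 7, updDate: new Date("2014-08-28T09:45:35Z") },
  { _id: 8, updDate: new Date("2014-08-27T09:45:35Z") },
  { _id: 9, updDate: new Date("2014-08-27T09:45:35Z") }
];

const page1 = nextPage(docs, null, 2);
const page2 = nextPage(docs, page1[page1.length - 1], 2);
```

No offset is ever counted here: each page is defined only by the sort key of the last document already seen.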
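Search engines typically sort results by score and update date, both descending. To achieve this, we use the text search feature in an aggregation pipeline, as follows.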
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
First page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, {$limit: 2 }] )
{ "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 }
{ "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }
Second page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.75}} , { "$and" : [ { "score" : { "$eq" : 0.75}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-04-27T09:45:35Z")}},{ "$and" : [ { "updDate": { "$eq" : ISODate("2014-04-27T09:45:35Z")}} , { "_id" : { "$lt" : 3}}]}]}]}]}},{$limit: 2 }] )
{ "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
{ "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
And the last page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]} , "$text" : { "$language" : "en", "$search" : "green"} } }, { $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.6666666666666666}} , { "$and" : [ { "score" : { "$eq" : 0.6666666666666666}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-08-27T09:45:35Z")}} , { "$and" : [ { "updDate" : { "$eq" : ISODate("2014-08-27T09:45:35Z")}} , { "_id" : { "$lt" : 9}}]}]}]}]}}, {$limit: 2 }] )
{ "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
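Note how we sort the results by score, update date and _id, and how, in the second match stage, we try to paginate using the score, the update date and finally the _id of the last document of the previous page.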
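The second match stage follows a fixed pattern, so it can be generated from the last document of the previous page. Below is a minimal sketch; buildSeekMatch is a hypothetical helper name, not a MongoDB API. It uses implicit equality and implicit $and, which I assume to be equivalent to the explicit "$eq"/"$and" form used in the pipelines above. It also assumes that the textScore value read from the previous page compares exactly equal when sent back, which may be fragile for floating-point scores.

```javascript
// Hypothetical helper: builds the pagination $match stage from the last
// document of the previous page (fields: score, updDate, _id).
function buildSeekMatch(last) {
  return {
    $match: {
      $or: [
        // Strictly lower score: always sorts after the previous page.
        { score: { $lt: last.score } },
        {
          // Same score: fall back to updDate, then to _id as tie-breaker.
          score: last.score,
          $or: [
            { updDate: { $lt: last.updDate } },
            { updDate: last.updDate, _id: { $lt: last._id } }
          ]
        }
      ]
    }
  };
}

// Last document of the first page in the example above.
const lastOfFirstPage = {
  _id: 3,
  updDate: new Date("2014-04-27T09:45:35Z"),
  score: 0.75
};
const seekStage = buildSeekMatch(lastOfFirstPage);
```

The resulting seekStage would be placed between the $sort and $limit stages, in the same position as the hand-written stage in the second-page query.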
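The index was created taking into account that a text query cannot be covered by the prefix fields of a text index, see issue https://jira.mongodb.org/browse/SERVER-13018, although I am not sure whether this applies to our case.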
Since the "executionStats" and "allPlansExecution" modes do not work in the aggregation framework (see https://jira.mongodb.org/browse/SERVER-19758), I do not know how MongoDB tries to resolve the query.
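As far as I can tell, the only plan information available for a pipeline is what the explain option of aggregate() returns, which describes the chosen plan but carries no execution statistics. A sketch of how the first-page pipeline could be inspected that way is below. The guard on db is only there so the snippet also loads outside the mongo shell, and the variable name firstPagePipeline is my own.

```javascript
// First-page pipeline from above, kept in a variable so it can be reused.
const firstPagePipeline = [
  { $match: { dept: { $in: ["food", "kitchen"] },
              $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $limit: 2 }
];

// Only runs inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  // `explain: true` asks for the plan instead of the result documents.
  printjson(db.inventory.aggregate(firstPagePipeline, { explain: true }));
}
```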
Index intersection does not work with text search either, see https://jira.mongodb.org/browse/SERVER-3071 (resolved in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient-indexing-in-mongodb-26, where the author says:
As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.
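I have read sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) of https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf several times without reaching any clear conclusion.
So, from a text search standpoint, what is the best indexing strategy for this collection?
One index for the first match stage and another index for the second (pagination) match stage?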
db.inventory.createIndex({description:"text", dept: -1})
db.inventory.createIndex({updDate: -1, _id: -1})
A compound index that covers the fields of both match stages?
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
None of the above?
Thanks