MongoDB index optimization when using text search in the aggregation framework

Jor*_*rdi 5 text-search mongodb aggregation-framework mongodb-indexes

We are building a simplified version of a search engine on top of MongoDB.

Sample data set

{ "_id" : 1, "dept" : "tech", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 2, "dept" : "tech", "updDate":  ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" }
{ "_id" : 3, "dept" : "kitchen", "updDate":  ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" }
{ "_id" : 4, "dept" : "kitchen", "updDate":  ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" }
{ "_id" : 5, "dept" : "food", "updDate":  ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" }
{ "_id" : 6, "dept" : "food", "updDate":  ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" }
{ "_id" : 7, "dept" : "food", "updDate":  ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 8, "dept" : "food", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 9, "dept" : "food", "updDate":  ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }

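We want to avoid paginating results with "offset-limit". To do that, we use the "seek method": we modify the "where/match" clause of the query so that an index, rather than an iteration over the collection, retrieves the desired results. For more information on the seek method, I strongly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way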
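To make the difference concrete, here is a minimal in-memory sketch in plain JavaScript, not MongoDB code. The helper names (`compareRows`, `pageByOffset`, `pageBySeek`) are made up for illustration, and the rows are a subset of the sample data with dates kept as strings. In a real database the seek predicate would ideally be answered by an index; here it is just an array filter.

```javascript
// Rows taken from the sample data set. Dates are kept as ISO strings, which
// compare correctly as plain strings because they all share one format.
const rows = [
  { _id: 3, updDate: "2014-04-27T09:45:35Z" },
  { _id: 5, updDate: "2014-04-27T09:45:35Z" },
  { _id: 7, updDate: "2014-08-28T09:45:35Z" },
  { _id: 8, updDate: "2014-08-27T09:45:35Z" },
  { _id: 9, updDate: "2014-08-27T09:45:35Z" },
];

// Total order: updDate descending, then _id descending as the tie-breaker.
function compareRows(a, b) {
  if (a.updDate !== b.updDate) return a.updDate < b.updDate ? 1 : -1;
  return b._id - a._id;
}

const sorted = [...rows].sort(compareRows);

// Offset-limit: the first `offset` rows are walked over and thrown away.
function pageByOffset(offset, limit) {
  return sorted.slice(offset, offset + limit);
}

// Seek: keep only rows that sort strictly after the last row already seen.
function pageBySeek(last, limit) {
  const rest = last ? sorted.filter((r) => compareRows(last, r) < 0) : sorted;
  return rest.slice(0, limit);
}

const page1 = pageBySeek(null, 2);                    // _id 7, then _id 9
const page2 = pageBySeek(page1[page1.length - 1], 2); // _id 8, then _id 5
```

Both approaches return the same pages, but the seek variant only needs the last row of the previous page, not a count of rows to skip.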

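Search engines typically sort results by score and update date, in descending order. To achieve this, we use the text search feature in an aggregation pipeline, as follows.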

db.inventory.createIndex({description: "text", dept: -1, updDate: -1, _id: -1})

First page

db.inventory.aggregate(  [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, {$limit:  2 }]  )


{ "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 }
{ "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }

Second page

db.inventory.aggregate(  [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.75}} , { "$and" : [ { "score" : { "$eq" : 0.75}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-04-27T09:45:35Z")}},{ "$and" : [ { "updDate": { "$eq" : ISODate("2014-04-27T09:45:35Z")}} , { "_id" : { "$lt" : 3}}]}]}]}]}},{$limit:  2 }]  )

{ "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
{ "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }

And the last page

db.inventory.aggregate(  [ { $match: { dept : {$in : ["food","kitchen"]} , "$text" : { "$language" : "en", "$search" : "green"} } }, { $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.6666666666666666}} , { "$and" : [ { "score" : { "$eq" : 0.6666666666666666}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-08-27T09:45:35Z")}} , { "$and" : [ { "updDate" : { "$eq" : ISODate("2014-08-27T09:45:35Z")}} , { "_id" : { "$lt" : 9}}]}]}]}]}}, {$limit:  2 }]  )


{ "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }

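Note how we sort the results by score, update date, and id, and how the second match stage paginates using the score, update date, and id of the last document of the previous page.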
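The nested `$or`/`$and` condition of the second match stage follows a regular pattern, so it can be generated from the last document of the previous page. The sketch below is plain JavaScript. `buildSeekMatch` and `SORT_KEYS` are hypothetical names, and it assumes every sort key is descending, as in the pipelines above. In the mongo shell the date would be an `ISODate(...)`; a `Date` is used here so the snippet also runs outside the shell.

```javascript
// Sort keys in priority order; all of them are sorted descending above.
const SORT_KEYS = ["score", "updDate", "_id"];

// Builds the $match stage that keeps only documents sorting strictly after
// `last`, the final document of the previous page.
function buildSeekMatch(last, keys = SORT_KEYS) {
  function condition(i) {
    const key = keys[i];
    // Strictly "later" in a descending sort means a smaller value.
    const strictlyAfter = { [key]: { $lt: last[key] } };
    if (i === keys.length - 1) return strictlyAfter;
    // Otherwise tie on this key and defer to the next key.
    const tie = { $and: [{ [key]: { $eq: last[key] } }, condition(i + 1)] };
    return { $or: [strictlyAfter, tie] };
  }
  return { $match: condition(0) };
}

// Seek stage for the second page, built from the last row of the first page.
const secondPageMatch = buildSeekMatch({
  score: 0.75,
  updDate: new Date("2014-04-27T09:45:35Z"),
  _id: 3,
});
```

The resulting stage has the same shape as the hand-written one in the second-page query, so it could be placed between the `$sort` and `$limit` stages in the same way.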

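The index was created taking into account that text queries cannot cover text index prefix fields, see issue https://jira.mongodb.org/browse/SERVER-13018, although I am not sure whether this applies to our case.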

Because the "executionStats" and "allPlansExecution" modes do not work with the aggregation framework, see https://jira.mongodb.org/browse/SERVER-19758, I do not know how MongoDB tries to resolve the query.
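As a partial workaround, the `aggregate` command accepts an `explain` flag, which should at least report how the pipeline is processed, though without the execution statistics mentioned above. The sketch below only builds the command document. `explainAggregateCommand` is a hypothetical helper, and the exact explain output depends on the server version.

```javascript
// Hypothetical helper: wraps a pipeline in an `aggregate` command document
// with `explain` set, so the server describes the plan instead of returning results.
function explainAggregateCommand(collection, pipeline) {
  return { aggregate: collection, pipeline: pipeline, explain: true };
}

// The first-page pipeline from the question, unchanged.
const firstPagePipeline = [
  { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $limit: 2 },
];

const explainCommand = explainAggregateCommand("inventory", firstPagePipeline);
// In the mongo shell this document would be passed to db.runCommand(explainCommand).
```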

Index intersection does not work with text search either, see https://jira.mongodb.org/browse/SERVER-3071 (resolved in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient-indexing-in-mongodb-26, where the author says:

As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.

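I have read sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) of https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf several times without reaching any clear conclusion.

So, from a text search perspective, what is the best indexing strategy for this collection?

One index for the first match stage and another one for the second (pagination) match stage?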

db.inventory.createIndex({description: "text", dept: -1})
db.inventory.createIndex({updDate: -1, _id: -1})

A compound index that covers the fields of both match stages?

db.inventory.createIndex({description: "text", dept: -1, updDate: -1, _id: -1})

None of the above?

Thanks
