aja*_*jay 16 regex autocomplete mongodb
I have a MongoDB collection of documents of the form
{
"id": 42,
"title": "candy can",
"description": "canada candy canteen",
"brand": "cannister candid",
"manufacturer": "candle canvas"
}
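I need to implement autocomplete by matching the input search term against all fields except id. For example, if the input term is can, I should return all matching words in the documents, as in
{ hints: ["candy", "can", "canada", "canteen", ...] }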
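I looked at this question, but it didn't help. I also tried searching for how to run a regex search across multiple fields and extract the matching tokens, or how to extract the matching tokens from a MongoDB text search, but couldn't find any help.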
Mar*_*erg 29
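There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (running the mapReduce below inline instead of writing its output to a collection), but except for very small databases it is not possible to do this in real time.
As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportionate amount of RAM for this one feature. If you don't index all fields, a regex search causes a collection scan, which means every document has to be loaded from disk, and that takes too long for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would put considerable load on the backend.
The problem is quite similar to one I have already answered: we need to extract every word from multiple fields, remove the stop words, and save the remaining words in a collection together with a link to the document(s) each word was found in. Then, to get an autocompletion list, we simply query the indexed word list.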
db.yourCollection.mapReduce(
  // Map function: emit one (word, {documents: [_id]}) pair per word
  function() {
    // Keep a reference to the current document, because "this" is
    // rebound inside the forEach callback below
    var document = this;
    // You need to expand this according to your needs
    var stopwords = ["the", "this", "and", "or"];
    for (var prop in document) {
      // We are only interested in strings and explicitly not in _id
      if (prop === "_id" || typeof document[prop] !== "string") {
        continue;
      }
      document[prop].split(" ").forEach(function(word) {
        // You might want to adjust this to your needs
        var cleaned = word.replace(/[;,.]/g, "");
        if (
          // We want neither empty strings nor stopwords...
          cleaned === "" ||
          stopwords.indexOf(cleaned) > -1 ||
          // ...nor strings which would evaluate to numbers
          !isNaN(parseInt(cleaned)) ||
          !isNaN(parseFloat(cleaned))
        ) {
          return;
        }
        // Emit the same shape that reduce returns: MongoDB may feed the
        // output of reduce back into reduce for the same key
        emit(cleaned, { documents: [document._id] });
      });
    }
  },
  // Reduce function: merge the document lists and drop duplicates
  function(key, values) {
    var result = { documents: [] };
    var seen = {};
    values.forEach(function(value) {
      value.documents.forEach(function(id) {
        // indexOf only works with primitives, so we compare the ids
        // by the string form of their valueOf()
        var idKey = String(id.valueOf());
        if (!seen[idKey]) {
          seen[idKey] = true;
          result.documents.push(id);
        }
      });
    });
    return result;
  },
  {
    // Our results are written to a collection called "words"
    out: "words"
  }
)
Running this mapReduce against your example would result in db.words looking like this:
{ "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
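To see how that word list comes about without a MongoDB instance, the tokenize-and-group logic can be sketched in plain JavaScript. This is only an illustration under assumptions: buildWordIndex is a hypothetical helper, and the string "doc1" stands in for the ObjectId of the example document.

```javascript
// In-memory sketch of the indexing step. buildWordIndex and the sample
// _id "doc1" are hypothetical names used for illustration only.
function buildWordIndex(docs, stopwords) {
  const index = new Map(); // word -> Set of document _ids
  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      // Same filter as the map function: strings only, never _id
      if (field === "_id" || typeof value !== "string") continue;
      for (const raw of value.split(" ")) {
        const word = raw.replace(/[;,.]/g, "");
        if (word === "" || stopwords.includes(word)) continue;
        // Skip tokens that would evaluate to numbers
        if (!Number.isNaN(parseInt(word)) || !Number.isNaN(parseFloat(word))) continue;
        if (!index.has(word)) index.set(word, new Set());
        // A Set keeps each document id only once per word
        index.get(word).add(doc._id);
      }
    }
  }
  return index;
}

const sample = [{
  _id: "doc1",
  id: 42,
  title: "candy can",
  description: "canada candy canteen",
  brand: "cannister candid",
  manufacturer: "candle canvas"
}];

const wordIndex = buildWordIndex(sample, ["the", "this", "and", "or"]);
const words = [...wordIndex.keys()].sort();
console.log(words);
```

The sorted keys correspond to the _id values of the words collection, and the numeric id field is skipped because it is not a string.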
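Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indexes in RAM, we can use a few tricks to both speed up autocompletion and reduce the load on the server.
For autocompletion, we only need the words, not the links to the documents. Since the words are indexed, we use a covered query: a query that is answered from the index alone, which usually resides in RAM.
To stick with your example, we would use the following query to get the candidates for autocompletion: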
db.words.find({_id:/^can/},{_id:1})
which gives us the result
{ "_id" : "can" }
{ "_id" : "canada" }
{ "_id" : "candid" }
{ "_id" : "candle" }
{ "_id" : "candy" }
{ "_id" : "cannister" }
{ "_id" : "canteen" }
{ "_id" : "canvas" }
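In a real application the prefix comes from user input, so it is probably worth escaping regex metacharacters before building the expression. The sketch below assumes a hypothetical helper, prefixFilter, and keeps the expression anchored and case-sensitive, which is the form that lets MongoDB serve a regex from an index.

```javascript
// Hypothetical helper: turns raw user input into a prefix filter on _id.
function prefixFilter(term) {
  // Escape regex metacharacters so input such as "a.b" is matched literally
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Anchored at the start and case-sensitive
  return { _id: new RegExp("^" + escaped) };
}

const filter = prefixFilter("can");
// In the mongo shell this would be used as: db.words.find(filter, { _id: 1 })

// Local check of the matching behaviour against a few made-up words
const candidates = ["can", "canada", "toucan", "Canada"].filter(
  (word) => filter._id.test(word)
);
console.log(candidates);
```

Because the match is case-sensitive, words would have to be lowercased both at indexing time and at query time if case-insensitive suggestions are wanted.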
Using the .explain() method, we can verify that this query uses only the index.
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 8,
"nscannedObjects" : 0,
"nscanned" : 8,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 8,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
"can",
"cao"
],
[
/^can/,
/^can/
]
]
},
"server" : "32a63f87666f:27017",
"filterSet" : false
}
Note the indexOnly:true field.
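The indexBounds entry in the explain output shows why the anchored regex is cheap: the prefix is turned into the range from "can" (inclusive) to "cao" (exclusive). The following is a simplified illustration of that idea, not MongoDB's actual implementation; prefixBounds is a hypothetical helper and assumes a non-empty prefix whose last character is not the highest possible code unit.

```javascript
// Simplified illustration: derive a [lower, upper) range for a prefix by
// incrementing the code of its last character.
function prefixBounds(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return [prefix, upper];
}

const [lower, upper] = prefixBounds("can");
console.log(lower, upper);

// Every word with the prefix sorts inside the range, everything else outside
const inRange = ["camp", "can", "canvas", "cao", "cap"].filter(
  (word) => word >= lower && word < upper
);
console.log(inRange);
```

Only the index entries inside that range have to be visited, which matches nscanned being 8 in the output above.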
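Although we have to run two queries to get to the actual document, the overall process is fast enough that the user experience should be fine.
First, get the document from the words collection. When the user selects an autocompletion suggestion, we query the complete document from words to find the documents the chosen word originated from.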
db.words.find({_id:"canteen"})
This yields a document like this:
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
With that document, we can now either show a page with search results or, as in this case, redirect to the actual document, which you can get with:
db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
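A word will often occur in more than one source document, in which case the second query needs to fetch all of them. A minimal sketch, assuming a hypothetical helper named sourceFilter and using plain strings in place of ObjectId values so that it runs outside the shell:

```javascript
// Hypothetical helper: builds the filter for the second query from a
// document of the words collection.
function sourceFilter(wordDoc) {
  // Fall back to an empty list if the word was not found
  const ids = (wordDoc && wordDoc.value && wordDoc.value.documents) || [];
  return { _id: { $in: ids } };
}

// Stand-in for the result of db.words.findOne({ _id: "canteen" });
// the ids are made-up strings, not real ObjectId values
const wordDoc = { _id: "canteen", value: { documents: ["id-a", "id-b"] } };

const secondQuery = sourceFilter(wordDoc);
// In the mongo shell this would be used as: db.yourCollection.find(secondQuery)
console.log(JSON.stringify(secondQuery));
```

With a single id in the list this is equivalent to the direct lookup shown above, and with an unknown word the empty $in list matches nothing.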
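While this approach may look complicated at first (well, the mapReduce is, a bit), it is actually pretty simple conceptually. Basically, you are trading real-time results (which you won't get anyway unless you spend a lot of RAM) for speed. IMHO, that's a good deal. To make the rather costly mapReduce phase more efficient, implementing an incremental mapReduce could be one approach; improving my admittedly hacky mapReduce might well be another.
Last but not least, this way is a rather ugly hack altogether. You might want to dig into Elasticsearch or Lucene. Those products are much better suited to what you want.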