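I have a large dataset of relatively short documents, each containing a given name, family name, tenantId, geographic location, and a set of skills.
We have roughly 7 million records spread across three nodes, and when searching for terms with a fair number of matches, things are unbearably slow (on the order of ten seconds). We typically sort the result set alphabetically by name, chronologically by created date, or by term relevance. We also require term highlighting and a result count. We communicate with ES through the REST API.
I have read that sorting can be a major bottleneck for search performance; what strategies work in production for handling this kind of requirement?
I am using a mapping similar to the following: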
"candidate": {
  "dynamic": "true",
  "properties": {
    "accountId": {
      "type": "string",
      "store": "true",
      "index": "not_analyzed"
    },
    "tenant": {
      "type": "string",
      "store": "true",
      "index": "not_analyzed"
    },
    "givenName": {
      "type": "string",
      "store": "true",
      "index": "analyzed",
      "analyzer": "sortable",
      "term_vector": "with_positions_offsets"
    },
    ...
    "locations": {
      "properties": {
        "name": {
          "type": "string",
          "store": "true",
          "index": "analyzed",
          "term_vector": "with_positions_offsets"
        },
        "point": {
          "type": "geo_point",
          "store": "true",
          "lat_lon": "true"
        }
      }
    },
    "skills": {
      "type": "string",
      "store": "true",
      "index": "analyzed",
      "term_vector": "with_positions_offsets"
    },
    "createdDate": {
      "type": "long",
      "store": "true",
      "index": "not_analyzed"
    },
    "updatedDate": {
      "type": "long",
      "store": "true",
      "index": "not_analyzed"
    }
  }
}
The query is structured as follows:
{
  "from": 0,
  "size": 40,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [ {
            "multi_match": {
              "query": "query text",
              "fields": [ "givenName", "familyName", "email", "locations.name", "skills" ],
              "type": "cross_fields"
            }
          }, {
            "prefix": {
              "email": {
                "prefix": "query text"
              }
            }
          } ]
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "must": {
        "geo_polygon": {
          "point": {
            "points": [ [ -75.06681499999999, 40.536544 ],
              ... many more long/lat points ...
              [ -75.06681499999999, 40.536544 ] ]
          }
        }
      }
    }
  },
  "sort": [ {
    "createdDate": {
      "order": "asc"
    }
  } ],
  "highlight": {
    "fields": {
      "givenName": { },
      "familyName": { },
      "email": { },
      "locations.name": { },
      "skills": { }
    }
  }
}
Is there some kind of range-based query approach that others have found helpful for handling similar sort/search requirements?
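To make "range-based" concrete, here is a minimal sketch of what such a request could look like, written in Python only to build the JSON body. Several things are assumptions rather than facts from our setup: the helper name `build_windowed_query` and the window bounds are hypothetical, `createdDate` is assumed to hold epoch milliseconds (the mapping only says `long`), and the `filtered` query assumes an Elasticsearch 1.x cluster, which the `string` / `not_analyzed` mapping syntax suggests. The prefix clause, geo filter and highlighting are left out for brevity.

```python
import json
from typing import Any, Dict


def build_windowed_query(text: str, created_from_ms: int, created_to_ms: int,
                         size: int = 40) -> Dict[str, Any]:
    """Build a search body limited to createdDate in [created_from_ms, created_to_ms).

    Hypothetical helper: it only assembles the request body and does not
    contact Elasticsearch.
    """
    return {
        "from": 0,
        "size": size,
        "query": {
            # "filtered" is the ES 1.x way to combine a query with a filter
            # (assumption: the cluster is 1.x).
            "filtered": {
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": ["givenName", "familyName", "email",
                                   "locations.name", "skills"],
                        "type": "cross_fields",
                    }
                },
                "filter": {
                    # Assumption: createdDate stores epoch milliseconds.
                    "range": {
                        "createdDate": {"gte": created_from_ms,
                                        "lt": created_to_ms}
                    }
                },
            }
        },
        "sort": [{"createdDate": {"order": "asc"}}],
    }


if __name__ == "__main__":
    # Placeholder window bounds, chosen only for illustration.
    body = build_windowed_query("query text", 1400000000000, 1402592000000)
    print(json.dumps(body, indent=2))
```

The idea would be to page through results one `createdDate` window at a time instead of sorting every match at once. Whether that actually reduces the sorting cost in practice is part of what I am asking.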