Pav*_*vel 39 group-by facet faceted-search elasticsearch
The only close thing I have found is: Multiple group-by in Elasticsearch
Basically, I am trying to get the ES equivalent of the following MySQL query:
select gender, age_range, count(distinct profile_id) as count
FROM TABLE group by age_range, gender
Age and gender on their own are easy to get:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ages": {
      "terms": {
        "field": "age_range",
        "size": 20
      }
    },
    "gender_by_age": {
      "terms": {
        "fields": [
          "age_range",
          "gender"
        ]
      }
    }
  },
  "size": 0
}
This gives:
{
  "ages": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      { "term": 0, "count": 162643 },
      { "term": 3, "count": 10683 },
      { "term": 4, "count": 8931 },
      { "term": 5, "count": 4690 },
      { "term": 6, "count": 3647 },
      { "term": 2, "count": 3247 },
      { "term": 1, "count": 120 }
    ]
  },
  "total_gender": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      { "term": 1, "count": 94799 },
      { "term": 2, "count": 62645 },
      { "term": 0, "count": 36517 }
    ]
  }
}
But now I need something that looks like this:
[breakdown_gender] => Array
    (
        [1] => Array
            (
                [0] => 264
                [1] => 1
                [2] => 6
                [3] => 67
                [4] => 72
                [5] => 40
                [6] => 23
            )
        [2] => Array
            (
                [0] => 153
                [2] => 2
                [3] => 21
                [4] => 35
                [5] => 22
                [6] => 11
            )
    )
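Please note that the numeric keys are "mappings" for the age ranges, so they actually mean something :) and are not just numbers. For example, gender [1] (which is "male") breaks down into age range [0] (which is "under 18") with a count of 246.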
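To make the target shape concrete, the same grouping can be sketched in plain Python over in-memory rows. The sample rows and the name `breakdown_by_gender` are made up for illustration; only the shape of the result matters here.

```python
from collections import defaultdict

# Hypothetical sample rows: (profile_id, gender, age_range).
rows = [
    (1, 1, 0),
    (1, 1, 0),  # same profile again, must only be counted once
    (2, 1, 3),
    (3, 2, 0),
    (4, 2, 0),
]


def breakdown_by_gender(rows):
    """Return {gender: {age_range: number of distinct profile ids}}."""
    seen = defaultdict(set)
    for profile_id, gender, age_range in rows:
        seen[(gender, age_range)].add(profile_id)

    result = defaultdict(dict)
    for (gender, age_range), profile_ids in seen.items():
        result[gender][age_range] = len(profile_ids)
    return {gender: dict(ages) for gender, ages in result.items()}


breakdown = breakdown_by_gender(rows)
```

This mirrors `count(distinct profile_id) ... group by age_range, gender`: one counter per (gender, age_range) pair, nested by gender.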
Joe*_*Joe 77
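As of version 1.0 of Elasticsearch, the new aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by fields field1, field2 and field3: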
{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        }
      }
    }
  }
}
Of course, this can go on for as many fields as you like.
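Applied to the fields from the question, this could look roughly like the sketch below. The aggregation names (`gender`, `age_range`, `profiles`) and the helper names are my own assumptions, and the sample response is hand-written, not real output. The `cardinality` sub-aggregation is one way to approximate `count(distinct profile_id)`; it gives an approximate count and, as far as I know, is only available in releases after 1.0, so if it is missing you can drop it and read `doc_count` instead.

```python
def build_gender_age_query():
    """Nested terms aggregations: gender first, then age_range."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {
            "gender": {
                "terms": {"field": "gender"},
                "aggs": {
                    "age_range": {
                        "terms": {"field": "age_range", "size": 20},
                        "aggs": {
                            # approximate count(distinct profile_id)
                            "profiles": {"cardinality": {"field": "profile_id"}}
                        },
                    }
                },
            }
        },
    }


def pivot(response):
    """Turn the nested buckets into {gender: {age_range: count}}."""
    breakdown = {}
    for gender_bucket in response["aggregations"]["gender"]["buckets"]:
        ages = {}
        for age_bucket in gender_bucket["age_range"]["buckets"]:
            ages[age_bucket["key"]] = age_bucket["profiles"]["value"]
        breakdown[gender_bucket["key"]] = ages
    return breakdown


# Hand-written sample in the shape of an aggregations response.
sample_response = {"aggregations": {"gender": {"buckets": [
    {"key": 1, "doc_count": 5, "age_range": {"buckets": [
        {"key": 0, "doc_count": 3, "profiles": {"value": 2}},
        {"key": 3, "doc_count": 2, "profiles": {"value": 2}},
    ]}},
    {"key": 2, "doc_count": 1, "age_range": {"buckets": [
        {"key": 0, "doc_count": 1, "profiles": {"value": 1}},
    ]}},
]}}}

breakdown = pivot(sample_response)
```

The result is keyed by gender and then by age range, which matches the `breakdown_gender` structure the question asks for.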
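Update:
For completeness, here is what the output of the above query looks like. Further down is Python code that generates the aggregation query and flattens the result into a list of dictionaries.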
{
  "aggregations": {
    "agg1": {
      "buckets": [{
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      },
      {
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      }, ...
      ]
    }
  }
}
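The following Python code performs the group-by given a list of fields. With include_missing=True, it also includes combinations of values where some of the fields are missing (you should not need this on Elasticsearch 2.0, where the terms aggregation can handle missing values itself).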
def group_by(es, fields, include_missing):
    """Group by `fields` using nested terms aggregations.

    `es` is a client object exposing search(body=...); `fields` is a
    non-empty list of field names, outermost grouping first.
    """
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # A sibling "missing" aggregation catches documents without the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            # The "missing" branch gets the same sub-aggregations as the
            # terms branch, so every combination is covered.
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    """Flatten nested buckets into a list of {field: value, ..., 'doc_count': n}."""
    current_field = fields[0]
    # Copy the list so the response itself is not modified.
    buckets = list(agg_result[current_field]['buckets'])
    if include_missing:
        # The "missing" result has no 'key', so its value comes out as None.
        buckets.append(agg_result[current_field + '_missing'])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
mol*_*are 19
As you only have 2 fields, a simple way is to run two queries with a single facet each. For Male:
{
  "query" : {
    "term" : { "gender" : "Male" }
  },
  "facets" : {
    "age_range" : {
      "terms" : {
        "field" : "age_range"
      }
    }
  }
}
For Female:
{
  "query" : {
    "term" : { "gender" : "Female" }
  },
  "facets" : {
    "age_range" : {
      "terms" : {
        "field" : "age_range"
      }
    }
  }
}
Or you can do it all in a single query using a facet filter (see the facet filter documentation for further information):
{
  "query" : {
    "match_all": {}
  },
  "facets" : {
    "age_range_male" : {
      "terms" : {
        "field" : "age_range"
      },
      "facet_filter": {
        "term": {
          "gender": "Male"
        }
      }
    },
    "age_range_female" : {
      "terms" : {
        "field" : "age_range"
      },
      "facet_filter": {
        "term": {
          "gender": "Female"
        }
      }
    }
  }
}
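To get from that facet response to a per-gender breakdown, something along these lines might do. It is only a sketch: the helper name `facets_to_breakdown` and the sample data are made up, and it assumes each facet comes back with a "terms" list of {"term", "count"} entries, as in the output shown in the question.

```python
def facets_to_breakdown(facets, prefix="age_range_"):
    """Map {"age_range_male": {"terms": [...]}} to {"male": {term: count}}."""
    breakdown = {}
    for name, facet in facets.items():
        if not name.startswith(prefix):
            continue  # ignore facets that are not part of the breakdown
        label = name[len(prefix):]
        breakdown[label] = {t["term"]: t["count"] for t in facet["terms"]}
    return breakdown


# Hand-written sample in the shape of a terms facet response.
sample_facets = {
    "age_range_male": {
        "_type": "terms", "missing": 0, "total": 3, "other": 0,
        "terms": [{"term": 0, "count": 2}, {"term": 3, "count": 1}],
    },
    "age_range_female": {
        "_type": "terms", "missing": 0, "total": 1, "other": 0,
        "terms": [{"term": 0, "count": 1}],
    },
}

breakdown = facets_to_breakdown(sample_facets)
```

Note that these counts are document counts per term, not distinct profile ids.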
Update:
As facets are about to be removed, here is the solution with aggregations:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "male": {
      "filter": {
        "term": {
          "gender": "Male"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    },
    "female": {
      "filter": {
        "term": {
          "gender": "Female"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    }
  }
}
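If there are more than two values to filter on, the request body above can be generated rather than written by hand. A minimal sketch, assuming the same field names as above; `build_filter_aggs` and `read_filter_aggs` are hypothetical helpers, and the sample response is hand-written in the shape a filter aggregation returns.

```python
def build_filter_aggs(field, values, sub_field):
    """One filter aggregation per value, each with a terms sub-aggregation."""
    aggs = {}
    for value in values:
        # Aggregation name is the lower-cased value, e.g. "Male" -> "male".
        aggs[str(value).lower()] = {
            "filter": {"term": {field: value}},
            "aggs": {sub_field: {"terms": {"field": sub_field}}},
        }
    return {"query": {"match_all": {}}, "aggs": aggs}


def read_filter_aggs(response, sub_field):
    """Return {aggregation name: {bucket key: doc_count}}."""
    return {
        name: {b["key"]: b["doc_count"] for b in agg[sub_field]["buckets"]}
        for name, agg in response["aggregations"].items()
    }


body = build_filter_aggs("gender", ["Male", "Female"], "age_range")

# Hand-written sample response, not real output.
sample_response = {"aggregations": {
    "male": {"doc_count": 3, "age_range": {"buckets": [
        {"key": 0, "doc_count": 2}, {"key": 3, "doc_count": 1}]}},
    "female": {"doc_count": 1, "age_range": {"buckets": [
        {"key": 0, "doc_count": 1}]}},
}}

breakdown = read_filter_aggs(sample_response, "age_range")
```

The generated `body` has the same structure as the query above, and `breakdown` ends up keyed by gender name and then by age range.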