LYu*_*LYu 26 email lucene tokenize analyzer elasticsearch
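I could not find a perfect solution for the following situation, either on Google or in the ES documentation, and I hope someone here can help.
Suppose five email addresses are stored under the "email" field: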
1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}
I want to support the following search scenarios:
[Search -> Receive]
"john.doe@gmail.com" -> 1,2
"john.doe@outlook.com" -> 2,4
"john@yahoo.com" -> 5
"john.doe" -> 1,2,3,4
"john" -> 1,2,3,4,5
"gmail.com" -> 1,2
"outlook.com" -> 2,3,4
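The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters. I also experimented with the conditions of the match query, but did not find an ideal solution. Any thoughts are welcome, and there are no restrictions on the mappings, analyzers, or kind of query to use. Thanks.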
And*_*fan 37
Mapping:
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
Test data:
POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}
Query to use:
GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "john.doe@gmail.com"
    }
  }
}
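To see why a `term` query against this analyzer covers the scenarios in the question, here is a rough Python simulation of the token expansion. It is only a sketch, not Elasticsearch itself. The assumptions: `uax_url_email` is approximated by splitting on commas and whitespace, `\p{L}` is approximated by `[^\W\d_]` because Python's `re` module has no `\p{L}`, and the names `analyze`, `term_search`, `match_search` and `DOCS` are made up for illustration.

```python
import re

# The five capture patterns from the mapping. \p{L} is approximated
# with [^\W\d_] (any Unicode letter) since Python's re lacks \p{L}.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]

DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def analyze(text):
    """Approximate the 'email' analyzer: one token per address, plus every
    capture group of every pattern match, lowercased and de-duplicated."""
    tokens = set()
    # Assumption: splitting on commas/whitespace stands in for uax_url_email.
    for raw in re.split(r"[,\s]+", text.strip()):
        if not raw:
            continue
        tokens.add(raw.lower())  # preserve_original keeps the full address
        for pattern in PATTERNS:
            for match in re.finditer(pattern, raw):
                if match.group(1):
                    tokens.add(match.group(1).lower())
    return tokens


def term_search(term):
    """A term query does not analyze its input: exact token lookup."""
    return sorted(i for i, text in DOCS.items() if term in analyze(text))


def match_search(text):
    """A match query analyzes its input first; with the default OR
    operator, any shared token is enough to hit a document."""
    wanted = analyze(text)
    return sorted(i for i, doc in DOCS.items() if wanted & analyze(doc))


if __name__ == "__main__":
    for term in ["john.doe@gmail.com", "john.doe@outlook.com", "john@yahoo.com",
                 "john.doe", "john", "gmail.com", "outlook.com"]:
        print(term, "->", term_search(term))
```

In this simulation, running the same text through `match_search` hits every document, because the query text is itself expanded into tokens such as `john` and `com`. That is the reason for using `term` here: the search text is looked up as-is, so it should be lowercased before sending.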