Koz*_*try 4 regex lucene elasticsearch
我在理解 ElasticSearch 中的正则表达式机制时遇到了麻烦。我有代表财产单位的文件:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
Run Code Online (Sandbox Code Playgroud)
DailyAvailability 字段代码从今天开始的未来两年内按天数计算的财产可用性。'A' 表示可用,'U' 不可用,'I' 可以签到,'O' 可以签出。如何编写正则表达式过滤器以获取特定日期可用的所有单位?
我试图在 DailyAvailability 字段中找到具有特定长度和偏移量的“A”子字符串。例如,要查找从今天起 7 天内可使用 7 天的单位:
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
Run Code Online (Sandbox Code Playgroud)
此查询返回例如具有 DateAvailability 的单位,该单位从“UUUUUUUUUUUUUUUUUUUIAA”开始,但在字段内的某处包含合适的序列。如何为整个源字符串锚定正则表达式?ES 文档说默认情况下应该锚定 lucene regex。
PS我试过了'^.{7}a{7}.*$'
。返回空集。
看起来您正在使用text
数据类型来存储Unit.DailyAvailability
(如果您使用动态映射,这也是字符串的默认类型)。您应该考虑改用keyword
数据类型。
让我更详细地解释一下。
text
字段中间的某些内容?text
数据类型发生的事情是数据被分析以进行全文搜索。它进行一些转换,例如小写和拆分为标记。
让我们尝试对您的输入使用分析 API:
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
Run Code Online (Sandbox Code Playgroud)
回应是:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Run Code Online (Sandbox Code Playgroud)
如您所见,Elasticsearch 已将您的输入拆分为三个标记并将它们小写。这看起来出乎意料,但如果您认为它实际上试图促进在人类语言中搜索单词,这是有道理的 - 没有这么长的单词。
这就是为什么现在regexp
查询".{7}a{7}.*"
将匹配:有一个标记,实际上有很多开始a
的,这是一种预期行为的regexp
查询。
...Elasticsearch 会将 regexp 应用于标记器为该字段生成的术语,而不是该字段的原始文本。
regexp
查询考虑整个字符串?很简单:不要应用分析器。该类型keyword
按原样存储您提供的字符串。
使用这样的映射:
PUT my_regexes
{
"mappings": {
"doc": {
"properties": {
"Unit": {
"properties": {
"DailyAvailablity": {
"type": "keyword"
}
}
}
}
}
}
}
Run Code Online (Sandbox Code Playgroud)
您将能够执行这样的查询以匹配帖子中的文档:
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
Run Code Online (Sandbox Code Playgroud)
请注意,查询变得区分大小写,因为未分析该字段。
这regexp
将不再返回任何结果:".{12}a{7}.*"
这会: ".{12}A{7}.*"
正则表达式被锚定:
Lucene 的模式总是固定的。提供的模式必须匹配整个字符串。
看起来锚定错误的原因很可能是因为令牌在分析的text
字段中被拆分。
归档时间: |
|
查看次数: |
4718 次 |
最近记录: |