Yer*_*yan 17 fuzzy-search fuzzy-logic fuzzy-comparison elasticsearch
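I'm using fuzzy matching in my project, mainly to find misspellings and alternative spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title, min_similarity and max_expansions.
As far as I understand, min_similarity is the percentage by which the query string must match a string in the database. I couldn't find an exact description of how this value is calculated.
As far as I understand, max_expansions is the Levenshtein distance within which the search should be performed. If it really were the Levenshtein distance, it would be the ideal solution for me. In any case, it doesn't work that way. For example, I have the word "Samvel":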
queryStr     max_expansions            matches?
samvel       0                         Should not be 0. error (but levenshtein distance can be 0!)
samvel       1                         Yes
samvvel      1                         Yes
samvvell     1                         Yes (but it shouldn't have)
samvelll     1                         Yes (but it shouldn't have)
saamvelll    1                         No (but for some weird reason it matches with Samvelian)
saamvelll    anything bigger than 1    No
The documentation says something I don't really understand:
Add max_expansions to the fuzzy query allowing to control the maximum number
of terms to match. Default to unbounded (or bounded by the max clause count in
boolean query).
So could anyone please explain to me exactly how these parameters affect the search results?
DrT*_*ech 23
min_similarity is a value between 0 and 1. From the Lucene docs:
For example, for a minimumSimilarity of 0.5 a term of the same length
as the query term is considered similar to the query term if the edit
distance between both terms is less than length(term)*0.5
The "edit distance" referred to here is the Levenshtein distance.
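To make the quoted rule concrete, here is a minimal Python sketch of the similarity check. It is not Lucene's code: the `levenshtein`, `similarity` and `is_match` helpers are hypothetical names, and normalising by the shorter of the two lengths is an assumption chosen so that, for terms of equal length, the check reduces to "edit distance < length(term) * min_similarity" as in the quote.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    # Assumption: normalise by the shorter length, so equal-length terms
    # follow the rule quoted from the Lucene docs.
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def is_match(query: str, term: str, min_similarity: float = 0.5) -> bool:
    # Strictly greater than the threshold (also an assumption of this sketch).
    return similarity(query, term) > min_similarity


if __name__ == "__main__":
    indexed = "samvel"  # assumes the analyzer lowercased "Samvel"
    for q in ["samvel", "samvvel", "samvvell", "samvelll", "saamvelll"]:
        print(q, levenshtein(q, indexed),
              round(similarity(q, indexed), 3), is_match(q, indexed))
```

Under these assumptions, samvvell is at distance 2 from samvel, scores 1 - 2/6 ≈ 0.67 and passes a 0.5 threshold, while saamvelll is at distance 3, scores exactly 0.5 and fails. That is consistent with the table in the question, but the exact formula in any given Elasticsearch version should be checked against its documentation.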
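Internally, the query works like this: it first finds all terms that exist in the index and could match the search term, taking min_similarity into account, and then it searches for all of those terms.
You can imagine how heavy this query could be!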
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
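The effect of max_expansions can be illustrated with a small simulation of the expansion step. This is a sketch under stated assumptions, not Elasticsearch's implementation: `expand` and the sample term list are hypothetical, the similarity is taken as 1 - distance / shorter length, and the cap is assumed to keep the best-scoring candidates.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via a rolling dynamic-programming row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1,
                               row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def expand(query, index_terms, min_similarity=0.5, max_expansions=None):
    """Return the index terms a fuzzy query would be rewritten into."""
    scored = []
    for term in index_terms:
        shortest = min(len(query), len(term))
        if shortest == 0:
            continue
        sim = 1.0 - levenshtein(query, term) / shortest  # assumed formula
        if sim > min_similarity:
            scored.append((sim, term))
    # Best candidates first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    if max_expansions is not None:
        scored = scored[:max_expansions]  # the cap limits terms, not distance
    return [term for _, term in scored]


if __name__ == "__main__":
    # Hypothetical term dictionary for a "name" field.
    terms = ["samvel", "samvell", "samvelian", "samuel", "daniel"]
    print(expand("samvel", terms))                    # unbounded expansion
    print(expand("samvel", terms, max_expansions=1))  # only one term searched
```

The point of the sketch is that max_expansions caps how many candidate terms survive the expansion. It says nothing about edit distance, which is why a value of 1 can still return terms two edits away from the query.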