Lucene scoring problem

Question

Mar*_*ech 4 lucene information-retrieval scoring

I have a problem with Lucene's scoring that I can't figure out. So far I have been able to write the following code to reproduce it.

package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays
                .asList(new String[] { "the rolling stones",
                        "rolling stones (karaoke)",
                        "the rolling stones tribute",
                        "rolling stones tribute band",
                        "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());

        } finally {
            w.close();
        }

        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);

        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}

The output I get from running it is:

finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186    the rolling stones
0.578186    rolling stones (karaoke)
0.578186    the rolling stones tribute
0.578186    rolling stones tribute band
0.578186    karaoke - the rolling stones

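I just can't understand why "the rolling stones" has the same relevance as "the rolling stones tribute". According to the Lucene documentation, the more tokens a field has, the smaller its normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".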

Any ideas?

Answer 1

Sha*_*ore 5

The length normalization factor is calculated as 1 / sqrt(numTerms) (you can see this in DefaultSimilarity).

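This result is not stored in the index directly. The value is first multiplied by the boost of the field in question, and the final result is then encoded in 8 bits, as described in Similarity.encodeNorm(). This is a lossy encoding, which means fine detail is lost. Your fields have only 3 or 4 tokens each, so their factors, 1/sqrt(3) ≈ 0.577 and 1/sqrt(4) = 0.5, are close enough to end up as the same encoded value, which is why all five documents get the same score.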

If you want to see length normalization in action, try creating a document with the following sentence.

the rolling stones tribute a b c d e f g h i j k

This produces a difference in the length normalization value that is large enough for you to see.
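To make the lossy step concrete, here is a minimal standalone sketch of an 8-bit float encoding modeled on the "315" scheme in Lucene's SmallFloat class. It is an illustration written outside Lucene, not the library's code: the class name NormEncodingSketch and its helper methods are made up for this example, and a field boost of 1.0 is assumed.

```java
// Illustrative sketch only: an 8-bit float encoding modeled on Lucene's
// SmallFloat "315" scheme. Class and method names are hypothetical.
class NormEncodingSketch {

    // Squeeze a float into one byte by keeping only its top bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;          // drop the 21 low-order mantissa bits
        int base = (63 - 15) << 3;       // offset that maps the usable range into a byte
        if (small <= base) {
            // zero or negative input becomes 0; tiny positive input becomes 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= base + 0x100) {
            return (byte) -1;            // too large: clamp to the biggest byte value
        }
        return (byte) (small - base);
    }

    // Expand the byte back into the float that scoring would see.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << 21) + ((63 - 15) << 24);
        return Float.intBitsToFloat(bits);
    }

    // The default formula quoted in the answer, with an assumed boost of 1.0.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // 3 and 4 tokens match the fields in the question; 15 matches the longer sentence.
        for (int n : new int[] {3, 4, 15}) {
            float raw = lengthNorm(n);
            float stored = decode(encode(raw));
            System.out.println(n + " tokens: raw=" + raw + " stored=" + stored);
        }
    }
}
```

Under these assumptions, the factors for 3 and 4 tokens both come back as 0.5 after the round trip, while the 15-token sentence comes back as 0.25, so only the longer document ends up with a different norm.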

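Now, if your fields have very few tokens, as in the examples you used, you can set a boost on the document or field based on your own formula, essentially giving short fields a higher boost. Alternatively, you can create a custom Similarity and override the lengthNorm() method.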
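Before committing to a custom formula, it may help to check that it survives the 8-bit encoding. The sketch below is a standalone approximation of that check: the encoder is modeled on the SmallFloat "315" scheme rather than taken from Lucene, the class LengthNormCandidateCheck is hypothetical, and the 1/numTerms candidate is only an example of a steeper formula, not a recommendation.

```java
import java.util.function.IntToDoubleFunction;

// Illustrative sketch only: tests whether a length-norm formula still tells two
// field lengths apart after a lossy 8-bit encoding. All names are hypothetical.
class LengthNormCandidateCheck {

    // Formula quoted in the answer.
    static final IntToDoubleFunction DEFAULT_FORMULA = n -> 1.0 / Math.sqrt(n);
    // Example candidate that falls off faster for short fields (an assumption).
    static final IntToDoubleFunction CANDIDATE_FORMULA = n -> 1.0 / n;

    // 8-bit encoding modeled on Lucene's SmallFloat "315" scheme.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int base = (63 - 15) << 3;
        if (small <= base) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= base + 0x100) {
            return (byte) -1;
        }
        return (byte) (small - base);
    }

    // Byte that would be stored for a field with numTerms tokens and boost 1.0.
    static int encodedNorm(IntToDoubleFunction formula, int numTerms) {
        return encode((float) formula.applyAsDouble(numTerms)) & 0xff;
    }

    // True if the two lengths still get different norms after encoding.
    static boolean separates(IntToDoubleFunction formula, int a, int b) {
        return encodedNorm(formula, a) != encodedNorm(formula, b);
    }

    public static void main(String[] args) {
        System.out.println("1/sqrt(n) separates 3 and 4 tokens: "
                + separates(DEFAULT_FORMULA, 3, 4));
        System.out.println("1/n separates 3 and 4 tokens: "
                + separates(CANDIDATE_FORMULA, 3, 4));
    }
}
```

In this sketch the default formula does not separate 3 from 4 tokens, while the 1/n candidate does. A formula that passes such a check would then go into the lengthNorm() override of your Similarity subclass. Because the norm is computed when a document is indexed, the documents would need to be re-indexed for the change to take effect.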
