Lucene SpanNearQuery partial match

Fra*_*See 1 html lucene proximity match partial

Given a document {'foo', 'bar', 'baz'}, I would like to match it using a SpanNearQuery with the tokens {'baz', 'extra'}.

But this fails.

How can I work around this?

Sample test (using Lucene 2.9.1), with the following results:

  • givenSingleMatch - PASS
  • givenTwoMatches - PASS
  • givenThreeMatches - PASS
  • givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

dan*_*ben 6

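SpanNearQuery lets you find terms that occur within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery: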

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

Image: http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png

In this sample text, Lucene is within 3 positions of Doug.

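But for your example, the only match I can see is that both your query and the target document contain "cd" (I am assuming all of these terms are in a single field). In that case, you don't need any special query type. With the standard mechanisms you will get some non-zero weighting, because both contain the same term in the same field.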

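Edit 3 - In response to the latest comment: you cannot use SpanNearQuery for anything other than what it is intended for, which is finding out whether multiple terms in a document occur within a certain number of positions of each other. I can't tell what your specific use case or expected results are (feel free to post them), but in the last case, if you only want to know whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.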
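To make the failing test concrete, here is a toy model of an unordered span-near match in plain Java. It is only a sketch of the semantics described above, not Lucene's implementation: the class SpanNearSketch and the method matchesUnorderedNear are hypothetical names, and the slop calculation (window width minus the number of query terms) is a simplification. The point it illustrates is that a term missing from the document rules out any span, regardless of the slop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model; it does not use or reproduce Lucene's classes.
final class SpanNearSketch {

    /**
     * Returns true only if every query term occurs in the document and some
     * choice of one position per term fits in a window that contains at most
     * `slop` positions not used by the match.
     */
    static boolean matchesUnorderedNear(List<String> docTokens, List<String> queryTerms, int slop) {
        List<List<Integer>> positions = new ArrayList<>();
        for (String term : queryTerms) {
            List<Integer> termPositions = new ArrayList<>();
            for (int i = 0; i < docTokens.size(); i++) {
                if (docTokens.get(i).equals(term)) {
                    termPositions.add(i);
                }
            }
            if (termPositions.isEmpty()) {
                return false; // a missing term means no span can exist at all
            }
            positions.add(termPositions);
        }
        return search(positions, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, queryTerms.size(), slop);
    }

    // Brute force: try every combination of one position per query term.
    private static boolean search(List<List<Integer>> positions, int idx,
                                  int min, int max, int termCount, int slop) {
        if (idx == positions.size()) {
            long gaps = (long) max - min + 1 - termCount;
            return gaps <= slop;
        }
        for (int p : positions.get(idx)) {
            if (search(positions, idx + 1, Math.min(min, p), Math.max(max, p), termCount, slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("foo", "bar", "baz");
        // All three terms are present, so a span exists.
        System.out.println(matchesUnorderedNear(doc, Arrays.asList("foo", "bar", "baz"), Integer.MAX_VALUE)); // true
        // "extra" never occurs, so even an unlimited slop cannot produce a match.
        System.out.println(matchesUnorderedNear(doc, Arrays.asList("baz", "extra"), Integer.MAX_VALUE)); // false
    }
}
```

This mirrors the four tests in the question: the first three pass because every term is in the document, and the fourth fails because "extra" is not.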

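Edit 4 - Now that you have posted your use case, I understand what you want to do. Here is how: use a BooleanQuery, as mentioned above, to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.

So the query, in text form, would look like: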

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

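(As an example, this matches all documents containing "BAZ" or "EXTRA", but assigns a higher score to documents in which "BAZ" and "EXTRA" occur within 100 positions of each other; adjust the slop and the boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's fine, because in the next section I show how to build it with the API.)

Programmatically, you would build it as follows: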

// Declared as BooleanQuery (not Query) because add() is defined on BooleanQuery
BooleanQuery top = new BooleanQuery();

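// Construct the terms since they will be used more than once.
// Terms are not analyzed here: if the field was indexed with StandardAnalyzer,
// which lowercases tokens, use lowercase values ("baz", "extra") instead.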
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
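To illustrate how the three "should" clauses are meant to rank documents, here is a small plain-Java sketch. It is not Lucene's scoring, which also weighs term frequency, inverse document frequency and field norms, so the numbers below will not match what Lucene reports. The class ShouldClauseSketch, its methods and the weight of 1 per matching term are assumptions made only for this illustration; the boost of 5 and the in-order requirement are taken from the code above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of "term OR term OR boosted proximity"; not Lucene's scoring.
final class ShouldClauseSketch {

    static final float PROXIMITY_BOOST = 5f;

    // True if `first` occurs before `second` with at most `slop` positions between them.
    static boolean inOrderWithin(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Each clause is optional; a document matches if its score is greater than zero.
    static float score(List<String> tokens, String first, String second, int slop) {
        float score = 0f;
        if (tokens.contains(first)) {
            score += 1f;                 // first term clause
        }
        if (tokens.contains(second)) {
            score += 1f;                 // second term clause
        }
        if (inOrderWithin(tokens, first, second, slop)) {
            score += PROXIMITY_BOOST;    // boosted proximity clause
        }
        return score;
    }

    public static void main(String[] args) {
        // Partial match: only "baz" is present, so the document still matches.
        System.out.println(score(Arrays.asList("foo", "bar", "baz"), "baz", "extra", 100));  // 1.0
        // Both terms present, in order and close together: the boost applies.
        System.out.println(score(Arrays.asList("baz", "and", "extra"), "baz", "extra", 100)); // 7.0
    }
}
```

With this shape, the document from the question is returned because of the single "baz" clause, while documents that contain both terms near each other rank above it.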

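Hope that helps! In the future, try to start by posting exactly what results you expect. Even if it is obvious to you, it may not be to the reader, and being explicit avoids a lot of back and forth.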