Lucene custom analyzer for indexing and querying

Coo*_*hie 2 lucene solr

I'm working with Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration.

 <analyzer> 
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>  
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" 
            generateNumberParts="1" 
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>

However, I can't figure out how to use HTMLStripCharFilterFactory and WordDelimiterFilterFactory with the configuration above. Also, my Solr query analyzer is shown below; how can I achieve the same thing in Lucene?

 <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>

fem*_*gon 5

The analysis package documentation explains how to use a CharFilter: you wrap the reader in an overridden initReader method.

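I assume your problem with WordDelimiterFilter is that you don't know how to set the configuration options you're using? Construct an int to pass to the constructor by combining the appropriate constants with bitwise OR (|). For example:

int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.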
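To see why the flags have to be combined with | rather than &, here is a minimal, Lucene-free sketch. The constants are hypothetical stand-ins that only assume each flag occupies its own single bit; real code would use the WordDelimiterFilter fields directly, and the class and method names here are made up for illustration.

```java
// Illustration only: stand-in constants, NOT the Lucene fields.
// Assumption: each flag is a distinct single bit, as is usual for bit-flag options.
final class FlagDemo {
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS = 4;
    static final int CATENATE_NUMBERS = 8;
    static final int PRESERVE_ORIGINAL = 32;

    // Bitwise OR switches on every listed flag.
    static int combineWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
             | CATENATE_WORDS | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND of distinct single-bit flags has no bits in common, so it is 0.
    static int combineWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS
             & CATENATE_WORDS & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // AND is the right operator for testing whether one flag is set.
    static boolean isSet(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println("OR:  " + combineWithOr());
        System.out.println("AND: " + combineWithAnd());
        System.out.println("PRESERVE_ORIGINAL set? "
            + isSet(combineWithOr(), PRESERVE_ORIGINAL));
    }
}
```

With the AND version every option ends up switched off, so the filter would behave as if none of the Solr attributes had been set to 1.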

So, in the end you might end up with something like this:

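import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

//StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
public class MyAnalyzer extends StopwordAnalyzerBase {

    //static, so that it can be referenced inside the super(...) call
    private static final Version VERSION = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        //loadStopwordSet closes the reader once the set is loaded
        super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
        //Might as well load this config up front, along with the stop words
        //Flags are combined with bitwise OR; omitted flags (e.g. CATENATE_ALL) stay off
        wordDelimiterConfig =
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.GENERATE_NUMBER_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.CATENATE_NUMBERS |
            WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
        //null: no protected words
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        filter = new LowerCaseFilter(VERSION, filter);
        //"stopwords" is the set inherited from StopwordAnalyzerBase
        filter = new StopFilter(VERSION, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        //CharFilters wrap the raw reader before tokenization
        return new HTMLStripCharFilter(reader);
    }
}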

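Note: I've moved the StopFilter after the LowerCaseFilter. This makes it case-insensitive, as long as your stop word definitions are all lowercase. I don't know whether that causes a problem for WordDelimiterFilter. If it does, there is a loadStopwordSet method that supports case insensitivity, but frankly I don't know how to use it.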
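The effect of that ordering can be shown without Lucene at all. The sketch below is a plain-Java stand-in, not the Lucene filters: the stop word list, class name, and method names are hypothetical, and tokens are simply split on whitespace. It compares lowercasing before the stop word lookup with lowercasing after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics the ordering of a lowercase step and a
// case-sensitive stop word lookup; it does not use any Lucene classes.
final class StopOrderDemo {
    // Hypothetical stop word list, all lowercase.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "of"));

    // Lowercase first, then drop stop words: matching is effectively case-insensitive.
    static List<String> lowercaseThenStop(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) out.add(lower);
        }
        return out;
    }

    // Drop stop words first, then lowercase: "The" and "THE" slip through.
    static List<String> stopThenLowercase(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (!STOP_WORDS.contains(token)) out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The Lord of THE Rings";
        System.out.println(lowercaseThenStop(text)); // [lord, rings]
        System.out.println(stopThenLowercase(text)); // [the, lord, the, rings]
    }
}
```

With an exact-match stop set, only the first ordering removes capitalized stop words, which is why the stop word file has to be all lowercase for the reordered chain to behave as intended.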