Solr spellcheck for the Russian language

Kir*_*aLT 7 solr spell-checking lang cyrillic

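I use Solr spellcheck for Russian. Everything works when you type in Cyrillic characters, but it does not work when you type in Latin characters.

I want spellcheck to work both when you type in Cyrillic and when you type in Latin characters, and to return the correction as Cyrillic text.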

For example, when you type:

телевидениеее or televidenieee

It should correct to:

телевидение

In schema.xml:

<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>

In solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="combineWords">false</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">1</int>
    </lst>
</searchComponent>

Thanks for the help.

rch*_*ukh 5

This can be achieved with ICUTransformFilterFactory, which will (un)transliterate the input query every time.

Here is an example of how to enable it:

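  1. Enable the icu4j analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):

    These libraries can be found in the contrib/analysis-extras folder of the official Solr distribution (they are also available via Maven).

    Add something like this to solrconfig.xml to enable them (you could instead keep a single lib directory with all the required jars; this example simply uses the default locations relative to the example/solr/collection1/conf folder of the official distribution):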

    <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
    <lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
    
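  2. Split the spell_text field analyzer into two separate analyzers, one for index and one for query.

  3. Add solr.ICUTransformFilterFactory to the query analyzer with the following id: Any-Cyrillic; NFD; [^\p{Alnum}] Remove: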

    <fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
      </analyzer>
    </fieldType>
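
The solrconfig.xml in the question builds its dictionaries from a field named spellcheck, so that field needs to use the spell_text type for the analyzers above to take effect. A minimal sketch of the schema.xml wiring, assuming a hypothetical source field called title (the post does not show which fields feed the dictionary):

```xml
<!-- Dictionary field referenced by the "field" setting of each spellchecker in solrconfig.xml -->
<field name="spellcheck" type="spell_text" indexed="true" stored="false" multiValued="true"/>

<!-- Assumption: "title" is a placeholder for whichever fields should feed the dictionary -->
<copyField source="title" dest="spellcheck"/>
```

Because only the query analyzer gains the ICU transform, the indexed dictionary terms stay in whatever script the source fields contain, and Latin input is converted to Cyrillic at query time only.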
    

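Regarding the ICUTransformFilterFactory id, Any-Cyrillic; NFD; [^\p{Alnum}] Remove: Any-Cyrillic transliterates the input text into Cyrillic script, NFD applies Unicode canonical decomposition, and [^\p{Alnum}] Remove then deletes every character that is not alphanumeric (punctuation and the combining marks separated out by NFD).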

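The configuration described above works on my local machine in the same way for transliterated Russian and for Russian words typed in Cyrillic.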
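
For the transform to reach what the user types, the spellcheck component must analyze the incoming query with the spell_text query analyzer. Solr's SpellCheckComponent accepts a queryAnalyzerFieldType setting for this purpose. The sketch below reuses the default spellchecker from the question. The post does not show whether the full solrconfig.xml already has this line, so treat it as an assumption to verify against your Solr version:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <!-- Assumption: analyze the spellcheck query with the field type that carries the ICU transform -->
    <str name="queryAnalyzerFieldType">spell_text</str>

    <!-- Unchanged from the question -->
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
</searchComponent>
```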