Apache Lucene TokenStream contract violation

Mul*_*ard 5 java lucene

Removing stop words with an Apache Lucene TokenStream results in this error:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

I use this code:

public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

    while(stopFilter.incrementToken()) {
        if(stringBuilder.length() > 0 ) {
            stringBuilder.append(" ");
        }

        stringBuilder.append(token.toString());
    }

    stopFilter.end();
    stopFilter.close();

    return stringBuilder.toString();
}

But as you can see, I never call reset() or close().

So why am I getting this error?

min*_*das 8

I never call reset() or close().

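Well, that is your problem. If you read the TokenStream Javadoc, you will find the following:

The workflow of the TokenStream API is as follows:

  1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream#reset()
  3. ...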

I only had to add a single line with reset() to your code and it worked.

...    
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();   // I added this 
while(stopFilter.incrementToken()) {
...
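To make the workflow quoted from the Javadoc concrete, here is a minimal self-contained sketch with no Lucene dependency. The `SketchStream`, `SketchTokenizer` and `SketchStopFilter` types are hypothetical stand-ins invented for this illustration, not Lucene classes. They only mimic two rules: a stream refuses to produce tokens before reset(), and a filter forwards reset(), end() and close() to the stream it wraps.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class ContractSketch {

    /** Hypothetical stand-in for a token stream; this is NOT a Lucene type. */
    interface SketchStream {
        void reset();
        boolean incrementToken();
        String term();
        void end();
        void close();
    }

    /** Splits on whitespace and refuses to produce tokens before reset(). */
    static class SketchTokenizer implements SketchStream {
        private final List<String> tokens;
        private Iterator<String> cursor; // stays null until reset() is called
        private String current;

        SketchTokenizer(String text) {
            tokens = Arrays.asList(text.trim().split("\\s+"));
        }

        public void reset() { cursor = tokens.iterator(); }

        public boolean incrementToken() {
            if (cursor == null) {
                throw new IllegalStateException("contract violation: reset() call missing");
            }
            if (!cursor.hasNext()) return false;
            current = cursor.next();
            return true;
        }

        public String term() { return current; }
        public void end() { /* nothing to finalize in this model */ }
        public void close() { cursor = null; }
    }

    /** Drops stop words; reset(), end() and close() go to the wrapped stream. */
    static class SketchStopFilter implements SketchStream {
        private final SketchStream input;
        private final Set<String> stopWords;

        SketchStopFilter(SketchStream input, Set<String> stopWords) {
            this.input = input;
            this.stopWords = stopWords;
        }

        public void reset() { input.reset(); }

        public boolean incrementToken() {
            // Skip tokens until one that is not a stop word is found.
            while (input.incrementToken()) {
                if (!stopWords.contains(input.term())) return true;
            }
            return false;
        }

        public String term() { return input.term(); }
        public void end() { input.end(); }
        public void close() { input.close(); }
    }

    /** Builds a tokenizer wrapped in a stop filter (assumed stop-word list). */
    static SketchStream chain(String text) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "is"));
        return new SketchStopFilter(new SketchTokenizer(text), stop);
    }

    /** Consumer workflow: reset(), incrementToken() loop, end(), close(). */
    static String consume(SketchStream stream, boolean callReset) {
        StringBuilder sb = new StringBuilder();
        try {
            if (callReset) stream.reset(); // called on the outermost stream
            while (stream.incrementToken()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(stream.term());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: fox quick
        System.out.println(consume(chain("the fox is quick"), true));
        try {
            consume(chain("the fox is quick"), false);
        } catch (IllegalStateException e) {
            // prints: contract violation: reset() call missing
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the consumer calls reset() on the outermost stream and the filter passes the call down the chain. Applied to the question's code, the same pattern would mean calling reset() on stopFilter, the stream that is actually being consumed, rather than on the inner tokenizer.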

  • Hmm... I have that one line and I still get this error :) (2 upvotes)