As of LUCENE-5472, Lucene throws an error instead of just logging a message when a term is too long. The error shows that Solr does not accept tokens larger than 32766 bytes:
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
To fix this, I added two filters to my schema (in bold):
<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
**<filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
<filter class="solr.LengthFilterFactory" min="2" max="32700" />**
</analyzer>
</fieldType>
The error initially stayed the same (which made me think the filters were perhaps not set up correctly?). Restarting the server turned out to be the key. Thank you, Mr. Bashetti!
The remaining question is: which one is better, LengthFilterFactory or TruncateTokenFilterFactory? And is it correct to assume that one byte is one character (since the filter is supposed to remove 'unusual' characters)? Thanks!
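On the byte-vs-character question: it is not safe to assume one byte per character. Lucene's 32766 limit applies to the UTF-8 encoded byte length of the term, while LengthFilterFactory and TruncateTokenFilterFactory count Java characters; a non-ASCII character can encode to multiple UTF-8 bytes, so a cap of max="32700" characters can still exceed 32766 bytes. A minimal Java sketch of the difference (class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    public static void main(String[] args) {
        String ascii = "hello"; // ASCII: 1 byte per character in UTF-8
        String cjk = "你好";     // CJK: 3 bytes per character in UTF-8

        System.out.println(ascii.length());                                // 5 chars
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length); // 5 bytes
        System.out.println(cjk.length());                                  // 2 chars
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);   // 6 bytes
    }
}
```

In the worst case a single Java char encodes to 3 UTF-8 bytes, so a character limit of roughly 32766 / 3 ≈ 10922 is the largest that is guaranteed safe for arbitrary input.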
The error says "SOLR doesn't accept tokens larger than 32766".
The problem occurred because you previously used a String fieldType for the text field, and after changing the field type you kept seeing the same error because you had not restarted the Solr server after the change.
I don't think there is any need to add TruncateTokenFilterFactory or LengthFilterFactory here, but that depends on you and your requirements.
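As to which of the two is better: they handle an oversized token differently, which a plain-Java sketch can illustrate (these are hypothetical helpers imitating the behavior, not the actual Lucene TokenFilter classes):

```java
import java.util.Optional;

public class FilterSketch {
    // TruncateTokenFilterFactory: keep the token, cut it down to prefixLength chars
    static String truncate(String token, int prefixLength) {
        return token.length() <= prefixLength ? token : token.substring(0, prefixLength);
    }

    // LengthFilterFactory: drop the token entirely if its length is outside [min, max]
    static Optional<String> lengthFilter(String token, int min, int max) {
        int len = token.length();
        return (len >= min && len <= max) ? Optional.of(token) : Optional.empty();
    }
}
```

In other words, TruncateTokenFilterFactory keeps a shortened prefix of a huge token (so it may still be searchable), while LengthFilterFactory silently discards it. Which is preferable depends on whether a truncated giant token has any search value in your data.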