Eri*_*son 72 java lucene attributes token tokenize
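I'm trying to use Apache Lucene for tokenizing, and I'm baffled by the process of obtaining tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address exactly my question.
Somehow, an AttributeSource is supposed to be used instead of Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?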
Ada*_*ter 112
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
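According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is preferable to getAttribute. With both changes applied: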
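TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// addAttribute returns the stream's existing attribute instance, or registers one if absent
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

// reset() must be called before the first incrementToken()
tokenStream.reset();
while (tokenStream.incrementToken()) {
    // the same attribute instances are updated in place for each token
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}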
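If it helps to see why attributes are fetched once and then read inside the loop, here is a rough toy model of the pattern in plain Java. It is only a sketch and does not use Lucene: every class here (ToyAttributeSource, ToyTermAttribute, ToyOffsetAttribute, ToyWhitespaceStream, ToyTokenStreamDemo) is a hypothetical stand-in invented for illustration. The idea it tries to show is that the stream owns one instance per attribute type, incrementToken() overwrites those instances in place, addAttribute creates the instance when it is missing, and getAttribute fails when it is missing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes imitate the shape of the attribute API
// discussed above. They are NOT Lucene classes.
final class ToyTermAttribute {
    private final StringBuilder buffer = new StringBuilder();

    void set(String term) {
        buffer.setLength(0);
        buffer.append(term);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

final class ToyOffsetAttribute {
    int start;
    int end;
}

class ToyAttributeSource {
    // One shared instance per attribute type.
    private final Map<Class<?>, Object> attributes = new LinkedHashMap<>();

    // Creates the attribute on first request, otherwise returns the existing one.
    <T> T addAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            try {
                existing = type.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot create " + type, e);
            }
            attributes.put(type, existing);
        }
        return type.cast(existing);
    }

    // Returns the existing attribute, or fails if nobody registered it.
    <T> T getAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            throw new IllegalArgumentException("no attribute of type " + type);
        }
        return type.cast(existing);
    }
}

// Splits text on whitespace and publishes each token through the shared attributes.
final class ToyWhitespaceStream extends ToyAttributeSource {
    private final String text;
    private final ToyTermAttribute term = addAttribute(ToyTermAttribute.class);
    private final ToyOffsetAttribute offset = addAttribute(ToyOffsetAttribute.class);
    private int pos;

    ToyWhitespaceStream(String text) {
        this.text = text;
    }

    // Advances to the next token; returns false when the input is exhausted.
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return false;
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        term.set(text.substring(start, pos)); // overwritten in place, no new token object
        offset.start = start;
        offset.end = pos;
        return true;
    }
}

final class ToyTokenStreamDemo {
    static List<String> collectTokens(String text) {
        ToyWhitespaceStream stream = new ToyWhitespaceStream(text);
        // Fetch the attribute references once, before the loop.
        ToyTermAttribute term = stream.addAttribute(ToyTermAttribute.class);
        ToyOffsetAttribute offset = stream.addAttribute(ToyOffsetAttribute.class);
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            // Copy the values out now; the next call will overwrite them.
            out.add(term + "[" + offset.start + "," + offset.end + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectTokens("quick brown fox"));
    }
}
```

The practical consequence for the real API is the same as in the toy: anything you want to keep past the current iteration (for example the term text) has to be copied out inside the loop, which is what charTermAttribute.toString() does.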
yeg*_*256 37
This is how it should look (a concise version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();