Eri*_*son 72 java lucene attributes token tokenize
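I'm trying to use Apache Lucene for tokenizing, and I'm baffled by the process of obtaining tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question.
Somehow, an AttributeSource is supposed to be used rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?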
Ada*_*ter 112
Yes, it's a bit convoluted (compared with the good old way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
As Donotello pointed out, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is preferable to getAttribute.
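TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// Register the attributes once, before consuming the stream.
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset();
while (tokenStream.incrementToken()) {
    // The same attribute instances are updated in place for each token.
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
// Finish and release the stream once all tokens have been read.
tokenStream.end();
tokenStream.close();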
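To see why the attributes are fetched once before the loop and then read inside it, here is a toy model of the pattern that uses only the JDK. It is a sketch, not Lucene code: ToyTermAttribute and ToyWhitespaceStream are hypothetical stand-ins for the real attribute and TokenStream classes, and the whitespace splitting is only there to have something to iterate over. The point it illustrates is that the stream owns a single attribute object and overwrites it on every incrementToken() call, instead of handing back a new Token each time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the attribute pattern. These classes are NOT Lucene's;
// all names here are hypothetical and exist only for illustration.
public class ToyTokenDemo {

    // Stand-in for CharTermAttribute plus OffsetAttribute: one mutable holder.
    static final class ToyTermAttribute {
        String term = "";
        int startOffset;
        int endOffset;
    }

    // Stand-in for a TokenStream that splits its input on whitespace.
    static final class ToyWhitespaceStream {
        private final String text;
        private final ToyTermAttribute attribute = new ToyTermAttribute();
        private int position;

        ToyWhitespaceStream(String text) {
            this.text = text;
        }

        // Always returns the same instance, so callers can fetch it once.
        ToyTermAttribute attribute() {
            return attribute;
        }

        void reset() {
            position = 0;
        }

        // Advances to the next token and overwrites the shared attribute.
        // Returns false once the input is exhausted.
        boolean incrementToken() {
            while (position < text.length() && Character.isWhitespace(text.charAt(position))) {
                position++;
            }
            if (position >= text.length()) {
                return false;
            }
            int start = position;
            while (position < text.length() && !Character.isWhitespace(text.charAt(position))) {
                position++;
            }
            attribute.term = text.substring(start, position);
            attribute.startOffset = start;
            attribute.endOffset = position;
            return true;
        }
    }

    // Consumes the toy stream the same way the answer consumes a TokenStream.
    public static List<String> tokenize(String text) {
        ToyWhitespaceStream stream = new ToyWhitespaceStream(text);
        ToyTermAttribute attr = stream.attribute(); // fetched once, before the loop
        stream.reset();
        List<String> tokens = new ArrayList<>();
        while (stream.incrementToken()) {
            // Copy the values out now; the next call overwrites them.
            tokens.add(attr.term + "@" + attr.startOffset + "-" + attr.endOffset);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello big world"));
        // prints [Hello@0-5, big@6-9, world@10-15]
    }
}
```

Because the holder is reused, anything you want to keep past the current iteration has to be copied out inside the loop, which is what calling toString() on the term attribute does in the snippet above.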
yeg*_*256 37
This is how it should look (a concise version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();