Cod*_*lus 13 java lucene tokenize n-gram
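I'm trying to tokenize a string into n-grams. Strangely, in the documentation for NGramTokenizer I don't see a method that returns the individual n-grams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects.
Here is my code: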
Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
I'd like my output to look like this: This, is, a, test, string, This is, is a, a test, test string, This is a, is a test, a test string.
fem*_*gon 18
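I don't think you'll find what you're looking for by searching for methods that return String. You'll need to work with Attributes instead.
It should work something like this: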
Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
gramTokenizer.reset();
while (gramTokenizer.incrementToken()) {
    // The attribute is updated in place on each call to incrementToken()
    String token = charTermAttribute.toString();
    //Do something
}
gramTokenizer.end();
gramTokenizer.close();
Be sure to reset() the Tokenizer if it needs to be reused after that, though.
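To make the character-level behavior concrete, here is a minimal plain-Java sketch (no Lucene involved) that lists every substring of 1 to 3 characters. The class and method names `CharNGrams` and `charNGrams` are made up for illustration, and the order in which Lucene itself emits the grams may differ between versions, so treat this only as a picture of the kind of tokens to expect:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: enumerates character n-grams.
class CharNGrams {

    // Returns every substring whose length is between minGram and maxGram,
    // grouped by starting position.
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= text.length(); size++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        for (String gram : charNGrams("test", 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

For the input "test" this yields pieces such as "t", "te" and "tes" rather than whole words, which is why a character n-gram tokenizer is not the right tool for the output the question asks for.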
Per the comment, to tokenize groupings of words rather than characters:
Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
tokenizer = new ShingleFilter(tokenizer, 1, 3);
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
tokenizer.end();
tokenizer.close();
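If it helps to see what word-level groupings of 1 to 3 words look like without pulling in Lucene at all, here is a rough plain-Java sketch. It is not how ShingleFilter is implemented, the names `WordNGrams` and `wordNGrams` are invented for this example, and it assumes simple whitespace splitting rather than StandardTokenizer's rules:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: enumerates word n-grams ("shingles").
class WordNGrams {

    // Returns every run of minSize..maxSize consecutive whitespace-separated
    // words, grouped by the position of the first word.
    static List<String> wordNGrams(String text, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    gram.append(' ');
                }
                gram.append(words[start + size - 1]);
                if (size >= minSize) {
                    grams.add(gram.toString());
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        for (String gram : wordNGrams("This is a test string", 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

For "This is a test string" this produces the same twelve groupings listed in the question, although grouped by starting word rather than by length.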