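I am upgrading from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, IndexReader has a method, IndexReader.getTermFreqVectors(), which I could use to extract the frequency of each term in a given document and field.
That method has now been replaced by IndexReader.getTermVectors(), which returns Terms. How can I use this (or possibly some other method) to extract the term frequencies for a document and field?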
小智 13
Maybe this will help:
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// get the term vector for one document and one field
// (null if no term vector was stored for that field)
Terms terms = reader.getTermVector(docID, "fieldName");
if (terms != null && terms.size() > 0) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;
    // walk through the terms for this field
    while ((term = termsEnum.next()) != null) {
        // enumerate the documents; a term vector covers only one
        DocsEnum docsEnum = termsEnum.docs(null, null);
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // print the term, the document ID and the term frequency in that document
            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
        }
    }
}
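For readers unsure what the loop above prints: a term vector for one document and field is conceptually a map from each term to the number of times it occurs in that field. The sketch below illustrates only that idea in plain Java and does not use Lucene. The class name TermFrequencySketch, the method termFrequencies, and the lowercase-and-split-on-whitespace tokenization are assumptions made for illustration; a real Lucene analyzer may tokenize quite differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of Lucene: counts how often each
// token occurs in the text of a single field.
public class TermFrequencySketch {

    // Assumption: terms are lowercased, whitespace-separated tokens.
    static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> freqs = new TreeMap<>(); // sorted by term
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token on leading whitespace
            }
            freqs.merge(token, 1, Integer::sum); // add 1 to the running count
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Print one "term frequency" pair per line, similar to the loop above
        for (Map.Entry<String, Integer> e : termFrequencies("to be or not to be").entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the Lucene snippet, term.utf8ToString() plays the role of the map key and docsEnum.freq() the role of the count.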