Term vector frequency in Lucene 4.0

mos*_*aab 9 lucene

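I'm upgrading from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, IndexReader has a method, IndexReader.getTermFreqVectors(), which I can use to extract the frequency of each term in a given document and field.

This method is now replaced with IndexReader.getTermVectors(), which returns Terms. How can I use this (or possibly other methods) to extract the term frequencies in a document and field?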

小智 13

Maybe this will help you:

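import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// get the term vector for one document and one field
// (null if the field was not indexed with term vectors)
Terms terms = reader.getTermVector(docID, "fieldName");

// size() may return -1 when the codec does not report it,
// so only the null check is used as a guard here
if (terms != null) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;
    DocsEnum docsEnum = null;

    // iterate over the terms of this field
    while ((term = termsEnum.next()) != null) {
        // enumerate the documents for this term; a term vector covers only one,
        // and passing the previous DocsEnum lets Lucene reuse it
        docsEnum = termsEnum.docs(null, docsEnum);
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // print the term, the document ID within the enum, and the term frequency
            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
        }
    }
}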
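
If you want to use the frequencies afterwards rather than print them, one option is to collect them into a map inside the inner loop. The following is only a sketch in plain Java, not part of Lucene: TermFreqCollector and its methods are hypothetical names made up for illustration. The idea is to call collector.add(term.utf8ToString(), docsEnum.freq()) where the println is above. Converting the BytesRef to a String straight away matters, because the instance returned by next() may be reused on the following call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Lucene class): holds term -> frequency
// for the single document and field being inspected.
final class TermFreqCollector {
    private final Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();

    // Intended to be called from the inner loop of the snippet above, e.g.
    // collector.add(term.utf8ToString(), docsEnum.freq());
    void add(String term, int freq) {
        Integer old = freqs.get(term);
        freqs.put(term, old == null ? freq : old + freq);
    }

    // Frequency recorded for a term, or 0 if the term was never added.
    int freq(String term) {
        Integer f = freqs.get(term);
        return f == null ? 0 : f;
    }

    // Terms ordered by descending frequency; ties are broken alphabetically.
    List<String> termsByFrequency() {
        List<String> terms = new ArrayList<String>(freqs.keySet());
        Collections.sort(terms, new Comparator<String>() {
            public int compare(String a, String b) {
                int byFreq = freqs.get(b).compareTo(freqs.get(a));
                return byFreq != 0 ? byFreq : a.compareTo(b);
            }
        });
        return terms;
    }
}
```

After the loop finishes, collector.freq("someTerm") gives the count for one term and collector.termsByFrequency() lists the terms from most to least frequent.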