Traditional PDF indexing solution compared with a graph-based version

Question

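My goal is to index an arbitrary directory containing PDF files (among other file types) using keywords stored in a list. I have a traditional solution, and I have heard that a graph-based solution using, for example, SimpleGraph could be more elegant/efficient and independent of the directory structure.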

What would a graph-based solution (e.g. with SimpleGraph) look like?

Traditional solution

// https://stackoverflow.com/a/14051951/1497139
List<File> pdfFiles = this.explorePath(TestPDFFiles.RFC_DIRECTORY, "pdf");
List<PDFFile> pdfs = this.getPdfsFromFileList(pdfFiles);
…
for (PDFFile pdf:pdfs) {
     // https://stackoverflow.com/a/9560307/1497139
     if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(pdf.getText(), keyWord)) {
          foundList.add(pdf.file.getName()); // here we access by structure (early binding)
          // - in the graph solution by name (late binding)
     }
}
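
The snippet above does not show `explorePath`. As a rough illustration of the file-collection step, it could be sketched with `java.nio.file.Files.walk` as follows. The class name `PathExplorerSketch` and the `Path`-based signature are assumptions made for this sketch; it is not the code from the linked answer.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original question's code base.
class PathExplorerSketch {

  /**
   * Recursively collects all regular files below root whose name ends
   * with "." + extension, ignoring case.
   */
  static List<File> explorePath(Path root, String extension) throws IOException {
    String suffix = "." + extension.toLowerCase(Locale.ROOT);
    // Files.walk returns a lazy stream that must be closed after use
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString()
              .toLowerCase(Locale.ROOT).endsWith(suffix))
          .map(Path::toFile)
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Directory to scan: first argument, or the current directory
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    for (File pdf : explorePath(root, "pdf")) {
      System.out.println(pdf);
    }
  }
}
```

Note that this step still depends on the directory layout: the caller has to know that the input is a file tree, which is the "early binding" the comment in the loop refers to.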


Answer 1

Wol*_*ahl 5

Basically, with SimpleGraph you would use a combination of the modules

FileSystem
PDFSystem

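With the FileSystem module you collect a graph of the files in a directory and filter it to include only files with the extension pdf. You then analyze the PDFs with the PDFSystem to get the page/text structure. There is already a test case in the simplegraph-bundle module that shows how this works with some RFC PDFs as input.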

TestPDFFiles.java

I have now added the indexing test, see below.

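The core functionality is taken from the old test, which searched for a single keyword; the keyword is now passed in as a parameter: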

List<Object> founds = pdfSystem.g().V().hasLabel("page")
      .has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
      .dedup().values("name").toList();


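This is a Gremlin query that does most of the work by searching the whole tree of PDF files in a single call. I consider this more elegant, since you do not have to care about the structure of the input (tree/graph/filesystem/database, etc.).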
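
To make the steps of that traversal easier to follow, here is a plain-Java sketch that mirrors them on an in-memory list instead of a graph. The `Page` class and the `filesContaining` method are hypothetical stand-ins and not part of SimpleGraph. Quoting the keyword and using `Pattern.DOTALL` are choices made in this sketch; whether `RegexPredicate` behaves the same way is not something the answer states.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical in-memory model, only meant to illustrate the query steps.
class PageSearchSketch {

  /** Stand-in for a "page" vertex together with the file that owns it. */
  static final class Page {
    final String fileName; // corresponds to the "name" of the file vertex
    final String text;     // corresponds to the "text" property of the page

    Page(String fileName, String text) {
      this.fileName = fileName;
      this.text = text;
    }
  }

  static List<String> filesContaining(List<Page> pages, String keyWord) {
    // Quote the keyword so regex metacharacters are matched literally;
    // DOTALL lets "." also match the line breaks inside a page's text.
    Pattern pattern = Pattern.compile(
        ".*" + Pattern.quote(keyWord) + ".*", Pattern.DOTALL);
    return pages.stream()
        .filter(page -> pattern.matcher(page.text).matches()) // has("text", regex)
        .map(page -> page.fileName)  // in("pages") followed by values("name")
        .distinct()                  // dedup()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Page> pages = Arrays.asList(
        new Page("rfc1.pdf", "ARPA network\nproposal"),
        new Page("rfc1.pdf", "more about ARPA"),
        new Page("rfc2.pdf", "host plan"));
    System.out.println(filesContaining(pages, "ARPA")); // prints [rfc1.pdf]
  }
}
```

The sketch has to know that pages belong to files, while the Gremlin version only follows the `pages` edge by name, which is the late binding mentioned in the question.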

JUnit Testcase

  /**
   * test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
   */
  @Test
  public void testPDFIndexing() throws Exception {
    FileSystem fs = getFileSystem(RFC_DIRECTORY);
    int limit = Integer.MAX_VALUE;
    PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
    Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
        "proposal", "plan");
    // debug=true;
    if (debug) {
      for (Entry<String, List<String>> indexEntry : index.entrySet()) {
        List<String> fileNameList = indexEntry.getValue();
        System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
            fileNameList.size(), fileNameList));
      }
    }
    assertEquals(14,index.get("ARPA").size());
    assertEquals(9,index.get("plan").size());
    assertEquals(8,index.get("proposal").size());
  }
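
The test case calls `getIndex` without showing it. As a hedged illustration of the data structure it returns (keyword mapped to the list of file names containing it), a plain-Java version over an in-memory map might look like this. The class `KeywordIndexSketch` and the `pagesByFile` parameter are assumptions for this sketch; the real method takes a `PdfSystem` and presumably runs the Gremlin query once per keyword.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the getIndex implementation from SimpleGraph.
class KeywordIndexSketch {

  /**
   * Builds keyword -> file names for every keyword; pagesByFile maps a
   * file name to the texts of its pages. Matching is case-sensitive.
   */
  static Map<String, List<String>> getIndex(
      Map<String, List<String>> pagesByFile, String... keyWords) {
    Map<String, List<String>> index = new LinkedHashMap<>();
    for (String keyWord : keyWords) {
      List<String> fileNames = new ArrayList<>();
      for (Map.Entry<String, List<String>> file : pagesByFile.entrySet()) {
        // A file is listed once if any of its pages contains the keyword
        boolean found = file.getValue().stream()
            .anyMatch(text -> text.contains(keyWord));
        if (found) {
          fileNames.add(file.getKey());
        }
      }
      index.put(keyWord, fileNames);
    }
    return index;
  }

  public static void main(String[] args) {
    Map<String, List<String>> pagesByFile = new LinkedHashMap<>();
    pagesByFile.put("rfc1.pdf", Arrays.asList("ARPA network", "a proposal"));
    pagesByFile.put("rfc2.pdf", Arrays.asList("host plan for ARPA"));
    Map<String, List<String>> index =
        getIndex(pagesByFile, "ARPA", "proposal", "plan");
    // Same output format as the debug branch of the test case
    for (Map.Entry<String, List<String>> entry : index.entrySet()) {
      System.out.println(String.format("%15s=%3d %s", entry.getKey(),
          entry.getValue().size(), entry.getValue()));
    }
  }
}
```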

