How do I index PDF files using the Elasticsearch ingest attachment plugin?

Ash*_*ley 5 full-text-search elasticsearch elasticsearch-plugin

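I have to implement full-text search over PDF documents using the Elasticsearch ingest attachment plugin. When I try to search for the word someword in the PDF document, I get an empty hits array.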

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching for the word in the PDF

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}

Val*_*Val 4

When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }
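To see why a search for someword cannot match this particular document, it helps to decode the data value yourself. The short Python check below uses only the standard library and the exact string from the indexing request. It shows that the payload is a small RTF snippet rather than the contents of bh1.pdf, which matches the application/rtf content type in the document above.

```python
import base64

# The exact "data" value sent in the indexing request above.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 payload back into the raw text the attachment processor receives.
decoded = base64.b64decode(data).decode("ascii")

# Show the raw markup, including control characters.
print(repr(decoded))

# The extracted text is "Lorem ipsum dolor sit amet"; "someword" never appears in it.
print("someword" in decoded)
```

So even with the right field, only words that actually occur in the encoded content (such as lorem) can produce hits.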

So your query needs to search the attachment.content field, not the data field, which only carries the raw Base64 content at indexing time.

Modify your query to this and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}

PS: Prefer POST over GET when sending a request body, since not every HTTP client sends a body with GET.
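To index a real PDF instead of the sample payload, the data field has to contain the Base64 encoding of the file's own bytes. Here is a minimal sketch of building both request bodies with the Python standard library. The function names build_index_body and build_search_body are hypothetical helpers invented for this illustration, and sending the requests to the cluster is left out on purpose.

```python
import base64
import json


def build_index_body(pdf_bytes: bytes, filename: str, title: str) -> str:
    """Build the JSON body for PUT my_index/my_type/my_id?pipeline=attachment."""
    return json.dumps({
        "filename": filename,
        "title": title,
        # The attachment processor expects the binary file as a Base64 string.
        "data": base64.b64encode(pdf_bytes).decode("ascii"),
    })


def build_search_body(term: str) -> str:
    """Build a match query against the text extracted by the pipeline."""
    return json.dumps({
        "query": {"match": {"attachment.content": {"query": term}}}
    })


if __name__ == "__main__":
    # Placeholder bytes for illustration only; in practice read the file:
    #   with open(path, "rb") as f: pdf_bytes = f.read()
    pdf_bytes = b"%PDF-1.4 placeholder bytes, not a real PDF"
    print(build_index_body(pdf_bytes, "bh1.pdf", "Quick"))
    print(build_search_body("someword"))
```

The resulting strings can be pasted into the same PUT and POST requests shown above, or sent with whichever HTTP client you already use.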