Ash*_*ley 5 full-text-search elasticsearch elasticsearch-plugin
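I have to use the Elasticsearch ingest attachment plugin to implement full-text search in PDF documents. When I try to search for the word someword in the PDF document, I get an empty hits array.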
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : -1
}
}
]
}
//Code for indexing a document through the pipeline
PUT my_index/my_type/my_id?pipeline=attachment
{
"filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
"title" : "Quick",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
"query": {
"match": {
"data" : {
"query" : "someword"
}
}
}
}
When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:
{
"filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
},
"title": "Quick"
}
So your query needs to look into the attachment.content field and not the data one (which only serves the purpose of sending the raw content during indexing).
Modify your query to this and it will work:
POST /my_index/my_type/_search
{
"query": {
"match": {
"attachment.content": { <---- change this
"query": "lorem"
}
}
}
}
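To see why a search for someword finds nothing, it can help to decode the Base64 string that was sent in the data field. The sketch below uses only the Python standard library; the helper name decode_attachment is made up for illustration. It shows that the payload is a small RTF snippet containing "Lorem ipsum dolor sit amet" rather than the contents of bh1.pdf, which matches the application/rtf content type reported above.

```python
import base64

# The exact "data" value from the indexing request in the question.
DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="


def decode_attachment(data_b64: str) -> str:
    """Return the raw text carried by the Base64-encoded "data" field.

    Hypothetical helper for illustration; it is not part of Elasticsearch.
    """
    return base64.b64decode(data_b64).decode("ascii")


raw = decode_attachment(DATA)
print(repr(raw))

# The extracted text contains "Lorem", but never "someword".
print("Lorem" in raw, "someword" in raw)
```

So even with the corrected field name, only terms that actually occur in the extracted text (such as lorem) can produce hits.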
Run Code Online (Sandbox Code Playgroud)
PS: Use POST instead of GET when sending a payload
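Since the attachment processor only reads the field named in the pipeline (data here), the filename value is just a stored string and the file itself is never opened by Elasticsearch. A minimal sketch of how the request bodies could be prepared on the client side follows, assuming the PDF is readable locally. The function names build_document and build_content_query are hypothetical, and actually sending the bodies to the cluster is left out.

```python
import base64
import json
from pathlib import Path


def build_document(path: str, title: str) -> dict:
    """Build an indexing body whose "data" field holds the file as Base64.

    Hypothetical helper: the field names mirror the request in the question.
    """
    raw = Path(path).read_bytes()
    return {
        "filename": path,
        "title": title,
        "data": base64.b64encode(raw).decode("ascii"),
    }


def build_content_query(term: str) -> dict:
    """Build a match query against the text extracted by the pipeline."""
    return {"query": {"match": {"attachment.content": {"query": term}}}}


# Print the search body in the same shape as the corrected query above.
print(json.dumps(build_content_query("lorem"), indent=2))
```

The dictionary returned by build_document would be sent as the body of the indexing request with ?pipeline=attachment, and the one from build_content_query as the body of the _search request.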