Elasticsearch: stripping HTML tags before indexing documents, html_strip filter not working

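Given that I have specified my html_strip char filter in my custom analyzer

When I index a document with HTML content

Then I expect the HTML to be stripped out of the indexed content

And on retrieval, the document returned from the index should not contain HTML

Actual: the indexed document contains HTML, and the retrieved document contains HTML.

I have tried specifying the analyzer as index_analyzer, as one would expect, and out of desperation a few others, search_analyzer and analyzer. None seems to have any effect on the document being indexed or retrieved.

HTML_Strip test document, indexed against the analyzed fields:

Request: example POST of a document with HTML content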

POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

Expected: the indexed data has been parsed by the HTML analyzer. Actual: the data is indexed with the HTML intact.

Response:

{
   "_index": "html_poc_v2",   "_type": "html_poc_type",   "_id": "02", ...
   "_source": {
      "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
   }
}
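
The `_source` in this response is the JSON exactly as it was submitted. Char filters such as `html_strip` only change the tokens the analyzer produces for the index, not the stored document. If the stored document itself has to be free of markup, one option is to strip it on the client before sending the request. A minimal Python sketch of that idea, using only the standard library; the helper names `strip_html` and `clean_document` are hypothetical and not part of Elasticsearch or of the setup described here:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; in text nodes
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_data(self, data):
        # Note: the contents of <script>/<style> elements also arrive here
        self._parts.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace
        return " ".join("".join(self._parts).split())


def strip_html(value):
    """Return value with tags removed and character references decoded."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return parser.text()


def clean_document(doc):
    """Strip markup from every top-level string field of a document dict."""
    return {
        key: strip_html(val) if isinstance(val, str) else val
        for key, val in doc.items()
    }
```

Calling `clean_document` on the request body before the POST would store plain text in `_source`; whether that is acceptable depends on whether the original markup is still needed elsewhere.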

Settings and document mappings:

PUT …

Tags: html, mapping, full-text-search, filter, elasticsearch

Asked by Dad*_*Moe, 2017-05-23. Score: 17. Answers: 1. Views: 5968.
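Elasticsearch parent/child: how to query and filter on the parent and sort on a child field

So far I can filter parents by their children and sort the response on a child field, and it works like a charm. See below.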

GET /parent_child_meta_poc/paren_doc/_search
{
  "query": {
    "has_child": {
      "inner_hits": {},
      "type": "most_read_doc",
      "query": {
        "function_score": {
          "functions": [
            {
              "field_value_factor": {
                "factor": 1,
                "field": "read_count"
              }
            }
          ]
        }
      },
      "score_mode": "avg"
    }
  }
}

Source: Matt's gist
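
To make the scoring in this query concrete: with no inner query, `function_score` starts from a match-all score of 1, `field_value_factor` multiplies it by `factor * read_count`, and `score_mode: avg` gives each parent the average of its matching children's scores. The Python sketch below mimics that arithmetic on made-up data. The function name `rank_parents`, the sample URIs and the counts are illustrative assumptions, `parent_uri` is used as a stand-in for the real `_parent` link, and this is a simplified model rather than Elasticsearch's actual implementation.

```python
from collections import defaultdict


def rank_parents(children, factor=1.0):
    """Average factor * read_count per parent, highest average first."""
    scores = defaultdict(list)
    for child in children:
        # field_value_factor with no modifier: factor * field value
        scores[child["parent_uri"]].append(factor * child["read_count"])
    # score_mode "avg": a parent's score is the mean of its children's scores
    averaged = {parent: sum(vals) / len(vals) for parent, vals in scores.items()}
    # Default relevance ordering: descending score
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up child documents, for illustration only
children = [
    {"parent_uri": "/docs/a", "read_count": 10},
    {"parent_uri": "/docs/a", "read_count": 30},
    {"parent_uri": "/docs/b", "read_count": 50},
]
ranking = rank_parents(children)
```

Here `/docs/b` ranks ahead of `/docs/a` because its single child (50) outscores the average of the other parent's two children (20).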

Example child mapping:

{
  "mappings": {
    "most_read_doc": {
      "_parent": {
        "type": "paren_doc"
      },
      "properties": {
        "parent_uri": {
          "type": "string"
        },
        "read_count": {
          "type": "long"
        }
      }
    }
  }
}

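However, the limitation comes from the following requirement:

Give me all documents matching this search criteria, return only documents that have a "most_read_doc" child, and sort the returned parent documents on "most_read_doc.read_count".

Basically, I want to: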