Searching subtitle data in Elasticsearch

Question

Mik*_*ite 7 elasticsearch elasticsearch-model elasticsearch-mapping elasticsearch-query

Given the following data (a simple srt):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

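What is the best way to index it in Elasticsearch? There is a catch: I want the search result highlights to link to the exact time indicated by the timestamp. In addition, some phrases span multiple srt rows (for example, "final approach" in the sample above).

My ideas are:

1. Index the srt file as a list type, with the timestamps as the keys. I believe this would not match phrases that span multiple keys.
2. Create a custom tokenizer that indexes only the text part. I am not sure how well Elasticsearch can then highlight the original content.
3. Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or is there another option that solves this elegantly?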

Answer 1

Joe*_*ook 9

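Interesting question. Here's my take on it.

In essence, subtitles don't "know" about each other, which means it's best to include the previous and the next subtitle text in each doc (n - 1, n, n + 1).

As such, you'd be looking for a doc structure akin to: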

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}

To arrive at such a doc structure, I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
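To see why the neighbouring text matters, here is a minimal, self-contained sketch of the same n - 1, n, n + 1 windowing applied to plain strings. The helper name with_overlap is hypothetical and the plain substring check only stands in for a phrase match; it is not how Elasticsearch evaluates match_phrase.

```python
def with_overlap(texts):
    """Join each entry with its previous and next neighbour (n - 1, n, n + 1)."""
    windows = []
    for i, text in enumerate(texts):
        prev_text = texts[i - 1] + " " if i > 0 else ""
        next_text = " " + texts[i + 1] if i < len(texts) - 1 else ""
        windows.append(prev_text + text + next_text)
    return windows


# Three consecutive subtitle texts taken from the example in this thread.
texts = [
    "Senator, we're making our final",
    "approach into Coruscant.",
    "Very good, Lieutenant.",
]
windows = with_overlap(texts)

phrase = "final approach"
# Naive substring check as a stand-in for a phrase query (an assumption).
in_single = [phrase in t for t in texts]
in_window = [phrase in w for w in windows]

print(in_single)  # [False, False, False]
print(in_window)  # [True, True, False]
```

No single subtitle contains the phrase, but the windows of the two subtitles it straddles do, which is what the overlapping_text field is meant to capture.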

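Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable: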

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}

Once that's done, proceed with the ingestion:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)

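Once ingested, you can target the original subtitle text field and boost it if it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure that the hits are ordered by start, ascending. This somewhat defeats the purpose of boosting, but if you do sort, you can specify track_scores:true in the URI to ensure that the originally calculated scores are returned too.

Putting it all together: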

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
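If the phrase comes from user input, the same request body can be assembled in Python. This is only a sketch: the function name build_subtitle_query is hypothetical, track_scores is placed in the body instead of the URI, and the highlight section is an addition that is not part of the query above (it is included because the question asks for highlighted results).

```python
import json


def build_subtitle_query(phrase, boost=2):
    """Build a _search body that prefers direct matches on text and
    falls back to overlapping_text for phrases spanning two subtitles."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Direct hit inside a single subtitle, boosted.
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # Fallback: the phrase straddles neighbouring subtitles.
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # Chronological order; track_scores keeps _score in the response.
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
        "track_scores": True,
        # Assumption: highlighting added for illustration only.
        "highlight": {"fields": {"text": {}, "overlapping_text": {}}},
    }


body = build_subtitle_query("final approach")
print(json.dumps(body, indent=2))
```

The resulting dict is the JSON body of the _search request and can be sent with whichever client you already use.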

Yielding:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
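The sort values in the response (137440 and 140476) appear to be the start timestamps expressed in milliseconds, since the date format carries no date part. To link a hit to the exact playback position, as the question asks, the same conversion can be done on the client. A minimal sketch follows; the helper name srt_to_millis is hypothetical.

```python
import re

# SRT timestamps look like HH:MM:SS,mmm (comma before the milliseconds).
_SRT_TS = re.compile(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3})$")


def srt_to_millis(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    match = _SRT_TS.match(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


start_ms = srt_to_millis("00:02:17,440")
print(start_ms)         # 137440, same as the first hit's sort value
print(start_ms / 1000)  # 137.44 seconds into the video
```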

Archived:	10 years, 10 months ago
Viewed:	411 times
Last activity:	4 years, 8 months ago