Searching subtitle data in Elasticsearch

Mik*_*ite 7 elasticsearch elasticsearch-model elasticsearch-mapping elasticsearch-query

Given the following data (a simple SRT file):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

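What is the best way to index this in Elasticsearch? There is a catch: I want the search result highlights to link to the exact time indicated by the timestamp. In addition, some phrases span multiple srt lines (such as "final approach" in the example above).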

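My ideas so far:

  • Index the srt file as a list type, with the timestamps as the index. I believe this would not match phrases that span multiple keys.
  • Create a custom tokenizer that indexes only the text part. I'm not sure how well Elasticsearch would be able to highlight the original content.
  • Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or is there another option that solves this elegantly?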

Joe*_*ook 9

Interesting question. Here's my take.

In essence, subtitles don't "know" about each other, so it is best to include the previous and the next subtitle text in each document (n - 1, n, n + 1).

As such, you'd be looking at a document structure similar to:

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}
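To see why the extra field helps, here is a minimal sketch using the two sample lines from the question. The helper `contains_phrase` is hypothetical and uses a naive substring check as a stand-in for a real phrase match; it only illustrates that the phrase is found in the combined text and in neither line alone.

```python
# Two consecutive subtitle lines from the sample SRT above.
lines = ["Senator, we're making our final", "approach into Coruscant."]
phrase = "final approach"


def contains_phrase(text, phrase):
    """Naive, case-insensitive substring check (a stand-in for a phrase match)."""
    return phrase.lower() in text.lower()


# Checked line by line, the phrase is never found...
per_line_hits = [contains_phrase(line, phrase) for line in lines]

# ...but it is found once neighbouring lines are joined together.
overlapping_hit = contains_phrase(" ".join(lines), phrase)
```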

To arrive at such a document structure, I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs

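Once the subtitles are parsed, they can be ingested into ES. Before doing so, set up the following mapping so that your timestamps are properly searchable and sortable: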

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
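With a time-only format, the `as_timestamp` sub-field appears to resolve to milliseconds counted from midnight, which is consistent with the sort values (137440 and 140476) in the response at the end of this answer. If you need the same number on the client side, for example to seek a video player, a minimal sketch could look like this. The helper name `srt_timestamp_to_ms` is hypothetical and not part of any library.

```python
def srt_timestamp_to_ms(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    # Split off the millisecond part, which SRT separates with a comma.
    hms, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


start_ms = srt_timestamp_to_ms("00:02:17,440")
```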

Once that's done, proceed to ingest:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)
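For larger subtitle collections it may be preferable to build the bulk actions lazily rather than as one list. The sketch below assumes the same document shape as above; `generate_actions` is a hypothetical helper, and the index name is just the one used in this answer. Because it is a plain generator, it can be checked without a running cluster.

```python
def generate_actions(subs, index_name="my_subtitles_index"):
    """Yield one bulk action per parsed subtitle document."""
    for sub in subs:
        yield {
            "_index": index_name,
            "_id": sub["sub_id"],
            "_source": sub,
        }


# Example input shaped like the output of parse_subs().
sample_subs = [
    {
        "sub_id": 0,
        "start": "00:02:17,440",
        "end": "00:02:20,375",
        "text": "Senator, we're making our final",
        "overlapping_text": "Senator, we're making our final approach into Coruscant.",
    }
]

actions = list(generate_actions(sample_subs))
```

The generator can be passed to `bulk(es, generate_actions(es_ready_subs))` in place of the list, since the helper accepts any iterable of actions.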

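After ingestion, you can target the original subtitle text field and boost it if it matches your phrase directly. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.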

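Before returning, you can make sure the hits are sorted by start, ascending. This somewhat defeats the purpose of boosting, but if you do sort, you can set track_scores to true in the URI to make sure the originally computed scores are returned too.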

Putting it all together:

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
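If you are issuing the search from Python, as in the ingestion step, the same request body can be built for an arbitrary phrase. This is only a sketch: `build_subtitle_query` is a hypothetical helper, and `track_scores` is placed in the body here instead of the URI. How the body is handed to the client's search call depends on the client version, so that part is left out.

```python
def build_subtitle_query(phrase, boost=2):
    """Build the search body: boosted exact-line match plus overlap fallback."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Direct match on a single subtitle line, boosted.
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # Fallback for phrases that span neighbouring lines.
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # Chronological order; track_scores keeps _score despite sorting.
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
        "track_scores": True,
    }


body = build_subtitle_query("final approach")
```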

which yields:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
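To get from such a response back to the "link to the exact time" requirement in the question, the hits can be reduced to the fields a player link would need. The following is a rough sketch that assumes the response shape shown above; `summarize_hits` is a hypothetical helper, and the sample response is trimmed to the fields it reads.

```python
def summarize_hits(response):
    """Reduce search hits to subtitle id, text and start offset."""
    results = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        start_ms = hit["sort"][0]  # sort value of start.as_timestamp
        results.append({
            "sub_id": source["sub_id"],
            "text": source["text"],
            "start": source["start"],
            "start_ms": start_ms,
            "start_seconds": start_ms / 1000,
        })
    return results


# Trimmed copy of the response above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "sub_id": 0,
                    "start": "00:02:17,440",
                    "text": "Senator, we're making our final",
                },
                "sort": [137440],
            },
            {
                "_source": {
                    "sub_id": 1,
                    "start": "00:02:20,476",
                    "text": "approach into Coruscant.",
                },
                "sort": [140476],
            },
        ]
    }
}

summary = summarize_hits(sample_response)
```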