Mik*_*ite 7 elasticsearch elasticsearch-model elasticsearch-mapping elasticsearch-query
Given the following data (a simple SRT file):
1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final
2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.
...
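What is the best way to index this in Elasticsearch? Here is the catch: I want the highlighted search results to link to the exact time indicated by the timestamp. In addition, some phrases span multiple SRT lines (for example, `final approach` in the sample above).
My ideas are
Or is there another option that solves this elegantly?
Interesting question. Here's my take.
In essence, the subtitles don't "know" about each other, so it is best to include the previous and next subtitle text in each document (n - 1, n, n + 1).
So you'd be looking for a document structure similar to: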

    {
      "sub_id" : 0,
      "start" : "00:02:17,440",
      "end" : "00:02:20,375",
      "text" : "Senator, we're making our final",
      "overlapping_text" : "Senator, we're making our final approach into Coruscant."
    }

To arrive at this document structure, I used the following (inspired by this excellent answer):

    from itertools import groupby
    from collections import namedtuple


    def parse_subs(fpath):
        # "chunk" our input file, delimited by blank lines
        with open(fpath) as f:
            res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

        Subtitle = namedtuple('Subtitle', 'sub_id start end text')

        subs = []

        # grouping
        for sub in res:
            if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
                sub = [x.strip() for x in sub]
                sub_id, start_end, *content = sub  # py3 syntax
                start, end = start_end.split(' --> ')

                # ints only
                sub_id = int(sub_id)

                # join multi-line text
                text = ', '.join(content)

                subs.append(Subtitle(
                    sub_id,
                    start,
                    end,
                    text
                ))

        es_ready_subs = []

        for index, sub_object in enumerate(subs):
            prev_sub_text = ''
            next_sub_text = ''

            if index > 0:
                prev_sub_text = subs[index - 1].text + ' '

            if index < len(subs) - 1:
                next_sub_text = ' ' + subs[index + 1].text

            es_ready_subs.append(dict(
                **sub_object._asdict(),
                overlapping_text=prev_sub_text + sub_object.text + next_sub_text
            ))

        return es_ready_subs

Once the subtitles are parsed, they can be ingested into ES. Before doing that, set up the following mapping so that your timestamps are properly searchable and sortable:

    PUT my_subtitles_index
    {
      "mappings": {
        "properties": {
          "start": {
            "type": "text",
            "fields": {
              "as_timestamp": {
                "type": "date",
                "format": "HH:mm:ss,SSS"
              }
            }
          },
          "end": {
            "type": "text",
            "fields": {
              "as_timestamp": {
                "type": "date",
                "format": "HH:mm:ss,SSS"
              }
            }
          }
        }
      }
    }

Once that's done, proceed with the ingestion:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    from utils.parse import parse_subs

    es = Elasticsearch()

    es_ready_subs = parse_subs('subs.txt')

    actions = [
        {
            "_index": "my_subtitles_index",
            "_id": sub_group['sub_id'],
            "_source": sub_group
        } for sub_group in es_ready_subs
    ]

    bulk(es, actions)

After ingestion, you can target the original subtitle `text` and boost it if it directly matches your phrase. Otherwise, add a fallback on `overlapping_text` to ensure that both "overlapping" subtitles are returned.
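Before returning, you can make sure the hits are sorted by `start`, ascending. This defeats the purpose of boosting, but if you do sort, you can set `track_scores` to `true` in the URI so that the originally computed scores are returned as well.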
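For callers using the Python client, the boost-plus-fallback query described above can be sketched as a plain dict builder. This is only an illustration: `build_search_body` is a hypothetical helper name, `track_scores` is placed in the request body here rather than in the URI, and the `highlight` section is my own addition (the question asks for highlighted results, but the answer does not show highlighting).

```python
def build_search_body(phrase: str, boost: float = 2.0) -> dict:
    """Build a search body: boosted match on `text`, fallback on `overlapping_text`.

    Hypothetical helper; field names follow the mapping used in this answer.
    """
    return {
        # Keep the computed scores even though we sort by timestamp.
        "track_scores": True,
        "query": {
            "bool": {
                "should": [
                    # Direct hit inside a single subtitle: boosted.
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # Phrase spanning neighbouring subtitles: fallback.
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
        # Assumption: request highlighted fragments for both fields.
        "highlight": {"fields": {"text": {}, "overlapping_text": {}}},
    }


body = build_search_body("final approach")
```

The resulting dict can then be passed to the client's search call together with the index name; the exact keyword arguments depend on the client version you have installed.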
Putting it all together:

    POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
    {
      "query": {
        "bool": {
          "should": [
            {
              "match_phrase": {
                "text": {
                  "query": "final approach",
                  "boost": 2
                }
              }
            },
            {
              "match_phrase": {
                "overlapping_text": {
                  "query": "final approach"
                }
              }
            }
          ]
        }
      },
      "sort": [
        {
          "start.as_timestamp": {
            "order": "asc"
          }
        }
      ]
    }

yields:

    {
      "hits" : {
        "hits" : [
          {
            "_index" : "my_subtitles_index",
            "_type" : "_doc",
            "_id" : "0",
            "_score" : 6.0236287,
            "_source" : {
              "sub_id" : 0,
              "start" : "00:02:17,440",
              "end" : "00:02:20,375",
              "text" : "Senator, we're making our final",
              "overlapping_text" : "Senator, we're making our final approach into Coruscant."
            },
            "sort" : [
              137440
            ]
          },
          {
            "_index" : "my_subtitles_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 5.502407,
            "_source" : {
              "sub_id" : 1,
              "start" : "00:02:20,476",
              "end" : "00:02:22,501",
              "text" : "approach into Coruscant.",
              "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
            },
            "sort" : [
              140476
            ]
          }
        ]
      }
    }
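The `sort` values in the response (137440 and 140476) are consistent with each `start` timestamp read as milliseconds since 00:00:00. To link a hit to the exact playback time, as the question asks, one option is to convert the timestamp on the client side. The sketch below is only an illustration: `srt_to_millis` and `seek_link` are hypothetical helper names, and the `?t=<seconds>` URL convention and the example.com address are assumptions about your player, not something Elasticsearch provides.

```python
def srt_to_millis(timestamp: str) -> int:
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def seek_link(base_url: str, hit: dict) -> str:
    """Build a deep link for a search hit (assumed '?t=<seconds>' convention)."""
    seconds = srt_to_millis(hit["_source"]["start"]) // 1000
    return f"{base_url}?t={seconds}"


# A hit shaped like the ones in the response above.
hit = {"_source": {"start": "00:02:17,440"}}
print(srt_to_millis(hit["_source"]["start"]))      # 137440
print(seek_link("https://example.com/watch", hit))  # https://example.com/watch?t=137
```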