I want to parse SRT subtitles:
1
00:00:12,815 --> 00:00:14,509
Chlapi, jak to jde s
t?ma pracovníma sv?tlama?.

2
00:00:14,815 --> 00:00:16,498
Trochu je zesilujeme.

3
00:00:16,934 --> 00:00:17,814
Jo, sleduj.
Each item should go into a structure. I have tried these regular expressions:
A:
RE_ITEM = re.compile(r'(?P<index>\d+).'
                     r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
                     r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
                     r'(?P<text>.*?)', re.DOTALL)
B:
RE_ITEM = re.compile(r'(?P<index>\d+).'
                     r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
                     r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
                     r'(?P<text>.*)', re.DOTALL)
And this code:
for i in Subtitles.RE_ITEM.finditer(text):
    result.append((i.group('index'), i.group('start'),
                   i.group('end'), i.group('text')))
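With code B I get only one item in the array (because of the greedy .*), and with code A I get an empty 'text' (because of the non-greedy .*?).
How can I fix this?
Thanks.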
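I got very frustrated with the srt libraries available for Python (usually because they were heavyweight and eschewed language-standard types in favour of custom classes), so I spent about a year working on my own srt library. You can get it at https://github.com/cdown/srt.
I have tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores SRT block data). It can read and write SRT files, and convert noncompliant SRT files into compliant ones.
Here is a usage example with your sample input: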
>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... t?ma pracovníma sv?tlama?.
...
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
...
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
...
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\nt?ma pracovníma sv?tlama?.'),
Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]
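If you would rather keep your own regex, here is a minimal sketch of one way to fix the greedy/non-greedy problem from the question. It is not how the srt library works internally. It assumes that blocks are separated by a blank line and that lines end with "\n". The text group stays non-greedy, but a lookahead forces it to extend up to the next blank line or the end of the input. The name parse_items and the SAMPLE string are only for illustration.

```python
import re

# Text stays lazy (.*?), but the lookahead makes it stop only at a blank
# line or at the end of the input, so it can no longer match nothing.
RE_ITEM = re.compile(
    r'(?P<index>\d+)\n'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3})\n'
    r'(?P<text>.*?)(?=\n\s*\n|\s*\Z)',
    re.DOTALL,
)

# Sample input, assuming blank lines between blocks.
SAMPLE = """1
00:00:12,815 --> 00:00:14,509
Chlapi, jak to jde s
t?ma pracovníma sv?tlama?.

2
00:00:14,815 --> 00:00:16,498
Trochu je zesilujeme.

3
00:00:16,934 --> 00:00:17,814
Jo, sleduj.
"""


def parse_items(text):
    """Return a list of (index, start, end, text) tuples, all as strings."""
    return [
        (m.group('index'), m.group('start'), m.group('end'), m.group('text'))
        for m in RE_ITEM.finditer(text)
    ]


if __name__ == '__main__':
    for item in parse_items(SAMPLE):
        print(item)
```

A lookahead does not consume the separator, so finditer can start the next match right after the previous block's text. Files with "\r\n" line endings or missing blank lines would need a more tolerant pattern.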