How to get word-level timestamps in OpenAI's Whisper ASR?

Question

Fra*_*urt 15 python speech-recognition timestamp openai-api openai-whisper

I use OpenAI's Whisper Python library for speech recognition. How can I get word-level timestamps?

To transcribe with OpenAI's Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git 
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large

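For reference, the same transcription can also be driven from Python instead of the command line. The sketch below assumes the openai-whisper package installed above and an audio file named recording.wav; format_segments and transcribe_file are hypothetical helper names introduced here for illustration. It prints the segment-level timestamps that Whisper returns by default, which is the granularity the question wants to go beyond.

```python
def format_segments(result):
    """Render each segment of a Whisper result dict as '[start -> end] text'."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
    return lines


def transcribe_file(path, model_name="large"):
    """Load a Whisper model and transcribe one audio file (hypothetical helper)."""
    import whisper  # imported lazily so format_segments works without the package

    model = whisper.load_model(model_name)
    return model.transcribe(path)


# Example usage (requires openai-whisper, ffmpeg and the audio file):
# for line in format_segments(transcribe_file("recording.wav")):
#     print(line)
```

Each segment typically covers a phrase or sentence, so these timestamps are too coarse when per-word timing is needed.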

If you are using an Nvidia GeForce RTX 3090, add the following after the conda activate whisperpy39 step:

pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch

Answer 1

小智 13

I created a repository to recover word-level timestamps (and confidence scores), as well as more accurate segment timestamps: https://github.com/Jeronymous/whisper-timestamped

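It is built on Whisper's cross-attention weights, as shown in the notebook in the Whisper repository. I tweaked the approach slightly to get better word locations, and added the possibility of getting the cross-attention weights on the fly, so there is no need to run the Whisper model twice. There are no memory issues when processing long audio.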
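
A minimal usage sketch, based on the usage documented in the repository's README and assuming the whisper_timestamped package is installed from it. The helper names iter_words and transcribe_with_words, the file name, and the model size are assumptions made for illustration, and the exact result keys may differ between versions.

```python
def iter_words(result):
    """Yield (text, start, end, confidence) for every word in a result dict."""
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            yield w["text"], w["start"], w["end"], w.get("confidence")


def transcribe_with_words(path, model_name="tiny", device="cpu"):
    """Transcribe one file with whisper-timestamped (hypothetical helper)."""
    import whisper_timestamped as whisper  # lazy import: needs the package installed

    audio = whisper.load_audio(path)
    model = whisper.load_model(model_name, device=device)
    return whisper.transcribe(model, audio)


# Example usage (requires whisper-timestamped, ffmpeg and the audio file):
# for text, start, end, conf in iter_words(transcribe_with_words("recording.wav")):
#     print(f"{start:7.2f} {end:7.2f} {text} (confidence: {conf})")
```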

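Note: I first tried using a wav2vec model to realign the words transcribed by Whisper with the input audio. It works reasonably well, but it has many drawbacks: it requires handling a separate (wav2vec) model, running another inference pass over the full signal, having one wav2vec model per language, and normalizing the transcribed text so that its character set fits that of the wav2vec model (e.g. converting digits to letters, and handling symbols like "%", currencies...). The alignment can also have trouble with disfluencies, which Whisper usually removes (so part of what the wav2vec model would recognize is missing from the transcript, such as the beginning of a sentence that gets reformulated).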
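
To illustrate the normalization drawback, here is a rough sketch of the kind of preprocessing a character-level alignment model would need. It is not the code used in the experiment described above: the vocabulary, the symbol table, and the function name normalize_for_alignment are all hypothetical, and digits are naively spelled out one at a time.

```python
import re

# Hypothetical tables; a real system would need a proper number-to-words
# library and per-language rules (currencies also require reordering).
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
SYMBOLS = {"%": " percent ", "&": " and "}


def normalize_for_alignment(text, vocab="abcdefghijklmnopqrstuvwxyz' "):
    """Map a transcript onto an assumed lowercase character vocabulary."""
    text = text.lower()
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Spell out ASCII digits one by one (so "42" becomes "four two").
    text = re.sub(r"[0-9]", lambda m: " " + DIGITS[m.group()] + " ", text)
    # Drop every character the assumed model vocabulary cannot represent.
    text = "".join(ch for ch in text if ch in vocab)
    # Collapse the extra whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_alignment("It costs 5% more!"))
```

Even this toy version shows the problem: the normalized text no longer matches the original transcript word for word, so the aligned timestamps have to be mapped back onto Whisper's tokens.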