
OpenAI - Whisper to generate missing subtitles for videos



Baenwort

Hi, Emby's subtitle feature has come a long way since I started using MB3, and I think the next step is now possible.

OpenAI has released 'Whisper', a Python module (a C++ implementation is also available) that can be run locally to generate subtitles in 99 languages (although English is currently the most efficient from a resource perspective) for any file containing audio. It is an open-source model that could be implemented (maybe as a plugin) across many operating systems and would allow generating subtitles for media that currently lacks them. Without hardware support it can't run in real time, but as a scheduled task it could run during a server's slow periods.
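
A minimal sketch of the idea (nothing Emby-specific; the model name and file paths are just placeholders): the openai-whisper Python package can transcribe any file ffmpeg can read, and the segments it returns are enough to write a basic .srt:

    import whisper

    def srt_time(t: float) -> str:
        # format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(t * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("base.en")      # small English-only model; larger models are slower but more accurate
    result = model.transcribe("episode.mkv")   # whisper uses ffmpeg internally to extract the audio

    with open("episode.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")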

https://arstechnica.com/information-technology/2022/09/new-ai-model-from-openai-automatically-recognizes-speech-and-translates-to-english/ for an overview

https://github.com/openai/whisper for the project's GitHub repository.



I have followed OpenAI's projects, many of which are mind-blowing, but I'm not sure about this one. It is well known by now what can be achieved by training transformer models on gigantic data sets. By comparison, Google's speech recognition and audio transcription capabilities are more impressive, as they work in real time with minimal resources.

Someone has extracted that functionality to make it work standalone, but of course that's a kind of hack. Still interesting.

What would be kind of a holy grail in that area would be to:

  • Recognize speech
  • Recognize speakers
  • Create subtitles from recognized speech
  • Translate subtitles to another language
  • Filter out the original voices from the audio
  • Use text-to-speech to let the actors speak in a different language

A while ago, I did an experiment regarding the last point:

https://user-images.githubusercontent.com/4985349/136570224-88a65ced-bb98-49fa-bcd9-1e766f90af26.mp4
https://gist.github.com/softworkz/3425fc196f5c7eac9e842a655c7e1e5c

It's the same language and this version mixes the original and the TTS voices for comparison.

 

All of these things are experimental, though - only Google's ASR would be production-ready, but it's not free to use (it works in the browser, but you can't even copy out the transcribed text in any way).

And generally: features that don't work on all platforms, require excessive hardware, require installing frameworks like PyTorch, require downloading massive amounts of data, need manual installation and intervention, and ultimately can't even run in real time as part of Emby's media delivery pipeline are not a great match for integration into the Emby core server (or Emby's ffmpeg), because with all those preconditions they wouldn't reach the masses.

It might be a nice idea for an Emby plugin, though. 

Also, I'm sure that at some point there will be approaches to audio transcription that are handier and more universal to integrate.


  • 1 month later...
Baenwort

So there is a C++ version that is working toward real-time subtitle generation: https://github.com/ggerganov/whisper.cpp

However, even before real-time use is feasible everywhere (transcoding isn't available everywhere either, and never will be, so I don't think that should be a limiting requirement), this would be nice to have for offline scanning and building .srt subtitles for media that has no match in any online database. Emby already allows a 'best fit' for downloaded subtitles, even though that causes issues when less knowledgeable people use it, because it can still improve things; this could help in a similar way when there isn't even a close match available online.
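
To make the offline-scan idea concrete, here is a rough sketch (not Emby code; the binary, model and library paths are assumptions for illustration) of such a scheduled job: walk a library, and for any video without a sibling .srt, extract 16 kHz mono WAV with ffmpeg and run the whisper.cpp CLI with its -osrt output option:

    import subprocess
    from pathlib import Path

    LIBRARY = Path("/media/tv")                        # hypothetical library root
    WHISPER_BIN = Path("/opt/whisper.cpp/main")        # hypothetical build location of the whisper.cpp CLI
    MODEL = Path("/opt/whisper.cpp/models/ggml-base.en.bin")
    VIDEO_EXTS = {".mkv", ".mp4", ".avi"}

    for video in sorted(LIBRARY.rglob("*")):
        if video.suffix.lower() not in VIDEO_EXTS:
            continue
        if video.with_suffix(".srt").exists():         # already has a local subtitle
            continue
        wav = video.with_suffix(".wav")
        # whisper.cpp expects 16-bit PCM WAV at 16 kHz
        subprocess.run(["ffmpeg", "-y", "-i", str(video), "-vn",
                        "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(wav)],
                       check=True)
        # -osrt writes an .srt; -of sets the output name (without extension) next to the video
        subprocess.run([str(WHISPER_BIN), "-m", str(MODEL), "-f", str(wav),
                        "-osrt", "-of", str(video.with_suffix(""))],
                       check=True)
        wav.unlink()                                   # clean up the temporary audio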


  • 2 weeks later...
Baenwort

They have gotten whisper.cpp to produce a transcription that also highlights the word currently being spoken - a karaoke mode that would be REALLY nice for Emby's music function with lyrics.

https://github.com/ggerganov/whisper.cpp#karaoke-style-movie-generation-experimental

This would allow something similar to the sing-along mode on discs, or even what Amazon Music does with lyrics!
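
Recent versions of the openai-whisper Python package expose the same word-level timing through word_timestamps=True on transcribe(), which is roughly the information a lyrics/karaoke view would need. A tiny sketch (the song path and the LRC-style output are just placeholders):

    import whisper

    model = whisper.load_model("base.en")
    result = model.transcribe("song.mp3", word_timestamps=True)

    # print one LRC-style line per word with the time it starts being spoken
    for seg in result["segments"]:
        for word in seg.get("words", []):
            minutes, seconds = divmod(word["start"], 60)
            print(f"[{int(minutes):02d}:{seconds:05.2f}] {word['word'].strip()}")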

