
OpenAI - Whisper to generate missing subtitles for videos



Baenwort

Hi, Emby's subtitle feature has come a long way since I started using MB3, and I think the next step is now possible.

OpenAI has released 'Whisper', a Python module (a C++ implementation is also available) that can be run locally to generate subtitles in 99 languages (although English is currently the most efficient from a resource perspective) for any file containing audio. It is an open-source model that could be implemented (maybe as a plugin) across many operating systems and would allow generating subtitles for media that currently lacks them. Without hardware support it can't run in real time, but as a scheduled task it could run during a server's slow periods.
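
A minimal sketch of the idea (nothing Emby-specific; the model name and file paths are just placeholders): the openai-whisper Python package can transcribe any file ffmpeg can read, and the segments it returns are enough to write a basic .srt:

    import whisper

    def srt_time(t: float) -> str:
        # format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(t * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("base.en")      # small English-only model; larger models are slower but more accurate
    result = model.transcribe("episode.mkv")   # whisper uses ffmpeg internally to extract the audio

    with open("episode.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")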

https://arstechnica.com/information-technology/2022/09/new-ai-model-from-openai-automatically-recognizes-speech-and-translates-to-english/ for an overview

https://github.com/openai/whisper for the project's GitHub repository.



I have followed OpenAI's projects, many of which are mind-blowing, but I'm not sure about this one. It is well known by now what can be achieved by training transformer models on gigantic data sets. By comparison, Google's speech recognition and audio transcription capabilities are more impressive, as they work in real time with minimal resources.

Someone has extracted that functionality to make it work standalone, but of course that's a kind of hack. Still interesting.

What would be kind of a holy grail in that area would be to:

  • Recognize speech
  • Recognize speakers
  • Create subtitles from recognized speech
  • Translate subtitles to another language
  • Filter out the original voices from the audio
  • Use text-to-speech to let the actors speak in a different language

A while ago, I did an experiment regarding the last point:

https://user-images.githubusercontent.com/4985349/136570224-88a65ced-bb98-49fa-bcd9-1e766f90af26.mp4
https://gist.github.com/softworkz/3425fc196f5c7eac9e842a655c7e1e5c

It's the same language and this version mixes the original and the TTS voices for comparison.

 

All of these things are experimental, though - only Google's ASR would be production-ready, but it's not free to use (it works in the browser, but you can't even copy out the transcribed text in any way).

And generally: features that don't work on all platforms, require excessive hardware, require installing frameworks like PyTorch, require downloading massive amounts of data, need manual installation and intervention, and ultimately can't even run in real time as part of Emby's media delivery pipeline are not a great match for integration into the Emby core server (or Emby's ffmpeg), because with all those preconditions they wouldn't reach the masses.

It might be a nice idea for an Emby plugin, though. 

Also, I'm sure that at some point there will be approaches to audio transcription that are handier and more universal to integrate.


  • 1 month later...
Baenwort

So there is a C++ version that is working toward real-time subtitle generation: https://github.com/ggerganov/whisper.cpp

However, even before real-time use is feasible everywhere (transcoding isn't available everywhere either, and never will be, so I don't think that should be a limiting requirement), this would be nice to have for offline scanning and building .srt subtitles for media that has no match in any online database. Emby already allows a 'best fit' for downloaded subtitles, even though that causes issues when less knowledgeable people use it, because it can still improve things; this could help in a similar way when there isn't even a close match available online.
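
To make the offline-scan idea concrete, here is a rough sketch (not Emby code; the binary, model and library paths are assumptions for illustration) of such a scheduled job: walk a library, and for any video without a sibling .srt, extract 16 kHz mono WAV with ffmpeg and run the whisper.cpp CLI with its -osrt output option:

    import subprocess
    from pathlib import Path

    LIBRARY = Path("/media/tv")                        # hypothetical library root
    WHISPER_BIN = Path("/opt/whisper.cpp/main")        # hypothetical build location of the whisper.cpp CLI
    MODEL = Path("/opt/whisper.cpp/models/ggml-base.en.bin")
    VIDEO_EXTS = {".mkv", ".mp4", ".avi"}

    for video in sorted(LIBRARY.rglob("*")):
        if video.suffix.lower() not in VIDEO_EXTS:
            continue
        if video.with_suffix(".srt").exists():         # already has a local subtitle
            continue
        wav = video.with_suffix(".wav")
        # whisper.cpp expects 16-bit PCM WAV at 16 kHz
        subprocess.run(["ffmpeg", "-y", "-i", str(video), "-vn",
                        "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(wav)],
                       check=True)
        # -osrt writes an .srt; -of sets the output name (without extension) next to the video
        subprocess.run([str(WHISPER_BIN), "-m", str(MODEL), "-f", str(wav),
                        "-osrt", "-of", str(video.with_suffix(""))],
                       check=True)
        wav.unlink()                                   # clean up the temporary audio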


  • 2 weeks later...
Baenwort

They have gotten whisper.cpp to produce a transcription that also highlights the word currently being spoken - a karaoke mode that would be REALLY nice for Emby's music function with lyrics.

https://github.com/ggerganov/whisper.cpp#karaoke-style-movie-generation-experimental

This would allow something similar to the sing-along mode on discs, or even what Amazon Music does with lyrics!
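
Recent versions of the openai-whisper Python package expose the same word-level timing through word_timestamps=True on transcribe(), which is roughly the information a lyrics/karaoke view would need. A tiny sketch (the song path and the LRC-style output are just placeholders):

    import whisper

    model = whisper.load_model("base.en")
    result = model.transcribe("song.mp3", word_timestamps=True)

    # print one LRC-style line per word with the time it starts being spoken
    for seg in result["segments"]:
        for word in seg.get("words", []):
            minutes, seconds = divmod(word["start"], 60)
            print(f"[{int(minutes):02d}:{seconds:05.2f}] {word['word'].strip()}")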

