Performance difference between embedded ffmpeg in docker images for 3.5.3.0 and 4.0.1.0


ken-ji

In the slower case there was a scaling filter applied which wasn't applied with hwa enabled and wasn't even necessary.

This is probably a little bug in the old version or maybe the file being played wasn't analyzed correctly (or not yet) which made Emby add the scaling to limit the image size.

Unfortunately, you didn't use the same file for both logs.

 

You may try this comparison with 4.0 and see what happens, but please keep using the same file for testing if you want to post more logs for comparison.

Edited by softworkz

Ok. I tried it with the same file as the previous 3.5.3 log, and there seems to be a 5 fps disparity, but I guess there's no real solution for me other than changing processors at this point.

Thanks for being patient with me.


  • 2 weeks later...

I had a bizarre idea and took a look at the ffmpeg command line options, and I noticed this:

-filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap=mode=read+write+direct,format=nv12...

 

and I was wondering whether we could allow copies to be made rather than forcing direct mode:

-filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap,format=nv12

 

and it seems to work fairly fast in this case:

(read+write) frame=  550 fps= 62 q=-0.0 Lsize=N/A time=00:00:23.38 bitrate=N/A speed=2.

vs

(read+write+direct) frame=  546 fps= 22 q=-0.0 size=N/A time=00:00:23.33 bitrate=N/A speed=0.941x
 
Is there any way we can force/tweak these options? I'd like to see if there are any compatibility issues...
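
For anyone who wants to repeat the comparison, here is a minimal sketch (Python, not part of Emby or ffmpeg) that pulls the fps value out of the two progress lines above and prints the ratio. The function name `parse_fps` is made up for illustration; the lines are copied as posted, including the truncated speed field on the first one, which the sketch does not read.

```python
import re

# Matches the "fps=" field of an ffmpeg progress line, e.g. "fps= 62".
FPS_RE = re.compile(r"\bfps=\s*([0-9.]+)")


def parse_fps(progress_line: str) -> float:
    """Return the fps value reported in one ffmpeg progress line."""
    match = FPS_RE.search(progress_line)
    if match is None:
        raise ValueError("no fps field found in progress line")
    return float(match.group(1))


# Progress lines as posted above (the first one is truncated in the post).
without_direct = "frame=  550 fps= 62 q=-0.0 Lsize=N/A time=00:00:23.38 bitrate=N/A speed=2."
with_direct = "frame=  546 fps= 22 q=-0.0 size=N/A time=00:00:23.33 bitrate=N/A speed=0.941x"

ratio = parse_fps(without_direct) / parse_fps(with_direct)
print(f"without direct is {ratio:.2f}x the fps of direct mode")
```

This only compares two single runs, so it says nothing about variance between runs or other files.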
Edited by ken-ji

Could you please post the full command line? I'll need to see the context.


Generally speaking, there is room for improvement, specifically regarding filter chain creation.

 

But there is an enormous number of variations that we need to cover, and certain things may work in one case but not in another.

 

The good news is that this is an area where improvements are planned for the near future.


ken-ji

/bin/ffmpeg -init_hw_device vaapi=vad0:/dev/dri/renderD128 -filter_hw_device vad0 -f matroska -i file:"/mnt/user/Media/Anime Series/Mahouka Koukou no Rettousei/[Doki] Mahouka Koukou no Rettousei - 16 (1920x1080 Hi10P BD FLAC) [F97267AB].mkv" -threads 0 -map 0:1 -map 0:2 -c:v:0 h264_vaapi -copyts -filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap=mode=read+write+direct,format=nv12,subtitles='/mnt/user/Media/Anime Series/Mahouka Koukou no Rettousei/[Doki] Mahouka Koukou no Rettousei - 16 (1920x1080 Hi10P BD FLAC) [F97267AB].mkv:si=0':force_style='FontName=Droid Sans Fallback':fontsdir='/config/fonts',hwmap" -b:v:0 4451487 -maxrate 4451487 -bufsize 8902974 -profile high -level 4.1 -look_ahead 0 -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -vsync -1 -codec:a:0 aac -strict experimental -metadata:s:a:0 language=jpn -disposition:a:0 default -ac:a:0 2 -ab:a:0 192000 -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "/transcoding/transcoding-temp/c73fab6ff193645686d40416929f3d27.m3u8" -y "/transcoding/transcoding-temp/c73fab6ff193645686d40416929f3d27%d.ts"
 


Thanks for the line. So the hwmap is for allowing subtitle burn-in.

 

Actually the direct mode is meant to avoid copying frames between system and gpu memory.

 

Before we look into this any further: In your second example without the direct mode option, have you watched the output file and checked whether the subtitles are actually burnt into the video?


  • 9 months later...

@@Luke @@softworkz Just wondering where we are with tweaking this ffmpeg hwmap option, which seems to be the cause of the poor subtitle burn-in performance. I admit my testing is limited to an Intel GPU via Quick Sync, but wouldn't it make sense to disable the direct hwmap option on Intel GPUs, or to allow it to be disabled so we can see the effects?


What do you mean by "disable the direct hwmap"? What alternative would you suggest?

 

 

These are the possible variants I can think of:

 

  • Download/overlay/upload: We could download the video frames from the GPU, perform the overlay on the CPU, then upload them again for encoding.

    When I last tested this, it was (as you would expect) slower than hwmap, because every frame is copied from GPU memory to CPU memory and back to GPU memory after the overlay.

  • HWMAP: This avoids copying the video frames by mapping GPU memory into CPU memory. This variant is only possible when the GPU uses shared system memory (e.g. onboard graphics), where the memory is physically the same, and when the GPU supports that special mode.

    The subtitles are burnt in by the subtitles filter as if it were overlaying onto local video frames.

  • Have a subtitles filter that renders the text as a video of transparent images, upload that to the GPU, then do the overlay in hardware.

    As nice as that sounds, afaik the subtitles filter does not support this, because it relies on having the original video frames to synchronize the timing, so overlaying onto an empty video wouldn't work.

    (That approach might work for graphical subtitles, though.)

  • Create a modified subtitles filter that can work with, and synchronize with, hardware frames.

    Sounds easy, but will require significant work.
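
To make the first two variants concrete, here is a rough sketch (Python, only building filter strings) of what the two chains might look like. The hwmap chain follows the command lines posted in this thread. The download/upload chain is an assumption built from the standard ffmpeg `hwdownload` and `hwupload` filters and has not been tested here; the function names and the `input.mkv` path are hypothetical.

```python
def hwmap_chain(subtitle_filter: str, mode: str = "read+write+direct") -> str:
    """HWMAP variant: map GPU frames into CPU memory, burn in, map back.

    Mirrors the chain from the command lines posted in this thread.
    """
    return ",".join([
        "scale_vaapi",
        f"hwmap=mode={mode}",
        "format=nv12",
        subtitle_filter,
        "hwmap",
    ])


def download_upload_chain(subtitle_filter: str) -> str:
    """Download/overlay/upload variant (assumed layout, untested).

    Frames are copied to system memory, the subtitles are burnt in
    there, and the result is copied back to the GPU for encoding.
    """
    return ",".join([
        "scale_vaapi",
        "hwdownload",
        "format=nv12",
        subtitle_filter,
        "format=nv12",
        "hwupload",
    ])


# Hypothetical subtitle filter; the real one carries the full media path.
subs = "subtitles='input.mkv':si=0"
print(hwmap_chain(subs))
print(download_upload_chain(subs))
```

The `hwupload` step would need a device set with -filter_hw_device, as in the first command line posted above.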

 

If you have any better idea...


@@softworkz I meant this actual command

/bin/ffmpeg -loglevel +timing -y -copyts -start_at_zero -f matroska,webm -hwaccel:0 vaapi -hwaccel_device:0 /dev/dri/renderD128 -hwaccel_output_format:0 vaapi -i "/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv" -filter_complex "[0:0]scale_vaapi,hwmap=mode=read+write+direct,format=nv12,subtitles='/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv':si=0:force_style='FontName=Droid Sans Fallback':fontsdir='/config/fonts',hwmap" -map 0:0 -map 0:1 -sn -c:v:0 h264_vaapi -b:v:0 4476127 -g:v:0 72 -maxrate:v:0 4476127 -bufsize:v:0 8952254 -sc_threshold:v:0 0 -keyint_min:v:0 72 -profile:v:0 high -level:v:0 4.1 -c:a:0 aac -ab:a:0 192000 -ac:a:0 2 -metadata:s:a:0 language=jpn -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list /transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c.m3u8 -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -segment_write_temp 1 "/transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c%d.ts"

 

I would like to be able to run it like this:

/bin/ffmpeg -loglevel +timing -y -copyts -start_at_zero -f matroska,webm -hwaccel:0 vaapi -hwaccel_device:0 /dev/dri/renderD128 -hwaccel_output_format:0 vaapi -i "/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv" -filter_complex "[0:0]scale_vaapi,hwmap=mode=read+write,format=nv12,subtitles='/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv':si=0:fontsdir='/config/fonts',hwmap" -map 0:0 -map 0:1 -sn -c:v:0 h264_vaapi -b:v:0 4476127 -g:v:0 72 -maxrate:v:0 4476127 -bufsize:v:0 8952254 -sc_threshold:v:0 0 -keyint_min:v:0 72 -profile:v:0 high -level:v:0 4.1 -c:a:0 aac -ab:a:0 192000 -ac:a:0 2 -metadata:s:a:0 language=jpn -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list /transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c.m3u8 -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -segment_write_temp 1 "/transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c%d.ts"

 

I removed the direct option from hwmap, leaving it at read+write, and the whole encode went from about 18 fps to 80+ fps. I also disabled the annoying forcing of Droid Sans Fallback as the only font. This seems to work well for my specific use case. I have no idea why omitting direct lets the whole transcode run faster than with direct enabled. Maybe because I'm running in a Docker container?
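
The hwmap part of that change is small enough to sketch as a string rewrite (Python, purely illustrative; this is not how Emby builds its command line, and `drop_direct` is a made-up name). It removes only the direct flag from any hwmap mode list and leaves the rest of the filter graph alone:

```python
import re

# Matches "hwmap=mode=<flags>", where flags are joined with "+".
HWMAP_MODE_RE = re.compile(r"hwmap=mode=([a-z+]+)")


def drop_direct(filter_graph: str) -> str:
    """Remove the "direct" flag from every hwmap mode in a filter graph."""
    def _rewrite(match: re.Match) -> str:
        flags = [f for f in match.group(1).split("+") if f != "direct"]
        if not flags:
            # Nothing left: fall back to plain hwmap with default mode.
            return "hwmap"
        return "hwmap=mode=" + "+".join(flags)

    return HWMAP_MODE_RE.sub(_rewrite, filter_graph)


original = "[0:0]scale_vaapi,hwmap=mode=read+write+direct,format=nv12"
print(drop_direct(original))
# [0:0]scale_vaapi,hwmap=mode=read+write,format=nv12
```

A bare `hwmap` with no mode (like the one at the end of the chain) is left untouched.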

I guess this is the same as the HWMAP option you mentioned?

Edited by ken-ji

Thanks for your reply. This is quite interesting. Actually, the 'direct' option is meant to avoid copying and fail if that is not possible. But it's quite unexpected that it is causing a slowdown.

 

We will investigate and test this further on various systems and situations.

 

But it's a very good hint! We are about to re-work the whole hardware filter-chaining anyway, so you'll see some progress in the beta channel during the next few weeks.

 

Thanks a lot for pointing this out!

 

softworkz


What you're experiencing could very well be caused by the fact that you're running inside Docker and that it's not possible to really get direct access to the system memory.


  • 5 weeks later...

@@softworkz

I saw the Diagnostics plugin for 4.4.0.8 and tried it out :D

The option that interested me the most is the parameter adjustment, as it allowed me to disable the direct hwmapping, which mitigates the slow performance we were talking about with the Docker version.

Do we have a timeline for:

* disabling the direct hwmapping for docker versions

* disabling the forced font styling of Droid Sans Fallback

 

I also noticed, only in this version, that when transcoding is done, the client (Emby for Fire TV 1.5.73a) still displays the soft subtitles along with the burned-in subs.


The option that interested me the most is the parameter adjustment as it allowed me to disable the direct hwmapping - which  mitigates the slow performance we were talking about with the docker version.

 Do we have a timeline for

* disabling the direct hwmapping for docker versions

* disabling the forced font styling of Droid Sans Fallback

 

Those are in fact good candidates for testing the alternatives via diagnostic options.

 

 

and I noticed only in this version, that if transcoding is done, the client (Emby for Fire TV 1.5.73a) the subtitles would still soft display along with the burned in subs.

 

Do you mean after you activated "force subtitle burn-in" in the diagnostic options?

Actually I didn't enable force subtitle burn-in. I only disabled "Allow subtitle extraction on the fly"

 


@@Luke

Yes, I know it does cause a lot of subtitle burn-in, which I kinda prefer, as a lot of the stuff I watch has ASS subtitles and I prefer the advanced formatting.

Treating ASS as simple subtitles causes awkward scenarios, like two or more people talking and only one person's subs being shown, or a lot of small on-screen text being translated and the whole screen filling with subtitles (no positioning or font styling). Speaking of fonts, can we make the forcing of Droid Sans Fallback a configurable option (i.e. allow us to add a font file and use that, or turn the setting off altogether)?


We can't yet allow turning it off, as on some platforms it will fail. But hopefully down the line we can work those things out.


  • 3 weeks later...

I've given it a try on the browser client, remotely.

Though it seems a lot of my stuff is now transcoded in software (probably because of the subtitles).

I was able to try watching a 4K HEVC file with image subtitles; it transcoded using hardware and seemed just as fast as hwmap:read+write.

Seems like we are going in the right direction.


I've given it a try on the browser client, remotely.

though seems a lot of my stuff is now transcoded using software (probably because of the subtitles)

 

You mean because of the browser client?

 

 

I was able to try watching a 4k HEVC file with image subtitles and it transcoded using hardware and seemed just as fast as hwmap:read+write

Seems like we are going the right direction.

 

What do you mean by "just as fast"? Earlier you said that hwmap would be slow on Docker and that we should use hwdownload instead, which is what we're doing right now (at least temporarily, to test how that compares).


@@softworkz

You mean because of the browser client?

 

I'm currently busy and unable to test with a client device like Roku or FireTV

 

 

What do you mean by "just as fast". Earlier you said that hwmap would be slow on Docker and that we should use hwdownload instead, which is what we're doing right now (at least temporarily for testing how that compares).

 

Sorry if I wasn't clear.

I meant it's running pretty fast on the few videos I tried (with HW transcoding): about 60 fps, which is more than the ~20 fps we got from HW transcoding before with hwmap direct.

I mentioned before that hwmap was faster in Docker containers if you omitted the direct option, hence my answer that it was "as fast as hwmap:read+write".
 
