Performance difference between embedded ffmpeg in docker images for 3.5.3.0 and 4.0.1.0

February 15, 2019

In the slower case there was a scaling filter applied which wasn't applied with hwa enabled and wasn't even necessary.

This is probably a little bug in the old version or maybe the file being played wasn't analyzed correctly (or not yet) which made Emby add the scaling to limit the image size.

Unfortunately, you didn't use the same file for both logs.

You may try this comparison with 4.0 and see what happens, but please keep using the same file for testing if you want to post more logs for comparison.

Edited February 15, 2019 by softworkz

February 19, 2019

Ok. Tried it with the same file as the previous 3.5.3 log and there seems to be a 5 fps disparity, but I guess there's no real solution for me other than changing processors at this point.

Thanks for being patient with me.

February 28, 2019

I had a bizarre idea and took a look at the ffmpeg command line options, and I noticed this:

-filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap=mode=read+write+direct,format=nv12...

and I was wondering whether we could allow copies to be made rather than forcing direct mode:

-filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap,format=nv12

and it seems to work fairly fast in this case:

(read+write) frame= 550 fps= 62 q=-0.0 Lsize=N/A time=00:00:23.38 bitrate=N/A speed=2.

vs

(read+write+direct) frame= 546 fps= 22 q=-0.0 size=N/A time=00:00:23.33 bitrate=N/A speed=0.941x
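
For anyone comparing runs like the two above, the fps field can be pulled out of the ffmpeg progress lines programmatically. This is only a minimal Python sketch, assuming the progress-line layout shown in this post; the helper name `parse_progress` is made up for illustration, and the truncated speed value of the first line is left out rather than guessed.

```python
import re

# Matches "key= value" pairs in an ffmpeg progress line, e.g. "fps= 62".
_FIELD = re.compile(r"(\w+)=\s*(\S+)")


def parse_progress(line):
    """Return the key/value pairs of one ffmpeg progress line as a dict of strings."""
    return dict(_FIELD.findall(line))


# The two progress lines quoted above (hypothetical variable names).
with_copy = parse_progress(
    "frame= 550 fps= 62 q=-0.0 Lsize=N/A time=00:00:23.38 bitrate=N/A"
)
direct = parse_progress(
    "frame= 546 fps= 22 q=-0.0 size=N/A time=00:00:23.33 bitrate=N/A speed=0.941x"
)

ratio = float(with_copy["fps"]) / float(direct["fps"])
print(f"read+write is about {ratio:.1f}x faster than read+write+direct")
```

On these two lines the ratio comes out at roughly 2.8x, which matches the fps figures reported above.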

Is there any way we can force/tweak these options? I'd like to see if there are any compatibility issues...

Edited February 28, 2019 by ken-ji

February 28, 2019

@softworkz

February 28, 2019

Could you please post the full command line, I'll need to see context.

February 28, 2019

Generally speaking, there is room for improvement, specifically regarding filter chain creation.

But there's an incredible amount of variations that we need to cover and certain things may work in one case but not in another case.

The good news is that this is an area where improvements are planned for the near future.

March 1, 2019

/bin/ffmpeg -init_hw_device vaapi=vad0:/dev/dri/renderD128 -filter_hw_device vad0 -f matroska -i file:"/mnt/user/Media/Anime Series/Mahouka Koukou no Rettousei/[Doki] Mahouka Koukou no Rettousei - 16 (1920x1080 Hi10P BD FLAC) [F97267AB].mkv" -threads 0 -map 0:1 -map 0:2 -c:v:0 h264_vaapi -copyts -filter_complex "[0:1]format=nv12|vaapi,hwupload,scale_vaapi,hwmap=mode=read+write+direct,format=nv12,subtitles='/mnt/user/Media/Anime Series/Mahouka Koukou no Rettousei/[Doki] Mahouka Koukou no Rettousei - 16 (1920x1080 Hi10P BD FLAC) [F97267AB].mkv:si=0':force_style='FontName=Droid Sans Fallback':fontsdir='/config/fonts',hwmap" -b:v:0 4451487 -maxrate 4451487 -bufsize 8902974 -profile high -level 4.1 -look_ahead 0 -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -vsync -1 -codec:a:0 aac -strict experimental -metadata:s:a:0 language=jpn -disposition:a:0 default -ac:a:0 2 -ab:a:0 192000 -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "/transcoding/transcoding-temp/c73fab6ff193645686d40416929f3d27.m3u8" -y "/transcoding/transcoding-temp/c73fab6ff193645686d40416929f3d27%d.ts"

March 1, 2019

Thanks for the line. So the hwmap is for allowing subtitle burn-in.

Actually the direct mode is meant to avoid copying frames between system and gpu memory.

Before we look into this any further: In your second example without the direct mode option, have you watched the output file and checked whether the subtitles are actually burnt into the video?

March 1, 2019

Yes, otherwise I wouldn't have reported it.

December 21, 2019

@Luke @softworkz So just wondering where we are at with tweaking this ffmpeg hwmap option, which seems to be the cause of the poor subtitle burn-in performance. I will admit my testing is limited to an Intel GPU via Quick Sync, but wouldn't it make sense to disable the direct hwmap option on Intel GPUs, or allow it to be disabled so we can see the effects?

December 21, 2019

Hi, yes there is room for improvement that we are working on. Thanks for the feedback.

December 22, 2019

What do you mean by "disable the direct hwmap"? What alternative would you suggest?

These are the possible variants I can think of:

1. Download/overlay/upload: We could download the video frames from the GPU, perform the overlay, then upload them again for encoding.
   Last time I tested (and as one would expect), this was slower than hwmap because all frames are copied from GPU to CPU memory and back to GPU memory after processing (overlay).

2. HWMAP: This avoids copying the video frames by mapping GPU memory to CPU memory. That variant is only possible when the GPU uses shared system memory (e.g. onboard graphics), where the memory is physically the same (and when the GPU supports that special mode).
   The subtitles are burnt in by the subtitles filter as if it were overlaying local video frames.

3. Have a subtitles filter that renders the text as a video of transparent images, upload that to the GPU, then do the overlay in hardware.
   As nice as that sounds, afaik the subtitles filter does not support this because it relies on having the original video frames for synchronizing the timing, so it wouldn't work when overlaying onto an empty video.
   (That approach might work for graphical subtitles, though.)

4. Create a modified subtitles filter that can work with and synchronize with hardware frames.
   Sounds easy, but will require significant work.

If you have any better idea...
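
To make the first two variants concrete, here is a rough Python sketch that only assembles the filter chains as strings. The hwmap chain mirrors the one in the command lines posted in this thread; the download/upload chain is an untested assumption about how `hwdownload` and `hwupload` might be arranged, and `hwupload` would additionally need a hardware device to be configured for the filter graph. The function name `subtitle_burn_in_chain` and the `subtitles=FILE` placeholder are hypothetical.

```python
def subtitle_burn_in_chain(variant, subtitle_filter="subtitles=FILE"):
    """Build a VAAPI filter chain string for one of the variants listed above.

    `subtitle_filter` is a placeholder; a real chain needs a complete
    subtitles=... filter with a properly escaped file path.
    """
    if variant == "download":
        # Variant 1: copy frames GPU -> CPU, burn in subtitles in software,
        # then copy the frames back to the GPU for encoding (assumed layout).
        steps = ["scale_vaapi", "hwdownload", "format=nv12",
                 subtitle_filter, "hwupload"]
    elif variant == "hwmap":
        # Variant 2: map GPU frames into CPU memory, burn in, map back.
        steps = ["scale_vaapi", "hwmap=mode=read+write+direct", "format=nv12",
                 subtitle_filter, "hwmap"]
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return ",".join(steps)


for name in ("download", "hwmap"):
    print(name, "->", subtitle_burn_in_chain(name))
```

Variants 3 and 4 are not sketched because they depend on filter capabilities that, as described above, do not exist yet.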

December 23, 2019

@softworkz I meant this actual command:

/bin/ffmpeg -loglevel +timing -y -copyts -start_at_zero -f matroska,webm -hwaccel:0 vaapi -hwaccel_device:0 /dev/dri/renderD128 -hwaccel_output_format:0 vaapi -i "/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv" -filter_complex "[0:0]scale_vaapi,hwmap=mode=read+write+direct,format=nv12,subtitles='/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv':si=0:force_style='FontName=Droid Sans Fallback':fontsdir='/config/fonts',hwmap" -map 0:0 -map 0:1 -sn -c:v:0 h264_vaapi -b:v:0 4476127 -g:v:0 72 -maxrate:v:0 4476127 -bufsize:v:0 8952254 -sc_threshold:v:0 0 -keyint_min:v:0 72 -profile:v:0 high -level:v:0 4.1 -c:a:0 aac -ab:a:0 192000 -ac:a:0 2 -metadata:s:a:0 language=jpn -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list /transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c.m3u8 -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -segment_write_temp 1 "/transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c%d.ts"

would like to be able to run it like this:

/bin/ffmpeg -loglevel +timing -y -copyts -start_at_zero -f matroska,webm -hwaccel:0 vaapi -hwaccel_device:0 /dev/dri/renderD128 -hwaccel_output_format:0 vaapi -i "/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv" -filter_complex "[0:0]scale_vaapi,hwmap=mode=read+write,format=nv12,subtitles='/mnt/user/Media/Anime/R/R-15/Show/[Kira-Fansub] R-15 - 01 (BD 1080p h264 FLAC) [2A134FA2].mkv':si=0:fontsdir='/config/fonts',hwmap" -map 0:0 -map 0:1 -sn -c:v:0 h264_vaapi -b:v:0 4476127 -g:v:0 72 -maxrate:v:0 4476127 -bufsize:v:0 8952254 -sc_threshold:v:0 0 -keyint_min:v:0 72 -profile:v:0 high -level:v:0 4.1 -c:a:0 aac -ab:a:0 192000 -ac:a:0 2 -metadata:s:a:0 language=jpn -disposition:a:0 default -max_delay 5000000 -avoid_negative_ts disabled -f segment -map_metadata -1 -map_chapters -1 -segment_format mpegts -segment_list /transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c.m3u8 -segment_list_type m3u8 -segment_time 3 -segment_start_number 0 -individual_header_trailer 0 -segment_write_temp 1 "/transcode/transcoding-temp/9ac694d4d41f324c4c78dd2383f0bb2c%d.ts"

I removed the direct option from the hwmap and left it at read+write, and the whole encode went from about 18 fps to 80+ fps. I also disabled the annoying forcing of Droid Sans Fallback as the only font. This seems to work well for my specific use case. I actually have no idea why omitting direct lets the whole transcode run faster than with direct enabled. Maybe because I'm running in a Docker container?
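
The change between the two command lines boils down to one string edit in the filter graph. As an illustration only, the Python sketch below strips the `direct` flag from any `hwmap=mode=...` entry in a filter string. The function name `drop_direct` and the shortened `subtitles=FILE` placeholder are hypothetical, and this is not how Emby itself builds its command lines.

```python
import re


def drop_direct(filter_complex):
    """Remove the `direct` flag from every hwmap mode list in a filter string."""
    def rewrite(match):
        flags = [flag for flag in match.group(1).split("+") if flag != "direct"]
        # With no flags left, fall back to a bare hwmap without a mode option.
        return "hwmap=mode=" + "+".join(flags) if flags else "hwmap"

    return re.sub(r"hwmap=mode=([a-z+]+)", rewrite, filter_complex)


before = "[0:0]scale_vaapi,hwmap=mode=read+write+direct,format=nv12,subtitles=FILE,hwmap"
print(drop_direct(before))
```

The trailing `hwmap` (the one that maps the frames back) carries no mode option, so it is left untouched.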

I guess this is the same as the HWMAP option you mentioned?

Edited December 23, 2019 by ken-ji

December 24, 2019

Thanks for your reply. This is quite interesting. Actually, the 'direct' option is meant to avoid copying and fail if that is not possible. But it's quite unexpected that it is causing a slowdown.

We will investigate and test this further on various systems and situations.

But it's a very good hint! We are about to re-work the whole hardware filter-chaining anyway, so you'll see some progress in the beta channel during the next few weeks.

Thanks a lot for pointing this out!

softworkz

December 24, 2019

What you're experiencing could very well be caused by the fact that you're running inside Docker and that it's not possible to really get direct access to the system memory.

January 22, 2020

@softworkz

I saw the Diagnostics plugin for 4.4.0.8 and tried it out

The option that interested me the most is the parameter adjustment as it allowed me to disable the direct hwmapping - which mitigates the slow performance we were talking about with the docker version.

Do we have a timeline for

* disabling the direct hwmapping for docker versions

* disabling the forced font styling of Droid Sans Fallback

I also noticed, only in this version, that when transcoding is done, the client (Emby for Fire TV 1.5.73a) still displays the soft subtitles along with the burned-in subs.

January 22, 2020

The option that interested me the most is the parameter adjustment as it allowed me to disable the direct hwmapping - which mitigates the slow performance we were talking about with the docker version.

Do we have a timeline for

* disabling the direct hwmapping for docker versions

* disabling the forced font styling of Droid Sans Fallback

Those are in fact good candidates for testing the alternatives via diagnostic options.

and I noticed only in this version, that if transcoding is done, the client (Emby for Fire TV 1.5.73a) the subtitles would still soft display along with the burned in subs.

Do you mean after you activated "force subtitle burn-in" in the diagnostic options?

January 23, 2020

Actually I didn't enable force subtitle burn-in. I only disabled "Allow subtitle extraction on the fly"

January 23, 2020

That will cause a lot of subtitle burn in so make sure you actually need that.

January 24, 2020

@Luke

Yes, I know it causes a lot of subtitle burn-in, which I kinda prefer, as a lot of the stuff I watch has ASS subtitles and I prefer the advanced formatting.

Treating ASS as simple subtitles causes awkward scenarios, like two or more people talking and only seeing one person's subs, or having a lot of small translated text on screen and seeing the whole screen filled with subtitles (no positioning or font styling). Speaking of fonts, can we make the forcing of the font to Droid Sans Fallback a configurable option (i.e. allow us to add a font file and use that, or turn the setting off altogether)?

January 24, 2020

We can't yet allow turning it off, as on some platforms it will fail. But hopefully down the line we can work those things out.

February 13, 2020

@ken-ji - Please try the latest beta (.13) - it uses hwupload and hwdownload instead of hwmap.

February 13, 2020

I've given it a try on the browser client, remotely.

though seems a lot of my stuff is now transcoded using software (probably because of the subtitles)

I was able to try watching a 4k HEVC file with image subtitles and it transcoded using hardware and seemed just as fast as hwmap:read+write

Seems like we are going the right direction.

February 13, 2020

I've given it a try on the browser client, remotely.

though seems a lot of my stuff is now transcoded using software (probably because of the subtitles)

You mean because of the browser client?

I was able to try watching a 4k HEVC file with image subtitles and it transcoded using hardware and seemed just as fast as hwmap:read+write

Seems like we are going the right direction.

What do you mean by "just as fast". Earlier you said that hwmap would be slow on Docker and that we should use hwdownload instead, which is what we're doing right now (at least temporarily for testing how that compares).

February 13, 2020

@softworkz

You mean because of the browser client?

I'm currently busy and unable to test with a client device like Roku or FireTV

What do you mean by "just as fast". Earlier you said that hwmap would be slow on Docker and that we should use hwdownload instead, which is what we're doing right now (at least temporarily for testing how that compares).

Sorry if I wasn't clear.

I meant it's running pretty fast on the few videos I tried (HW transcoding): about 60 fps, compared with the ~20 fps we got before when HW transcoding with hwmap-direct.

I mentioned before that hwmap was faster in Docker containers if you omitted the direct option, hence my answer that it was "as fast as hwmap:read+write".
