Slow hardware decode with HEVC and nVidia P4000

February 16, 2018

Hello, I am seeing an issue with HEVC decoding enabled in the transcode section: when I play a 4K video that needs to be transcoded due to bandwidth limits, I get around 19 FPS via the P4000 with 12-18% utilization. If I disable HEVC hardware decode and allow the CPU to handle it, I get around 45-50 FPS.

My goal is to offload all transcoding tasks to the P4000. I previously had a K2200, which worked great for non-4K content but did not support HEVC, hence the upgrade to the P4000.

C:\...\MediaBrowser == Junction link to my RAID SSD array

S:\...\Media Storage\ == 45 drive RAID 10 array (platter based + RAM cache backed)

T:\...\Transcode == RAID SSD array

Here is the FFMPEG command that is being executed according to the logs:

C:\Users\BLAH\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -c:v hevc_cuvid -i file:"S:\Media Storage\Media\MEDIA\Meida (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc -vf "scale=trunc(min(max(iw\,ih*dar)\,720)/2)*2:trunc(ow/dar/2)*2" -pix_fmt yuv420p -preset default -b:v 1116000 -maxrate 1116000 -bufsize 2232000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb.m3u8" -y "T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb%d.ts"

Stream mapping:

Stream #0:0 -> #0:0 (hevc (hevc_cuvid) -> h264 (h264_nvenc))
Stream #0:1 -> #0:1 (truehd (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
[segment @ 000002baa812c900] Opening 'T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb0.ts' for writing
Output #0, segment, to 'T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb%d.ts':
Metadata:
encoder : Lavf58.3.100
Stream #0:0: Video: h264 (h264_nvenc) (High), yuv420p, 720x404 [sAR 404:405 DAR 16:9], q=-1--1, 1116 kb/s, 23.98 fps, 90k tbn, 23.98 tbc
Metadata:
encoder : Lavc58.9.100 h264_nvenc
Side data:
cpb: bitrate max/min/avg: 1116000/0/1116000 buffer size: 2232000 vbv_delay: -1
Stream #0:1: Audio: mp3 (libmp3lame), 48000 Hz, stereo, fltp (24 bit), 384 kb/s (default)
Metadata:
encoder : Lavc58.9.100 libmp3lame
frame= 3 fps=0.0 q=18.0 size=N/A time=00:00:00.21 bitrate=N/A speed=0.43x
frame= 14 fps= 13 q=10.0 size=N/A time=00:00:00.72 bitrate=N/A speed=0.689x
frame= 25 fps= 16 q=10.0 size=N/A time=00:00:01.12 bitrate=N/A speed=0.713x
frame= 36 fps= 17 q=10.0 size=N/A time=00:00:01.63 bitrate=N/A speed=0.775x
frame= 47 fps= 18 q=10.0 size=N/A time=00:00:02.04 bitrate=N/A speed=0.77x
frame= 56 fps= 18 q=10.0 size=N/A time=00:00:02.44 bitrate=N/A speed=0.771x
frame= 66 fps= 18 q=10.0 size=N/A time=00:00:02.85 bitrate=N/A speed=0.776x
frame= 77 fps= 18 q=10.0 size=N/A time=00:00:03.36 bitrate=N/A speed=0.796x
frame= 87 fps= 18 q=22.0 size=N/A time=00:00:03.74 bitrate=N/A speed=0.789x
frame= 97 fps= 18 q=21.0 size=N/A time=00:00:04.15 bitrate=N/A speed=0.785x
frame= 107 fps= 18 q=26.0 size=N/A time=00:00:04.56 bitrate=N/A speed=0.785x
frame= 116 fps= 18 q=27.0 size=N/A time=00:00:04.94 bitrate=N/A speed=0.779x
frame= 126 fps= 18 q=28.0 size=N/A time=00:00:05.35 bitrate=N/A speed=0.78x
frame= 136 fps= 18 q=24.0 size=N/A time=00:00:05.76 bitrate=N/A speed=0.779x
frame= 147 fps= 19 q=23.0 size=N/A time=00:00:06.26 bitrate=N/A speed=0.789x
frame= 157 fps= 19 q=27.0 size=N/A time=00:00:06.64 bitrate=N/A speed=0.787x
frame= 167 fps= 19 q=28.0 size=N/A time=00:00:07.05 bitrate=N/A speed=0.786x
frame= 176 fps= 19 q=28.0 size=N/A time=00:00:07.44 bitrate=N/A speed=0.784x
frame= 187 fps= 19 q=29.0 size=N/A time=00:00:07.94 bitrate=N/A speed=0.792x
frame= 197 fps= 19 q=30.0 size=N/A time=00:00:08.32 bitrate=N/A speed=0.789x
frame= 207 fps= 19 q=28.0 size=N/A time=00:00:08.73 bitrate=N/A speed=0.788x
frame= 217 fps= 19 q=26.0 size=N/A time=00:00:09.14 bitrate=N/A speed=0.789x
frame= 226 fps= 19 q=24.0 size=N/A time=00:00:09.55 bitrate=N/A speed=0.787x
frame= 236 fps= 19 q=24.0 size=N/A time=00:00:09.93 bitrate=N/A speed=0.786x
frame= 246 fps= 19 q=24.0 size=N/A time=00:00:10.36 bitrate=N/A speed=0.788x
[segment @ 000002baa812c900] Opening 'T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb.m3u8.tmp' for writing
[segment @ 000002baa812c900] Opening 'T:\transcoding-temp\4095b6eda9031249089b2bba97c7cbfb1.ts' for writing
frame= 256 fps= 19 q=24.0 size=N/A time=00:00:10.75 bitrate=N/A speed=0.787x
frame= 267 fps= 19 q=24.0 size=N/A time=00:00:11.25 bitrate=N/A speed=0.793x
frame= 277 fps= 19 q=24.0 size=N/A time=00:00:11.64 bitrate=N/A speed=0.792x
frame= 286 fps= 19 q=25.0 size=N/A time=00:00:12.04 bitrate=N/A speed=0.791x
frame= 296 fps= 19 q=25.0 size=N/A time=00:00:12.45 bitrate=N/A speed=0.792x
frame= 305 fps= 19 q=24.0 size=N/A time=00:00:12.88 bitrate=N/A speed=0.794x
frame= 315 fps= 19 q=23.0 size=N/A time=00:00:13.24 bitrate=N/A speed=0.792x
frame= 325 fps= 19 q=24.0 size=N/A time=00:00:13.65 bitrate=N/A speed=0.792x
frame= 336 fps= 19 q=23.0 size=N/A time=00:00:14.16 bitrate=N/A speed=0.798x
frame= 346 fps= 19 q=24.0 size=N/A time=00:00:14.54 bitrate=N/A speed=0.796x
frame= 356 fps= 19 q=24.0 size=N/A time=00:00:14.95 bitrate=N/A speed=0.795x
frame= 365 fps= 19 q=23.0 size=N/A time=00:00:15.33 bitrate=N/A speed=0.793x
frame= 375 fps= 19 q=22.0 size=N/A time=00:00:15.74 bitrate=N/A speed=0.793x
frame= 385 fps= 19 q=15.0 size=N/A time=00:00:16.15 bitrate=N/A speed=0.792x
frame= 396 fps= 19 q=21.0 size=N/A time=00:00:16.65 bitrate=N/A speed=0.795x
frame= 406 fps= 19 q=23.0 size=N/A time=00:00:17.04 bitrate=N/A speed=0.794x
frame= 416 fps= 19 q=21.0 size=N/A time=00:00:17.44 bitrate=N/A speed=0.794x
frame= 425 fps= 19 q=23.0 size=N/A time=00:00:17.85 bitrate=N/A speed=0.795x
frame= 435 fps= 19 q=21.0 size=N/A time=00:00:18.24 bitrate=N/A speed=0.794x
frame= 445 fps= 19 q=22.0 size=N/A time=00:00:18.67 bitrate=N/A speed=0.795x
frame= 455 fps= 19 q=22.0 size=N/A time=00:00:19.05 bitrate=N/A speed=0.794x
frame= 466 fps= 19 q=22.0 size=N/A time=00:00:19.56 bitrate=N/A speed=0.798x
frame= 476 fps= 19 q=24.0 size=N/A time=00:00:19.94 bitrate=N/A speed=0.797x
frame= 485 fps= 19 q=24.0 size=N/A time=00:00:20.35 bitrate=N/A speed=0.797x
frame= 495 fps= 19 q=23.0 size=N/A time=00:00:20.76 bitrate=N/A speed=0.797x

Edited February 16, 2018 by mbnwa

February 16, 2018

Assuming drivers are up to date it's hard to say. You could try turning off hevc encoding and just use decoding, and vice versa.

February 16, 2018

Yeah, if I disable HEVC decode things work well, but then we are back to CPU-based decoding, and that's costly on my Dell R710.

Drivers are currently the latest drivers.

February 16, 2018

But it is still encoding with the gpu. Perhaps it just has some inefficiencies on the gpu side.

February 21, 2018

@Luke Here is some testing I have done, please let me know your thoughts... This simple change also roughly doubled my GPU decoder utilization: about 19% with Emby's command vs. around 35-40% with the updated command.

FFMpeg Command 0.95x speed (19-22fps) - Emby Default

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -c:v hevc_cuvid -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc -vf "scale=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2" -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf%d.ts"

frame= 128 fps= 23 q=23.0 size=N/A time=00:00:05.42 bitrate=N/A speed=0.958x

Proposed FFMpeg Command avg 2.9x speed (63-74fps)

FFMpeg seems to have an issue when using -vf "scale=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2". If you replace that string with -resize (width)x(height) on the input side, as in the example below, the speed of the transcode process (decoding / encoding) greatly increases.

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -c:v hevc_cuvid -resize 1280x720 -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bde999.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf9999%d.ts"

640x360 (same as Emby) = frame= 521 fps= 84 q=24.0 Lsize=N/A time=00:00:21.93 bitrate=N/A speed=3.53x

1280x720 = frame= 1438 fps= 73 q=28.0 size=N/A time=00:01:00.07 bitrate=N/A speed=3.06x

EDIT: This also greatly helps h264.

Emby Current Command: frame= 2352 fps=186 q=16.0 Lsize=N/A time=00:01:39.32 bitrate=N/A speed=7.86x

Modified Command (same as Emby): frame= 4003 fps=573 q=39.0 Lsize=N/A time=00:02:48.19 bitrate=N/A speed=24.1x

Modified Command 1280x720: frame= 1795 fps=327 q=49.0 Lsize=N/A time=00:01:16.09 bitrate=N/A speed=13.9x

EDIT2: It looks like you cannot use the -resize option with non-hardware decoding, as it returns the error below, so this would be for nVidia and maybe QS; however, I do not have a QS chip to check.

Codec AVOption resize (Resize (width)x(height)) specified for input file #0 (file:S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv) has not been used for any stream. The most likely reason is either wrong type (e.g. a video option with no video streams) or that it is a private option of some decoder which was not actually used for any stream.
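Since -resize is only accepted when a cuvid decoder is actually in use, a caller building the command line would need to add it conditionally. Here is a minimal Python sketch of that idea; the function name `input_decoder_args` and its parameters are hypothetical, and this is not how Emby builds its command.

```python
def input_decoder_args(decoder, resize):
    """Build the ffmpeg input-side decoder options.

    decoder: decoder name such as "hevc_cuvid", or None for software decoding.
    resize:  target size as "WIDTHxHEIGHT", or None to skip resizing.
    (Hypothetical helper for illustration only.)
    """
    if decoder is None:
        # Software decode: no decoder override, and -resize must not be passed.
        return []
    args = ["-c:v", decoder]
    # -resize is a private option of the cuvid decoders, so only add it there.
    if decoder.endswith("_cuvid") and resize:
        args += ["-resize", resize]
    return args


if __name__ == "__main__":
    print(input_decoder_args("hevc_cuvid", "1280x720"))
    print(input_decoder_args(None, "1280x720"))
```

With any other decoder the caller would keep using the -vf scale expression instead.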

EDIT3: My test is a bit skewed, as I think I used a higher resolution than Emby did, judging by the major difference in the q=XX values. I will look at the logs and post updated results below (Emby default resolution added above).

Edited February 21, 2018 by mbnwa

February 21, 2018

where did you learn about this? i don't see that in ffmpeg documentation. thanks.

February 21, 2018

I have been digging a lot after our last discussion. I was unable to use -hwaccel cuvid because of this error:

"Impossible to convert between the formats supported by the filter 'graph 0 input from stream 0:0' and the filter 'auto_scaler_0' Error reinitializing filters! Failed to inject frame into filter network: Function not implemented Error while processing the decoded data for stream #0:0"

That started me down the path of checking whether -vf "scale=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2" can be replaced with something that allows the hw decoder to work better, and I landed on this post about -resize: https://lists.ffmpeg.org/pipermail/ffmpeg-user/2017-July/036839.html

> Try something like:
>
> ffmpeg -hwaccel cuvid -c:v h264_cuvid -deint bob -resize 1280x800 -i
> foo -c:v h264_nvenc -c:a aac ./bar

February 21, 2018

are they fixed sizes or max sizes? because the advantage of vf scale is that we can feed in max sizes and it handles it for us so that we don't have a dependency on knowing the input resolution ahead of time.

February 21, 2018

I am not sure, as I also could not really find any docs on the option. If staying with vf scale, maybe substitute scale_npp; however, it does not look like Emby's build has --enable-libnpp. I'll look around and see if I can find FFMpeg already compiled with libnpp and check whether scale_npp is a direct replacement for scale.

February 21, 2018

It looks like NPP is part of non-free, so I am building FFMpeg with NPP locally. If this works, maybe you can add an option on the transcode hardware page, "Use libnpp (custom FFMpeg only)" or something of the like. It does not look like you can distribute ffmpeg with NPP built in.

Edit: Looks like I am having a heck of a time getting FFMpeg to compile with libnpp. I'll look into that later; however, I don't think it's a good solution due to the limitation of not being able to distribute it already compiled.

Edited February 21, 2018 by mbnwa

February 21, 2018

On -resize: the way I see this is that you are telling FFMpeg to resize the input to WxH, so as long as you know your output resolution when the command is executed it should work. You are doing the resize on the decoder side vs. on the encoder side; at least that's what it looks like when looking at the output files.

If -resize works as I think it does, it would be best to use that vs trying to build libnpp into the builds that have to be done locally vs distributed.

Edited February 21, 2018 by mbnwa

February 21, 2018

@Luke this is what I found on -resize from FFMpeg. It clearly looks like -resize is part of the cuvid decoder options, as I do not see a resize option with qsv.

M:\Emby\system>ffmpeg -h decoder=hevc_cuvid

built with gcc 7.2.0 (GCC)

configuration: --enable-gpl --enable-version3 --enable-sdl2 --enable-bzlib --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-amf --enable-cuda --enable-cuvid --enable-d3d11va --enable-nvenc --enable-dxva2 --enable-avisynth --enable-libmfx

libavutil 56. 7.100 / 56. 7.100

libavcodec 58. 9.100 / 58. 9.100

libavformat 58. 3.100 / 58. 3.100

libavdevice 58. 0.100 / 58. 0.100

libavfilter 7. 8.100 / 7. 8.100

libswscale 5. 0.101 / 5. 0.101

libswresample 3. 0.101 / 3. 0.101

libpostproc 55. 0.100 / 55. 0.100

Decoder hevc_cuvid [Nvidia CUVID HEVC decoder]:

General capabilities: delay

Threading capabilities: none

Supported pixel formats: cuda nv12 p010le p016le

hevc_cuvid AVOptions:

-deint <int> .D.V.... Set deinterlacing mode (from 0 to 2) (default weave)

weave .D.V.... Weave deinterlacing (do nothing)

bob .D.V.... Bob deinterlacing

adaptive .D.V.... Adaptive deinterlacing

-gpu <string> .D.V.... GPU to be used for decoding

-surfaces <int> .D.V.... Maximum surfaces to be used for decoding (from 0 to INT_MAX) (default 25)

-drop_second_field <boolean> .D.V.... Drop second field when deinterlacing (default false)

-crop <string> .D.V.... Crop (top)x(bottom)x(left)x(right)

-resize <string> .D.V.... Resize (width)x(height)

February 21, 2018

ok, it's better than nothing. it's just unfortunate it requires knowledge of a fixed size.

February 21, 2018

Fixed output size, not input. I assume the transcoding profiles have fixed sizes?

For example, a 4K video that needs to be output at 720p 1 Mbps would use the WxH for that transcoding profile.

I assume you do not want people rolling their own ffmpeg to add npp? I am currently building ffmpeg with libnpp built in to see if I can use that in place of scale. If I can, maybe you can add an advanced option to use scale_npp that people enable with a check box if they roll ffmpeg outside of Emby.

I will post back after this super long compile of FFMpeg is done with the results of libnpp

Edited February 21, 2018 by mbnwa

February 21, 2018

the client profiles have max sizes, those maxes are all we currently feed into ffmpeg.

February 21, 2018

OK, so you feed the max size into ffmpeg and it uses -vf to calculate the correct output size based on bitrate, etc.? What would happen if you fed the max size into the -resize option?

February 21, 2018

Don't know, sounds like it might force a video resize to those max sizes, which could be larger or smaller than the input.
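One way around the fixed-size concern would be to evaluate the same arithmetic as the scale expression ahead of time and pass the result to -resize. The Python sketch below assumes the input width, height, and sample aspect ratio are already known (for example from a probe of the file); the function name `cuvid_resize_arg` is hypothetical and this is only an illustration of the math, not something ffmpeg or Emby provides.

```python
from fractions import Fraction
from math import trunc


def cuvid_resize_arg(iw, ih, max_width, sar=Fraction(1, 1)):
    """Mirror scale=trunc(min(max(iw,ih*dar),MAX)/2)*2:trunc(ow/dar/2)*2.

    iw, ih:    input width and height in pixels (assumed known in advance).
    max_width: the client profile's maximum width.
    sar:       sample aspect ratio of the input (assumed 1:1 by default).
    Returns a "WIDTHxHEIGHT" string suitable for the cuvid -resize option.
    """
    dar = Fraction(iw, ih) * sar  # display aspect ratio
    # Width: never larger than max_width, rounded down to an even number.
    ow = trunc(Fraction(min(max(iw, ih * dar), max_width)) / 2) * 2
    # Height: derived from the width and DAR, rounded down to an even number.
    oh = trunc(Fraction(ow) / dar / 2) * 2
    return f"{ow}x{oh}"


if __name__ == "__main__":
    print(cuvid_resize_arg(3840, 2160, 720))
    print(cuvid_resize_arg(3840, 2160, 640))
```

For a 3840x2160 source with a 720 max width this gives 720x404, the same size ffmpeg reported in the first log of this thread, and an input smaller than the max is left at its own size rather than upscaled.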

February 21, 2018

hmm ok let's see what happens after my libnpp build is finished.

February 24, 2018

@Luke

I finished compiling FFMpeg (side note: a pain in the @$$ for sure. To include cuda-sdk in the build you have to have MS Visual C 2015, and yes, it has to be 2015 or it will not compile, plus the nVidia toolkit).

So if you want to support full hardware transcoding and keep the auto scale feature as you have above, a custom compiled version of FFMpeg would be required, and I assume it would have to be built by the user as it includes items from the "non free" category. Please let me know if you intend to support this, as my time is running out on returning the Quadro P4000. I have no need for a $800 paperweight if I cannot use it for HEVC transcoding.

In order to support this with the current version of FFMpeg that is shipped with Emby you would need to use the resize function.

OK, now to the fun part that I have spent the last 3 days working out by trial and error... To keep auto scale in place you have to compile FFMpeg with cuda-sdk (libnpp is NOT required, and it does not support 10-bit anyway...)

The following two commands would require custom compiled versions of FFMpeg due to the inclusion of --enable-cuda-sdk --enable-nonfree

The updated command in order to use autoscaler for HEVC would be the following:

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -hwaccel cuvid -c:v hevc_cuvid -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc -vf "scale_cuda=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2,hwdownload,format=p010le" -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf%d.ts"

The updated command in order to use autoscaler for H264 (based on one of my tests) would be the following:

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -hwaccel cuvid -c:v h264_cuvid -i file:"S:\Media Storage\$videos\$media_file\$media_file.m4v" -threads 0 -map 0:1 -map 0:0 -map -0:s -codec:v:0 h264_nvenc -vf "scale_cuda=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2,hwdownload,format=nv12" -pix_fmt yuv420p -preset default -b:v 674565 -maxrate 674565 -bufsize 1349130 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 copy -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\527943752781c5890fd839740c8b28db.m3u8" -y "T:\transcoding-temp\527943752781c5890fd839740c8b28db%d.ts"

In short, you have to tell the decoder to put the frames on the GPU for processing (-hwaccel cuvid), then in the -vf chain, after scale_cuda, download the frames back into system RAM with hwdownload. You MUST also provide the format of the frames that were on the GPU: in this case, for the 4K file, it was p010le, whereas for my m4v files I had to use format=nv12.

Current Emby FFMpeg can support the following -resize WxH (with the removal of -vf autoscale)
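To illustrate how the format after hwdownload could be picked per source, here is a small Python sketch that builds the filter string used in the two commands above. It assumes that sources above 8 bits per sample map to p010le and 8-bit sources map to nv12, which matches the two cases tested in this thread but is a simplification; the function name `build_cuda_filter` is hypothetical.

```python
def build_cuda_filter(max_width, bit_depth):
    """Build the -vf value for scale_cuda followed by hwdownload.

    max_width: the maximum output width fed into the scale expression.
    bit_depth: bits per sample of the source video (assumed known).
    Assumption: >8-bit sources use p010le, 8-bit sources use nv12.
    """
    fmt = "p010le" if bit_depth > 8 else "nv12"
    # Raw f-string so the backslash-escaped commas reach ffmpeg unchanged.
    scale = rf"scale_cuda=trunc(min(max(iw\,ih*dar)\,{max_width})/2)*2:trunc(ow/dar/2)*2"
    return f"{scale},hwdownload,format={fmt}"


if __name__ == "__main__":
    print(build_cuda_filter(640, 10))
    print(build_cuda_filter(640, 8))
```

The returned string would be passed as the argument to -vf, in place of the hard-coded values shown in the commands.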

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -c:v hevc_cuvid -resize 1280x720 -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc ~~-vf "scale=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2"~~ -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bde999.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf9999%d.ts"

I do have to say that performance is outstanding using this method (HEVC) even if it's a PITA to build, this method yielded the following:

HEVC scale_cuda: frame= 8428 fps=138 q=35.0 Lsize=N/A time=00:05:51.72 bitrate=N/A speed=5.74x

HEVC resize: frame= 521 fps= 84 q=24.0 Lsize=N/A time=00:00:21.93 bitrate=N/A speed=3.53x

HEVC no optimization: frame= 128 fps= 23 q=23.0 size=N/A time=00:00:05.42 bitrate=N/A speed=0.958x

H264 scale_cuda: frame= 1857 fps=598 q=27.0 Lsize=N/A time=00:01:18.69 bitrate=N/A speed=25.3x

H264 resize: frame= 2478 fps=570 q=35.0 Lsize=N/A time=00:01:44.59 bitrate=N/A speed=24.1x

H264 no optimization: frame= 1216 fps=173 q=20.0 size=N/A time=00:00:51.98 bitrate=N/A speed=7.39x
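To compare these status lines more easily, a short Python sketch like the following can pull out the fps and speed values and express each run relative to the unoptimized baseline. The helper name `parse_status` is hypothetical, and because the runs covered different clip lengths the ratios are only a rough guide.

```python
import re

# Matches the fps and speed fields of an ffmpeg progress line.
STATUS_RE = re.compile(r"fps=\s*([\d.]+).*?speed=\s*([\d.]+)x")


def parse_status(line):
    """Return (fps, speed) as floats from one ffmpeg progress line."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError(f"not an ffmpeg progress line: {line!r}")
    return float(match.group(1)), float(match.group(2))


# HEVC status lines copied from the results above.
results = {
    "HEVC scale_cuda": "frame= 8428 fps=138 q=35.0 Lsize=N/A time=00:05:51.72 bitrate=N/A speed=5.74x",
    "HEVC resize": "frame= 521 fps= 84 q=24.0 Lsize=N/A time=00:00:21.93 bitrate=N/A speed=3.53x",
    "HEVC no optimization": "frame= 128 fps= 23 q=23.0 size=N/A time=00:00:05.42 bitrate=N/A speed=0.958x",
}

if __name__ == "__main__":
    baseline_fps, _ = parse_status(results["HEVC no optimization"])
    for name, line in results.items():
        fps, speed = parse_status(line)
        print(f"{name}: {fps:.0f} fps, {speed}x realtime, "
              f"{fps / baseline_fps:.1f}x the baseline fps")
```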

Edited February 25, 2018 by mbnwa

February 24, 2018

@Luke

Some time ago I remember that you used to be able to point Emby to a custom FFMpeg directory. I cannot seem to find that option any longer; was it removed?

If you are interested in testing out the compiled version of FFMpeg I have (assuming you have nVidia hardware), let me know and I will PM you a link via DropBox.

February 24, 2018

That option caused too much troubleshooting for us. You can just replace the emby executables if you want to. I'll try to incorporate the resize option.

February 24, 2018

Thanks, I will be on the lookout for the resize option. I guess I will keep the P4000 at this point, seeing you're willing to look at the resize option as a solution for Nvidia hardware.

February 24, 2018

Well actually it looks like based on your research it should be scale_cuda instead?

February 24, 2018

@Luke

scale_cuda does work and works well, however it would require the end user to compile FFmpeg with Cuda SDK - resize works with the included version of FFmpeg that is distributed with Emby today.

I think for the highest level of compatibility using resize will be advantageous to all users as doing your own compile is not for everyone.

If you want to take on both, you could update the current Nvidia profile to support resize and add a new one that uses scale_cuda for advanced users who can replace the bundled version of FFmpeg.

I would be happy with either one or both just depends on what you want to take on.

Edited February 24, 2018 by mbnwa

February 25, 2018

@Luke I updated the post on the last page to reflect everything in a single post; to sum it up, I will post the relevant parts here.

Current Emby FFMpeg can support the following -resize WxH (with the removal of -vf autoscale)

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -c:v hevc_cuvid -resize 1280x720 -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc ~~-vf "scale=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2"~~ -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bde999.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf9999%d.ts"

The following two commands would require custom compiled versions of FFMpeg due to the inclusion of --enable-cuda-sdk --enable-nonfree

The updated command in order to use autoscaler for HEVC would be the following:

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -hwaccel cuvid -c:v hevc_cuvid -i file:"S:\Media Storage\$videos\$media_file\$media_file (4K UHD).mkv" -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_nvenc -vf "scale_cuda=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2,hwdownload,format=p010le" -pix_fmt yuv420p -preset default -b:v 416000 -maxrate 416000 -bufsize 832000 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 libmp3lame -ac 2 -ab 384000 -af "volume=2" -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf.m3u8" -y "T:\transcoding-temp\1355279e13b40fa96934820627bdeecf%d.ts"

The updated command in order to use autoscaler for H264 (based on one of my tests) would be the following:

C:\Users\$username\AppData\Roaming\MediaBrowser-Server\System\ffmpeg.exe -hwaccel cuvid -c:v h264_cuvid -i file:"S:\Media Storage\$videos\$media_file\$media_file.m4v" -threads 0 -map 0:1 -map 0:0 -map -0:s -codec:v:0 h264_nvenc -vf "scale_cuda=trunc(min(max(iw\,ih*dar)\,640)/2)*2:trunc(ow/dar/2)*2,hwdownload,format=nv12" -pix_fmt yuv420p -preset default -b:v 674565 -maxrate 674565 -bufsize 1349130 -profile:v high -force_key_frames "expr:if(isnan(prev_forced_t),eq(t,t),gte(t,prev_forced_t+3))" -copyts -vsync -1 -codec:a:0 copy -f segment -max_delay 5000000 -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 -start_at_zero -segment_time 3 -individual_header_trailer 0 -segment_format mpegts -segment_list_type m3u8 -segment_start_number 0 -segment_list "T:\transcoding-temp\527943752781c5890fd839740c8b28db.m3u8" -y "T:\transcoding-temp\527943752781c5890fd839740c8b28db%d.ts"

In short, you have to tell the decoder to put the frames on the GPU for processing (-hwaccel cuvid), then in the -vf chain, after scale_cuda, download the frames back into system RAM with hwdownload. You MUST also provide the format of the frames that were on the GPU: in this case, for the 4K file, it was p010le, whereas for my m4v files I had to use format=nv12 (the format, I assume, will be dynamic based on the source video).

I do have to say that performance is outstanding using this method (HEVC) even if it's a PITA to build, this method yielded the following:

HEVC scale_cuda: frame= 8428 fps=138 q=35.0 Lsize=N/A time=00:05:51.72 bitrate=N/A speed=5.74x

HEVC resize: frame= 521 fps= 84 q=24.0 Lsize=N/A time=00:00:21.93 bitrate=N/A speed=3.53x

HEVC no optimization: frame= 128 fps= 23 q=23.0 size=N/A time=00:00:05.42 bitrate=N/A speed=0.958x

H264 scale_cuda: frame= 1857 fps=598 q=27.0 Lsize=N/A time=00:01:18.69 bitrate=N/A speed=25.3x

H264 resize: frame= 2478 fps=570 q=35.0 Lsize=N/A time=00:01:44.59 bitrate=N/A speed=24.1x

H264 no optimization: frame= 1216 fps=173 q=20.0 size=N/A time=00:00:51.98 bitrate=N/A speed=7.39x

Edited February 25, 2018 by mbnwa
