GPU Transcoding (Intel QuickSync and nVidia NVENC)


witteschnitte


aptalca

I can confirm that for nvenc in linux, one has to install the proprietary nvidia drivers as well as cuda, which comes out to be a 2.4GB install. Plus, ffmpeg needs to be compiled with nvenc support.
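For anyone curious, the build step ends up looking roughly like this (a minimal sketch; exact flag names vary by ffmpeg version, and older releases also needed --enable-nonfree for nvenc):

# sketch only: assumes the nvidia proprietary driver and the CUDA toolkit are already installed
./configure --enable-gpl --enable-nvenc --enable-cuda --enable-cuvid
make -j"$(nproc)" && sudo make install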

 

Then, you have the 2 simultaneous transcode limitation (some higher end and pro cards have higher limits)

 

Nvidia for hw transcode is more hassle than it's worth. Amd is also a pain. Intel is the easiest to work with, but requires a cpu with a built in gpu, which may also require a new mobo and new (ddr4) ram for recent gen if upgrading.

 

For that purpose, I just got a pentium G4600 (hd 630) with an asrock 250m and some ddr4 ram.


Gerrit507

I can confirm that for nvenc in linux, one has to install the proprietary nvidia drivers as well as cuda, which comes out to be a 2.4GB install. Plus, ffmpeg needs to be compiled with nvenc support.

Then, you have the 2 simultaneous transcode limitation (some higher end and pro cards have higher limits)

Nvidia for hw transcode is more hassle than it's worth. Amd is also a pain. Intel is the easiest to work with, but requires a cpu with a built in gpu, which may also require a new mobo and new (ddr4) ram for recent gen if upgrading.

For that purpose, I just got a pentium G4600 (hd 630) with an asrock 250m and some ddr4 ram.

And does quicksync work for you? If so, which OS do you use?


mbze430

I can confirm that for nvenc in linux, one has to install the proprietary nvidia drivers as well as cuda, which comes out to be a 2.4GB install. Plus, ffmpeg needs to be compiled with nvenc support.

 

Then, you have the 2 simultaneous transcode limitation (some higher end and pro cards have higher limits)

 

Nvidia for hw transcode is more hassle than it's worth. Amd is also a pain. Intel is the easiest to work with, but requires a cpu with a built in gpu, which may also require a new mobo and new (ddr4) ram for recent gen if upgrading.

 

For that purpose, I just got a pentium G4600 (hd 630) with an asrock 250m and some ddr4 ram.

The Nvidia SDK (https://developer.nvidia.com/nvidia-video-codec-sdk) states:

If you are looking to make use of the dedicated decoding/encoding hardware on your GPU in an existing application you can leverage the integration already available in the FFmpeg/libav. FFmpeg/libav should be used for evaluation or quick integration, but it may not provide control over every encoder parameter. NVDECODE and NVENCODE APIs should be used for low-level granular control over various encode/decode parameters and if you want to directly tap into the hardware decoder/encoder. This access is available through the Video Codec SDK.

 

So ffmpeg should already have Nvidia support built in?

 

https://developer.nvidia.com/ffmpeg


Gerrit507

Well, Pascal cards are reportedly capable, but I'm not sure about the ffmpeg side of things (at least for hardware en/decoding).  Can you post just the ffmpeg command line from a transcode log?  I'm curious to see how Emby is telling ffmpeg to handle the colourspace conversion.  I have a BT.2020 sample file here, so I'll poke at it when I get a chance (within the limits of my 970).  A quick-and-very-basic test transcoded the BT.2020 source to a BT.709/h.264 file just fine when specifically told to do so using the appropriate _nvenc encoder (doing the same for a BT.709/HEVC file worked fine as well).  It's not using hardware decoding during the process, though...again, due to me having a 970 *sigh*.  Easy enough to test adding hardware decoding manually if you want to, though.  You just have to specify the pix_fmt and colorspace for the output for a basic colorspace conversion test (yuv420p and bt709 are well supported by players/monitors).

 

Not sure why you're having playback issues, though, as BT.2020 isn't really that new.  MPC-HC should decode it just fine, since it uses LAV...I can't speak to whether or not it'll use hardware to decode, though, when doing so since my card can't do hardware decoding of Main10 files.  You can try installing a more recent LAV filters version and make sure that MPC-HC or -BE is using them (uncheck the internal ones if they're not getting it done).  My sample plays fine locally with MPC-HC latest and madVR/latest LAV version being used in place of MPC-HC's internal LAV (software decoded, obviously).

 

I've downloaded the two HDR test files from here:

http://www.4ktv.de/testvideos/

They play fine in MPC and Windows Player, but transcoding in emby is very slow. I think the issue is on the encoding side. One of the test files doesn't transcode at all. I've attached the logs. Thank you :)

hdr1.txt

hdr2.txt


aptalca

And does quicksync work for you? If so, which OS do you use?

I just installed it the other day and so far I only tried it with the plex docker (ubuntu xenial) on unraid (Slackware) and it worked just fine for both encode and decode.

 

Emby docker doesn't support hw transcode yet so I didn't try it.

 

I'll try to build ffmpeg with hw support inside the emby container, but I'm not very competent in opensuse. Maybe I'll build a custom emby docker based on ubuntu
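Either way, the iGPU still has to be passed into the container. A rough sketch (the image name and config path here are just placeholders):

# expose the Intel render node to the container so vaapi/quicksync can reach /dev/dri
docker run -d --device /dev/dri:/dev/dri -v /path/to/config:/config emby/embyserver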


aptalca

The Nvidia SDK (https://developer.nvidia.com/nvidia-video-codec-sdk) states:

If you are looking to make use of the dedicated decoding/encoding hardware on your GPU in an existing application you can leverage the integration already available in the FFmpeg/libav. FFmpeg/libav should be used for evaluation or quick integration, but it may not provide control over every encoder parameter. NVDECODE and NVENCODE APIs should be used for low-level granular control over various encode/decode parameters and if you want to directly tap into the hardware decoder/encoder. This access is available through the Video Codec SDK.

 

So ffmpeg should already have Nvidia support built in?

 

https://developer.nvidia.com/ffmpeg

My understanding is that yes, ffmpeg supports nvidia hw acceleration for both encode and decode, but ffmpeg must be built with that functionality enabled. I heard that you can't build statically with that support, so you can't just download an ffmpeg with it built in; you have to build it yourself.
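A quick sanity check for whether a given binary has it enabled (this just lists the compiled-in encoders, no hardware needed):

ffmpeg -encoders | grep -i nvenc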

Gerrit507

I just installed it the other day and so far I only tried it with the plex docker (ubuntu xenial) on unraid (Slackware) and it worked just fine for both encode and decode.

 

Emby docker doesn't support hw transcode yet so I didn't try it.

 

I'll try to build ffmpeg with hw support inside the emby container, but I'm not very competent in opensuse. Maybe I'll build a custom emby docker based on ubuntu

Thank you, that's great! One more question: Which ffmpeg build did you use? The one that comes with emby? I'm still wondering why QS won't decode on Windows...

 

I've just tried it on an Ubuntu 16.04 server and it did hardware encode using libx264 but not decode. To use the *_qsv codecs on Linux, the Intel Media SDK is needed, and HEVC is only in the paid version. Which codec did it actually use on your machine? I also tried vaapi, with no success either.


Waldonnis

My understanding is that yes, ffmpeg supports nvidia hw acceleration for both encode and decode, but ffmpeg must be built with that functionality enabled. I heard that you can't build statically with that support, so you can't just download an ffmpeg with it built in; you have to build it yourself.

 

You can download a static build of ffmpeg that includes support for nvenc/qsv, but trying to use that functionality without the appropriate video drivers installed will fail.  IIRC, the Zeranoe static builds all include nvenc/qsv support (Windows-only), for example, and I've used my own ffmpeg builds on other machines without issue.
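To see which hwaccel methods a particular binary was compiled with (independent of whether the drivers are actually present), something like this works:

ffmpeg -hwaccels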

 

In the past, NVENC support required a header file that wasn't free to distribute, so building locally was the only real answer as you couldn't redistribute a compiled binary that was built using it.  That header was relicensed by nVidia a year or so ago, so that restriction's gone.


aptalca

John Vansickle, the guy who does the Linux static builds for ffmpeg, wrote on git that nvenc and vaapi can't be done in static builds, but I'm no expert in that so I can't comment on the reasons.


aptalca

Thank you, that's great! One more question: Which ffmpeg build did you use? The one that comes with emby? I'm still wondering why QS won't decode on Windows...

 

I've just tried it on an Ubuntu 16.04 server and it did hardware encode using libx264 but not decode. To use the *_qsv codecs on Linux, the Intel Media SDK is needed, and HEVC is only in the paid version. Which codec did it actually use on your machine? I also tried vaapi, with no success either.

 

At the time of my previous post, I had only tried it in plex and it had worked for both decode and encode. They have the intel drivers in their docker, as well as their own custom build of ffmpeg.

 

I just built a test docker of emby based on ubuntu 16 and gave that a try. I installed the intel drivers in there and compiled ffmpeg with vaapi enabled. Like you, I got the encode accelerated with h264_vaapi, but the decode used good old h264 (native). 
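For comparison, the full-hardware path from the ffmpeg VAAPI wiki (decode and encode both on the GPU) looks roughly like this, assuming the usual render node path:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mkv -c:v h264_vaapi -b:v 4M output.mkv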

 

Here's the transcode log I have where you can also see the ffmpeg build flags: https://pastebin.com/K0FbUEYg

 

PS. For anyone else wondering, this is a docker container running on unraid (slackware based) and the cpu/gpu is Pentium G4600 (hd 630).


At the time of my previous post, I had only tried it in plex and it had worked for both decode and encode. They have the intel drivers in their docker, as well as their own custom build of ffmpeg.

 

I just built a test docker of emby based on ubuntu 16 and gave that a try. I installed the intel drivers in there and compiled ffmpeg with vaapi enabled. Like you, I got the encode accelerated with h264_vaapi, but the decode used good old h264 (native). 

 

Here's the transcode log I have where you can also see the ffmpeg build flags: https://pastebin.com/K0FbUEYg

 

PS. For anyone else wondering, this is a docker container running on unraid (slackware based) and the cpu/gpu is Pentium G4600 (hd 630).

 

Why do you think the decode used the cpu?


Gerrit507

John Vansickle, the guy who does the Linux static builds for ffmpeg, wrote on git that nvenc and vaapi can't be done in static builds, but I'm no expert in that so I can't comment on the reasons.

 

It could be done in static builds but they were not allowed to be distributed because of license reasons. Nvidia changed the license and it's now available in almost every ffmpeg build I've found. When I built ffmpeg yesterday it was enabled by default.
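A quick way to verify the resulting binary can actually reach the hardware (not just that the code is compiled in) is a short test encode to nowhere; this does need the nvidia driver installed:

ffmpeg -i input.mkv -t 10 -c:v h264_nvenc -f null -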

 

 

At the time of my previous post, I had only tried it in plex and it had worked for both decode and encode. They have the intel drivers in their docker, as well as their own custom build of ffmpeg.

 

I just built a test docker of emby based on ubuntu 16 and gave that a try. I installed the intel drivers in there and compiled ffmpeg with vaapi enabled. Like you, I got the encode accelerated with h264_vaapi, but the decode used good old h264 (native). 

 

Here's the transcode log I have where you can also see the ffmpeg build flags: https://pastebin.com/K0FbUEYg

 

PS. For anyone else wondering, this is a docker container running on unraid (slackware based) and the cpu/gpu is Pentium G4600 (hd 630).

 

Welcome to the club :D

 

I've also built ffmpeg with vaapi enabled on Ubuntu 16.04. In emby it also shows me h264 (native) -> h264_vaapi; I couldn't test h265 yet, as I'll have to switch the mainboard first.

 

When I run

ffmpeg -decoders | grep -i vaapi

I get no codecs at all.

 

Only 

ffmpeg -encoders | grep -i vaapi

shows me the right codecs.

 

Nevertheless, when emby transcodes, the cpu usage is exceptionally low. I just can't say for sure if it's really using full hw transcoding. I guess trying h265 on my Apollo Lake Mainboard will clear this up...
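One way to check it really is the GPU doing the work, rather than guessing from cpu numbers, is to watch the iGPU's video engine while a transcode runs, e.g. with intel_gpu_top from the intel-gpu-tools package (the exact row name varies by version):

sudo intel_gpu_top
# the video/bitstream engine row should show load during a hw transcode and stay idle during a software one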

 

I also want to note that it's really hard to find any useful information on the web about this topic. I've found inconsistent information: some say it can't be used for transcoding, others state the opposite...

 

Why do you think the decode used the cpu?

 

Do you know if it's normal that vaapi is only listed as encoder?


Kekskruemel

Does emby support 4k hevc 12-bit transcode with an nvidia 10xx card?

 

Thx & Greetings


Does emby support 4k hevc 12-bit transcode with an nvidia 10xx card?

 

Thx & Greetings

 

If the card supports it and if ffmpeg supports it.


Kekskruemel

The card supports 4k hevc 12-bit decode, but the ffmpeg changelog only talks about general hevc support. It's hard to find any info about nvidia 12-bit support :/


Kekskruemel

You can compile ffmpeg for both, but it's a serious pain to do so, which is probably why none of the mainstream builds have bothered.  Mostly, that applies to the software encoders, though, as ffmpeg itself doesn't care.  The "libx264" and "libx265" encoders are actually library versions of the x264 and x265 stand-alone encoders, and both definitely require special compilation for 10 and 12-bit support.  What I'm not sure about is how NVENC's interface to ffmpeg is implemented, but it likely shouldn't need special compilation - just that the code is in ffmpeg to talk to the driver properly when it comes to new options.

 

yuv420p10le may indeed be an issue, as I don't think that part of the patch was accepted (it worked, but there were some technical implementation reasons that caused it to be rejected).  It may have since been worked on, but I would have to dredge the mailing list and commit logs again to find out.  If it's still not supported, it sucks, since yuv420p10le is pretty widely used.

 

If you want to double-check that the GPU was used in your successful test, look for the ffmpeg command line in the transcode log and see if you can find any reference to NVENC in there (either a -vcodec or -codec option...or maybe something in the stats output that mentions nvenc).  "Speed" isn't a bad indicator as well, since CPU/software transcoding would be significantly slower and pretty obvious (especially if the source was HEVC 2160p).

 

I do compile my own ffmpeg with 10/12bit support locally, but can't redistribute it due to license restrictions on one of the audio encoders that I include in my build.  If the problem isn't only with the yuv420p10le pixel format, I may be able to prune out the audio encoder and build a binary to test with, but it may take a few days for me to get that set up and done (stupid holidays).

 

Oh, one other thing - can you run the command below and PM the output to me?  I'm mostly curious to see what I would have to work with and what's been implemented so far if I snag a new Pascal card this month, but am also trying to think of a way to work around the apparent limitation (more info is always a good thing):

 

ffmpeg -h encoder="nvenc_hevc"

Do you have a working 4k hevc 12-bit transcode setup, or did I misread something?

 

Thx & Greetings


aptalca

I'm not sure about that, but according to ffmpeg docs it looks like we're using it correctly to achieve decoding:

https://trac.ffmpeg.org/wiki/Hardware/VAAPI

 

You're right, the ffmpeg options used by emby seem to be correct. My assumption was based on the line @Gerrit507 also mentioned:

Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_vaapi))

I assumed that the left side was decode and the right side was encode, but I guess my assumption was incorrect. I did some manual tests with the ffmpeg I compiled and here are my results:

 

First test: Hardware decode only, stream is then discarded

./ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mkv -f null -

Here's the log: https://pastebin.com/ULVvYEHe

Notice that line 160 lists vaapi_vld, which I assume is the vaapi decoder.

 

Speed was around 100x and the cpu utilization was an average of 20% on all 4 threads

 

Second test: Software decode only, stream is then discarded

./ffmpeg -i input.mkv -f null -

Here's the log: https://pastebin.com/K9b9n9RT

Line 155 does not list vaapi_vld, but yuv420p instead

 

Speed was around 24x but all cpu threads were maxed out

 

So, I guess my ffmpeg build does decode properly, and emby seems to use it properly as well. The only thing that still confuses me is the cpu utilization. When emby is transcoding, the cpu util is around 20% average on all cores, which seems to be high. Transcoding the same file in plex, cpu util is around 5% and even when doing 4 simultaneous transcodes, it is still around 5-10%. Is it perhaps that emby transcodes ahead?

 

I'll do some more testing on that.

 

EDIT: Now I'm getting the same high cpu utilization in plex. I'm not sure what's going on. Perhaps a server restart is in order. This stuff is so confusing.


Waldonnis

Do you have a working 4k hevc 12-bit transcode setup, or did I misread something?

 

Thx & Greetings

 

I don't, no, at least not with hardware encoding/decoding, which is why I was asking a lot of questions of Pascal owners around that time.  I have a GTX-970 in my system so hardware encoding HEVC Main10 is impossible, and it has no HEVC hardware decoder at all either.

 

I've done a fair number of 10/12-bit HEVC software encodes with x265 otherwise, though, which is why that quote talks about building the software encoders with 10/12-bit support; it's not the default configuration for them.
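For reference, a 10-bit software encode with such a build looks roughly like this (a sketch, assuming libx265 was compiled with 10-bit support enabled):

ffmpeg -i input.mkv -c:v libx265 -pix_fmt yuv420p10le -crf 20 -preset medium -c:a copy output.mkv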


Waldonnis

You're right, the ffmpeg options used by emby seem to be correct. My assumption was based on the line @Gerrit507 also mentioned:

Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_vaapi))
I assumed that the left side was decode and the right side was encode, but I guess my assumption was incorrect.

 

 

The meaning of "(native)" is a bit vague when dealing with hwaccels, which makes it really hard to tell what's going on at normal log levels.

 

Looking at your decoding tests, ffmpeg was transcoding the audio and discarding it as well (simple PCM conversion, but the ac3 stream still had to be decoded/mixed).  Not sure if it was intentional, as it would've introduced more CPU load into what appeared to be a video decoder test, so I figured I'd point it out.
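A cleaner decode-only measurement would just drop the audio entirely, e.g.:

./ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mkv -an -f null -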

 

As for processor use and Plex...

 

Plex seems to handle transcoding decisions a little differently in my experience, and even uses a custom version of ffmpeg to do so.  Without knowing what they've done to it or what it's transcoding from/to, I couldn't even begin to explain the reason for the difference.  I doubt they've modified it too heavily, but hardcoded profiles, presets, or output settings may explain some of that.


aptalca

Thanks @Waldonnis 

 

Yeah, ffmpeg terminology is very confusing and their documentation is pretty lacking.

 

I do realize the transcodes are using software for the audio, which eats up cpu cycles. I didn't bother taking audio out of the equation when I did the decode comparison, since the audio decode would have the same effect on both the test and the control.

 

I did some more side by side tests with plex and emby and the cpu utilization is comparable. I honestly don't know what I did when I saw 5-10% cpu utilization with 4 transcodes in plex, so it must have been a fluke (or maybe I was looking at the other server?!? not sure).

 

One thing I noticed is that the more transcode sessions I have, the less cpu utilization each session has. Perhaps due to the processor (cpu/gpu) frequency going up? Not sure. Looking at htop, a single transcode session sits at about 70-80% cpu utilization (a single thread out of 4), but when I go up to 5 simultaneous transcodes, each session uses about 35-50%, so they scale nicely, adding up to about 60-70% average across all threads.

 

It is very nice to see a cheap pentium handling multiple simultaneous transcodes, rivaling my trusted but aging Xeon E5-2670v1.


aptalca

I know I created a lot of noise in this thread today, but this is hopefully the last entry :-)

 

As a final test, I compared cpu utilization between a hw video transcode (sw audio) and a direct stream where only the audio is transcoded (ac3 to stereo aac) and the cpu utilization between the two scenarios is pretty much the same. And I did this in both plex and emby with the same results.

 

I can finally conclude that emby is fully utilizing hardware transcoding (vaapi) for both encode and decode, and the cpu utilization from that is negligible. Audio transcoding and remuxing, on the other hand, have a significant impact on the cpu.

 

For reference, this is all done on a Pentium G4600 (hd 630) kaby lake cpu, in a docker container based on ubuntu 16 running on unraid. Inside the docker, I compiled ffmpeg according to these instructions: https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu except I also added "--enable-vaapi" to the ffmpeg configure flags in the last step.
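Roughly, the last configure step then looks something like this (a sketch only; the rest of the flags are whatever the guide's current version lists):

./configure --enable-gpl --enable-libx264 --enable-vaapi
make -j"$(nproc)"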

 

Thanks Emby team. This is working great. I think the next logical step would be to include an ffmpeg build with hw transcode support in the official docker (I tried to build it but opensuse wasn't too friendly to me).

 

 

EDIT: I promised no more messages, so I'm appending this. I realized that with emby's implementation, ffmpeg is definitely transcoding ahead all the way. I started a transcode session; this time the video was transcoded but the audio was direct. At first I was surprised that the cpu utilization was the same, 80% on a single core. But a couple of minutes later, I noticed that the ffmpeg process disappeared from htop. I kept skipping ahead, and still no ffmpeg process. I skipped all the way to the end before I realized that ffmpeg had already transcoded the rest of the file, so the cpu utilization was down to zero with no ffmpeg process left. And sure enough, once I skipped back to an earlier point in playback, ffmpeg fired right up. My guess is that when I saw 5-10% cpu utilization the other day with 4 transcodes, I must have had the streams going long enough that they had all finished transcoding the rest of the media.

 

Conclusions:

1) Don't pay attention to cpu utilization  ;)

2) Trust emby and ffmpeg to do their job

3) Hw transcode is awesome

Thanks again, Emby team

