lorac 118 Posted April 6, 2017 Not that I can help a lot, but post some transcoding logs to start. That will help others help you! I'm surprised Luke hasn't given you his standard 'how to report a problem' link. Lol Sent from my STV100-3 using Tapatalk
bcm00re 18 Posted April 6, 2017 I did post some logs (in late March) in a thread I created in the Live TV area, but it's not getting any traction. I came here trying to educate myself on getting hardware encoding set up properly, but I am still having problems. Sometimes I can get live 1280x720 channels to work, but I almost never get live 1920x1080 channels to work. I have gotten a previously recorded 1080 program to hardware transcode, but all the artifacts and hiccups make it unwatchable. My i7-6700 processor with integrated graphics supports QSV, and my video card supports NVENC. I'll attempt to hardware encode with the same ffmpeg using something besides Emby and see what happens.
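For anyone wanting to run the same kind of standalone test, a minimal QSV encode sketch might look like the line below. This is only an illustration: the filenames and bitrate are placeholders, and it assumes the ffmpeg build includes the h264_qsv encoder.

ffmpeg -i input.mkv -c:v h264_qsv -preset veryfast -b:v 4M -c:a copy output.mkv

If this runs cleanly outside Emby but the same build fails inside it, the problem is more likely the arguments Emby generates than the driver stack.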
lorac 118 Posted April 6, 2017 I've never used Live TV so I don't know much about that. What about using previously recorded / downloaded shows? Even without hardware decoding you shouldn't have any issues with that CPU, I'd think, unless you had multiple streams. Sent from my STV100-3 using Tapatalk
bcm00re 18 Posted April 6, 2017 As I said, I have gotten a previously recorded 1920x1080 program to hardware transcode, but all the artifacts and hiccups made it unwatchable. Maybe that's because it was sports (NCAA basketball final)? I'm doing an ffmpeg transcode (using Quick Sync hardware) right now, and it seems to be working. If anyone is using hardware encoding with Emby and getting better results, please let me know; I would like to determine what I am doing differently or wrong. Thanks...
lifespeed 42 Posted April 6, 2017 How does one know if hardware encoding is successfully utilized? Is it just a change in CPU load?
bcm00re 18 Posted April 6, 2017 The change in CPU utilization is usually a good clue, but one can also search for "qsv" (or "nvenc") in the Emby logs.
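On Windows, for example, something like the following could confirm whether a hardware encoder was actually selected. The log folder path below is only a guess and varies by install, so treat it as a placeholder:

findstr /i "qsv nvenc" "C:\ProgramData\Emby-Server\logs\ffmpeg-transcode-*.txt"

findstr with space-separated strings matches lines containing either word, so one search covers both encoders.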
lifespeed 42 Posted April 7, 2017 I tried NVENC both with the 2016-04-10 version of FFmpeg (installed with Emby?) and the 3.2.4 Zeranoe version, using the latest nVidia drivers with an i7-6800K running Win 10 64-bit and a GTX 770 video card. It didn't seem to work: I got a black screen, audio, and an incrementing timer. This was with a 1080p h264 MKV transcoding to 720p 3Mbps. Logs are attached below. I guess this feature is in development, but some have had success. Despite the powerful CPU, transcoding does use significant resources; it would be cool to offload it to the graphics card if it could be made to work. This CPU has no integrated graphics, so Intel Quick Sync isn't an option. Any suggestions on what I might need to check? server-63627120000.txt ffmpeg-transcode-b2e7527b-b710-4807-8c8a-5c46d160ad8b.txt
Luke 42077 Posted April 7, 2017 I tried NVENC both with the 2016-04-10 version of FFmpeg (installed with Emby?) and the 3.2.4 Zeranoe version, using the latest nVidia drivers with an i7-6800K running Win 10 64-bit and a GTX 770 video card. It didn't seem to work: I got a black screen, audio, and an incrementing timer. This was with a 1080p h264 MKV transcoding to 720p 3Mbps. Logs are attached below. I guess this feature is in development, but some have had success. Despite the powerful CPU, transcoding does use significant resources; it would be cool to offload it to the graphics card if it could be made to work. This CPU has no integrated graphics, so Intel Quick Sync isn't an option. Any suggestions on what I might need to check? It was used here successfully for the encoding portion of the transcoding process, but not the decoding. We are not currently doing decoding with NVENC.
lifespeed 42 Posted April 7, 2017 It was used here successfully for the encoding portion of the transcoding process, but not the decoding. We are not currently doing decoding with NVENC. I don't understand why FFmpeg would try to use NVENC to decode if you are not commanding it to do so. Are you saying I have the configuration wrong?
Luke 42077 Posted April 7, 2017 Sorry, I misread. Disregard what I said. That transcode log looks fine. I think you probably have some kind of driver or other environmental issue preventing it from working.
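One way to rule Emby out of such a case would be to run the same build's NVENC encoder by hand against a known-good file. The sketch below uses placeholder names and assumes the build was compiled with h264_nvenc; the null output means nothing is written to disk:

ffmpeg -i input.mkv -c:v h264_nvenc -preset fast -b:v 3M -an -t 30 -f null NUL

If the driver or CUDA setup is broken, this should fail immediately with an NVENC session error rather than silently producing a black picture.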
RanmaCanada 494 Posted April 19, 2017 Oh no, it's not an Emby issue: ffmpeg and/or CUDA can't handle more than 2 NVENC threads. Like @@Gerrit507 proposed in post #982, Emby should perform a CPU fallback when this error appears. I thought you implemented it in beta. Actually, if you are using a Quadro there is a different limit to the number of NVENC threads you can run. The 2-thread limit per machine only applies to consumer cards. Quadros have a limit of 6 threads, but it is also based on available memory, etc. https://devtalk.nvidia.com/default/topic/800942/session-count-limitation-for-nvenc-no-maxwell-gpus-with-2-nevenc-sessions-/
RanmaCanada 494 Posted April 19, 2017 I didn't read this entire thread, so I apologize if it has already been mentioned, but I am looking for a real-world example of what VAAPI transcoding can do with a mid to high end GPU. My understanding is that VAAPI isn't limited to a certain number of simultaneous encodes like NVENC. So what is the limiting factor? Number of stream processors? Overall memory transfer rate? Amount of memory? I am looking for something that can process approximately 10 simultaneous 1080p streams; is this even possible?

Honestly, your best bet is to make sure your clients all have boxes that can direct play anything. I would recommend the Beelink Mini M8S II with Xannytechs firmware from Freaktab. If you want to buy direct from Amazon instead of GearBest, grab this box: https://www.amazon.ca/Beelink-Mini-MXIII-Cortex-A53-Penta-core/dp/B01KH6GE88/ref=sr_1_2?ie=UTF8&qid=1492577257&sr=8-2&keywords=beelink It will also work with Xannytechs firmware and with several other firmwares on Freaktab, even several Android TV firmwares. Set the box up, install SPMC, FTMC, or Kodi, and use it as an external player through Emby. Everything will direct play, and you won't have to worry about transcodes anymore. Well, unless you don't have the bandwidth for 10 simultaneous streams; the most I've had at one time is 7. Trust me, I have several of these, and everything, even TrueHD and Atmos, plays perfectly fine through FTMC/SPMC when it won't play through the Emby apps.

Your other options are to buy a new system based on Ryzen, or pick up a super high end Quadro card. You'd probably be better off waiting to see what AMD brings to the server market this summer. Ryzen already destroyed Intel in productivity and encoding, so hopefully their server chips will do the same, with similar discounts.
jscoys 147 Posted April 20, 2017 (edited) Thanks for the feedback. Hello my dear @@Luke. I'm a little bit lost as to how you want us to proceed and how we can help you get better HW support into Emby. Your answer was referring to some observations I made earlier ( https://r.tapatalk.com/shareLink?share_fid=77624&share_tid=10723&share_pid=434743&url=https://emby.media/community/index.php?/topic/10723-GPU-Transcoding-%28Intel-QuickSync-and-nVidia-NVENC%29/page__view__findpost__p__434743&share_type=t ) on how NVENC was working with the FFmpeg build from Zeranoe. Since then, have you made any changes? Do you want me to test again? Sent from my iPhone using Tapatalk Edited April 20, 2017 by jscoys
Waldonnis 148 Posted April 20, 2017 Reminds me, I wrote a quick "testsuite" for my friend's new computer (Kaby Lake, GTX 1060, Windows) to compare encoder and decoder performance/capabilities, as well as explore the "in practice" difference between -hwaccel and specifying a decoder with -c:v. While testing it locally for hardware decoding, I noticed something interesting: specifying a codec for decoding yielded the same results as specifying a hwaccel. In some cases, hwaccel actually worked better (using dxva2) due to it selecting the higher-performing decoder - in my case, h264_cuvid decoding is significantly faster on my GTX 970 than QSV decoding is on my IB i5 (tested by decoding, then "transcoding" to rawvideo with null output). Functionally, the following two commands were equivalent in decoding performance:

ffmpeg -hwaccel dxva2 -i ...
ffmpeg -c:v h264_cuvid -i ...

What I liked about the DXVA2/hwaccel solution was that it's not reliant on specifying a particular vendor's decoder/codec. The downside, however, is that it's Windows-only and doesn't let you "prefer" one decoder over another easily if multiple devices support the same codec (there are ways, but it's a pain). It also predictably spat out a bunch of errors when no supporting hardware decoder could be found for a particular codec. Using a -hwaccel option with qsv or cuvid yielded identical results/performance to commands where I specified the codec as well.

I also ran tests with -hwaccel auto, which was the best of both worlds, so to speak. It used the CPU when no supporting decoder could be found (unlike dxva2 or either vendor-specific option) and used a hardware decoder when supported/available. In my case, it used dxva2 for h.264, and the CPU for HEVC Main/Main10 (I have no hardware decoder for HEVC). Since there's a hwaccel for vaapi, I'd assume auto would recognise that as well, making it more of a platform-agnostic option (I have not tested this on my Linux system personally, so I can't verify).

I'm waiting for a result set from my friend and will double-check all of my observations, but I'm wondering if using -hwaccel isn't just a better/more flexible option than -c:v, especially given situations like varying profile support for codecs like HEVC in different generations of hardware. Since we've seen some wrinkles in this thread with specifying decoders with -c:v, it may be worth exploring something like "-hwaccel auto" since it seems to be more generic and forgiving. Also, it could actually be used in the absence of any hardware decoding, since it just falls back to CPU anyway in my tests - meaning it could be its own option that's independent of hardware encoding. I can supply ffmpeg reports for each if needed or desired, but the slightly pared-down command I was using to test with is below (uses the Windows null output descriptor - change if using Linux; the only things I added to it were report/logging-related):

ffmpeg -hwaccel auto -i source.mkv -map 0:0 -c:v rawvideo -f null NUL
Luke 42077 Posted April 21, 2017 @@Waldonnis, I'm all for adding an auto option; I think that would be fine. Do you know how ffmpeg handles the situation of no hardware decoders being available? Can auto sometimes mean none, or does it always try to use something?
Waldonnis 148 Posted April 21, 2017 (edited) @@Waldonnis, I'm all for adding an auto option; I think that would be fine. Do you know how ffmpeg handles the situation of no hardware decoders being available? Can auto sometimes mean none, or does it always try to use something? It always tries, but if it can't create a context (in my case, DXVA2, which can only be created if a decoder is available), it will use the CPU. Here are the relevant bits from local test reports, both run with auto (they're at a very high log level so I could see exactly what it was doing, so I trimmed them a bit)...

HEVC Main10 (I have no decoder for this at all, so it should fall back to CPU and does):

Applying option hwaccel (use HW accelerated decoding) with argument auto.
<deleted stuff>
[AVHWDeviceContext @ 0000000003235280] Using D3D9Ex device.
No decoder device for codec found
Error creating the DXVA2 decoder
<continues on and decodes with the CPU>

Here's one for h.264 (which I have two decoders for: GTX 970 and the IB i5):

Applying option hwaccel (use HW accelerated decoding) with argument auto.
<stuff deleted>
[AVHWDeviceContext @ 000000000383c8a0] Using D3D9Ex device.
[h264 @ 0000000000b4a660] Reinit context to 1920x1088, pix_fmt: dxva2_vld
<goes on to use the dxva2 device, which seems to be the 970>

For reference, here's one with -hwaccel dxva2 on the same HEVC Main10 file as above:

<stuff deleted; this log isn't at the same loglevel>
No decoder device for codec found
Error creating the DXVA2 decoder
dxva2 hwaccel requested for input stream #0:0, but cannot be initialized.
[hevc @ 00000000032543c0] Error parsing NAL unit #1.
Error while decoding stream #0:0: Operation not permitted
<continues spitting out errors and never decodes>

Attaching the complete reports for every scenario I tested (all jellyfish samples: one at 140Mbps HEVC Main10, one 30Mbps HEVC Main, and one 30Mbps h.264 - each attempted with auto, cuvid, dxva2, and qsv hwaccels). I can easily add more tests and run them quickly. I didn't try vpx or mpegvideo since the person I'm having run the tests has awful internet and it was a bit much to ask her to download even more... but I can add and run them locally if you want (the whole test run takes about 20 mins, so it's not a big imposition). hwaccel_tests.zip Edited April 21, 2017 by Waldonnis
Waldonnis 148 Posted April 21, 2017 Sadly, no, no such luck on the encoding side. Really, the encoder and decoder should be "paired" hardware-wise for efficiency, but I haven't checked resource use or performance differences if you mix encoders/decoders (e.g. using nvenc to decode and qsv to encode). Since DXVA2 was really meant more for playback, that scenario probably isn't considered. One other thing I just now noticed in the ffmpeg documentation:

qsv
Use the Intel QuickSync Video acceleration for video transcoding.
Unlike most other values, this option does not enable accelerated decoding (that is used automatically whenever a qsv decoder is selected), but accelerated transcoding, without copying the frames into the system memory.
For it to work, both the decoder and the encoder must support QSV acceleration and no filters must be used.

I'll have to look at the results more closely with qsv-specific decoding using the hwaccel. I've already spotted something else in the docs that is wrong, so it's possible that this isn't correct either, but I'd like to verify it anyway. I have cpu-only decode reports on every file as well, so it should be easy to compare.
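To illustrate what that doc passage describes, a full-QSV transcode where the frames stay in GPU memory would look roughly like the sketch below. The filenames and bitrate are placeholders, and it assumes both the h264_qsv decoder and encoder are available in the build:

ffmpeg -hwaccel qsv -c:v h264_qsv -i input.mkv -c:v h264_qsv -b:v 5M -c:a copy output.mkv

Per the quoted docs, adding any video filter (scaling included) would break this mode and force the frames back into system memory, which defeats the point of the qsv hwaccel.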
Luke 42077 Posted April 21, 2017 That's a shame, because it sounds like this could have been a great way to safeguard against people enabling it when their system doesn't actually support it. It sounds like that will do the trick for decoding, but not encoding though.
Waldonnis 148 Posted April 21, 2017 That's a shame, because it sounds like this could have been a great way to safeguard against people enabling it when their system doesn't actually support it. It sounds like that will do the trick for decoding, but not encoding though. Agreed. I want to do a clean test run, as I think the docs are correct on the qsv point, and my last test run was done while the system was pretty busy (gaming), which could've skewed the results. I also would love for someone to check whether auto detects/uses vaapi on the Linux side. Ultimately, if the qsv hwaccel is ignored when not using a qsv encoder, I don't think it's much of a loss for h.264, but it could really impact HEVC on Skylake and beyond. I'm hopeful that DXVA2 would still work (which auto seems to prefer anyway), though, and I'd need someone with a newer-gen processor to verify that (hopefully someone who doesn't have an nVidia card as well, to eliminate any chance it would be used by DXVA2).
Waldonnis 148 Posted April 21, 2017 Okay, just re-ran the tests and, although the system isn't totally unencumbered, no memory/GPU/CPU-heavy tasks were going on this time, so it should be close enough. I also increased the loglevels of all of my decoding tests to see more of what was going on. It looks like the docs were correct. When using -hwaccel qsv, it didn't appear to differ in performance from just using the CPU, and there was no mention of using hardware at all. Using -c:v h264_qsv rather than -hwaccel qsv, however, did work and noted hardware use ("[h264_qsv @ 0000000000c77ac0] Initialized an internal MFX session using hardware accelerated implementation"). This is very odd, but I haven't looked at the code to figure out why the two differ in implementation.

I also ran a couple of tests by hand trying -hwaccel auto and -hwaccel qsv... and pairing them with a qsv encoder (in my case h264_qsv). Interestingly, both options preferred to keep the decoding on the Intel side and initialised MFX sessions (so it's using Intel's hardware decoder in both cases, although auto seemed to take the dxva2 path to get there by creating a dxva2 context, while -hwaccel qsv seemed to use just libmfx routines directly). Using auto with h264_nvenc encoding seemingly kept the decoder and encoder matched as well (no libmfx use; a dxva2 context was created, plus other info that pointed to a pure nv "path"). It's worth noting that it's really hard to tell when hwaccel is actually working in such a scenario, so it may not be obvious unless you look hard at how it's dealing with pixel formats at high loglevels, making it tougher to tell when it's actually working with codecs like h.264 that aren't computationally taxing.

So... it looks like auto will just use the properly-matched accelerator/decoder for a given encoder (which is fantastic for many reasons, and makes perfect sense), and you wouldn't have to worry about mismatched en/decoder scenarios. The only outstanding question is whether auto (and/or qsv) will actually work for HEVC in an Intel-only scenario, and how it will do so (hopefully dxva2). If anyone with a Skylake system without a dGPU (or at least without one that supports HEVC decoding) would be willing to try some tests and post the resulting reports, I'd be very grateful (I can PM or post the commands).
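For anyone following along, the decoder/encoder pairing described above can be reproduced with a sketch like the pair of commands below. The filenames are placeholders, and which hardware path auto actually picks depends on the drivers and hardware present:

ffmpeg -hwaccel auto -i input.mkv -c:v h264_qsv -b:v 5M -an -f null NUL
ffmpeg -hwaccel auto -i input.mkv -c:v h264_nvenc -b:v 5M -an -f null NUL

Per the observations above, the first line should keep both decode and encode on the Intel side, while the second should stay on the nVidia side; the null output keeps either run from writing a file.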
lorac 118 Posted April 21, 2017 I have a Skylake i7 which has an NVIDIA card, but I can take the GPU out. I can test tomorrow. I also have excellent bandwidth, so downloading more isn't an issue if needed. Sent from my STV100-3 using Tapatalk
Waldonnis 148 Posted April 21, 2017 I have a Skylake i7 which has an NVIDIA card, but I can take the GPU out. I can test tomorrow. I also have excellent bandwidth, so downloading more isn't an issue if needed. Sent from my STV100-3 using Tapatalk Excellent, and thank you! Sorry for the delay in replying... despite my friends' beliefs, I do sleep every so often.

I've been using three jellyfish video samples for my tests (http://jell.yfish.us/), so we may as well continue that trend so the results can be reproduced easily by anyone interested. They're only 30 secs long and are relaxing to look at too. We'll be testing three files in particular, so go ahead and download these:

jellyfish-30-mbps-hd-h264.mkv (h.264, 30Mbps)
jellyfish-30-mbps-hd-hevc.mkv (HEVC Main, 30Mbps)
jellyfish-140-mbps-4k-uhd-hevc-10bit.mkv (HEVC Main10, 140Mbps - a bit over the average UHD BD bitrate; overkill for this particular test, but it's nice to keep around for encoding tests/experiments)

Since Skylake supports HEVC Main and h.264, we know that the first two videos can be decoded by your hardware. The third is HEVC Main10, which is not supported in hardware on Skylake, so I expect it to decode using the CPU. At any rate, once the videos are downloaded, run the following command for each of them (substituting the filename for FILE... and the single L in NUL is not a typo):

ffmpeg -hwaccel auto -i FILE -map 0:0 -c:v rawvideo -hide_banner -stats -report -loglevel 56 -f null NUL

This command tries to automatically detect the hardware acceleration method to use for decoding, then "transcodes" the file to raw video with the output going to... well, nowhere - it won't leave a file on the drive since we're not interested in viewing it (using null output is great for benchmarking as well). I've purposely mapped only the video stream since audio isn't important for this case. I'm also increasing the loglevel to trace to see everything possible so we can verify how it's making its hardware decoding decisions and what ends up being used. Actual performance isn't being measured here, so ignore the framerates you see in the report files that each run generates (loglevels that high, combined with reporting overhead, can skew the results anyway).

To give you an idea of what I'm looking for in the output: I'm expecting to see auto using dxva2 for decoding for the files it can decode with hardware (you'll find things like "Using D3D9Ex device" and a reference to a pixel format called "dxva2_vld" in the output). It may mention "MFX" in there instead, but that's what we're trying to find out (MFX refers to Intel's QuickSync en/decoding library, libmfx). On the Main10 file, it should try to do the same thing but can't, so you'll probably find a line like "Error creating the DXVA2 decoder" in the output. It should still decode, though, using the CPU instead, so you'll see it complete successfully.

Also, please run the command below to do a CPU-only decode of the HEVC Main file. I know roughly what the report will look like, but it will help answer an outstanding question I had about that generation of processor but could never test... and it's always nice to have a "control" testcase/dataset available for comparison:

ffmpeg -i jellyfish-30-mbps-hd-hevc.mkv -map 0:0 -c:v rawvideo -hide_banner -stats -report -loglevel 56 -f null NUL

You should now have four files named something like ffmpeg-20170421-001322.log in the directory. Just zip those up together, attach the zip to a reply, and we should be all set.
Thanks so much for testing this! And apologies for any typos or crazy wording... I just woke up.