
Posted

I think we may switch decoding to use that auto param.

Waldonnis
Posted (edited)

Here you go! I hope you find the info relevant :)

 

I do!  The results were very interesting, actually.  I'll need to go through them more thoroughly to verify what I'm seeing, but a cursory glance shows dxva2 use for "auto" h.264 (expected) and CPU use for HEVC Main10 (with a puzzling line about it using a d3d device, which I'll need to look at more closely).  The HEVC Main report is another one I'll have to examine carefully, since I have no local result to compare it to.

 

Performance-wise, there's another minor puzzle - the CPU and "auto" decoding of HEVC Main seem to be rather close, with the CPU actually edging out the hardware decoder.  It may be because the CPU decoding is multi-threaded while hardware decoding is likely forced internally to be single-threaded.  I'm also not sure whether AVX2 instructions are used on the CPU decoder side of things, but that could make a difference as well.  I'll have to double-check for hardware use in this case to be sure, and look at the hwaccel code to see if it forces single-threading.

 

If you'd like, you can run the tests again and observe CPU load for each, just to get an idea of whether the CPU is being taxed in the various scenarios (probably the most telling will be the HEVC Main CPU vs "auto" decoding scenarios).  You should see comparatively light CPU load using auto compared to just omitting the "-hwaccel auto".  In my local tests with h.264, it was VERY obvious when hardware decoding was being used (~32% overall vs ~99% overall).  I've run a few tests in the past doing simultaneous hardware decodes vs CPU-bound ones, and the results led me to use hardware for "lightweight" decoding-focused tasks (Roku BIF file creation, image extraction, etc.; I still use software decoding when quality is more important than time, and for transcoding).

 

I'll look over these logs more closely throughout the day.  I definitely thank you for your time and help with this!

 

Small edit:

If you just want to benchmark decoding performance without all of the logging stuff and the performance impact that comes with it, you can run this command:

ffmpeg -hwaccel auto -i FILE -hide_banner -f null NUL -benchmark

I specifically went with the rawvideo "transcode" for my tests for a reason, but the above works great for just checking speed, and it's better for eyeballing CPU use during the tests since it has less overhead than my tests do.  You can omit the "-hwaccel auto" if you just want to use CPU decoding rather than trying hardware decoders.  At the end of each run, it'll print two "bench:" lines, with "utime" being time spent on the CPU and "maxrss" being the maximum memory footprint.
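For example, a back-to-back comparison on the same file would look like this (FILE is just a placeholder for whatever sample you're testing):

ffmpeg -hwaccel auto -i FILE -hide_banner -f null NUL -benchmark
ffmpeg -i FILE -hide_banner -f null NUL -benchmark

The difference in the "bench:" lines (and in Task Manager while each run is going) should make it pretty clear whether the hardware path is actually being used.  If you want to roughly poke at the threading question from earlier, you could also add "-threads 1" before the "-i" on the software run to pin the decoder to a single thread - not a perfect apples-to-apples comparison, but a decent approximation.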

Edited by Waldonnis
Posted

Sure I can have a look at CPU usage.

 

Sent from my STV100-3 using Tapatalk

Posted

The new beta server switches to hwaccel auto, and now also applies that for the recording conversion process, since that's relatively safe.

Posted

So would that implement the CPU fall back request that people have been asking for?

 

Sent from my STV100-3 using Tapatalk

Posted

So would that implement the CPU fall back request that people have been asking for?

 

Sent from my STV100-3 using Tapatalk

 

It has nothing to do with what you're talking about. It's what waldonnis and I were discussing on the previous page.

Posted

OK. Well, get on that fallback CPU FR request! What are we paying you for, Luke? [emoji14]

 

Sent from my STV100-3 using Tapatalk

Waldonnis
Posted (edited)

OK. Well, get on that fallback CPU FR request! What are we paying you for, Luke? [emoji14]

 

Sent from my STV100-3 using Tapatalk

 

Hehehe, I have to giggle about that since I've been trying to think of ways to overcome the nvenc concurrent stream limitations and plan on writing some test scripts for it.  The decoding thoughts stemmed from some earlier posts and my latest experiments with stereoscopic rendering and decoding, so I thought I'd revisit the decoding side in the thread since it was on my mind anyway.

 

This talk and testing is all on the decoding side, though.  It should lighten the load a bit more when hardware transcoding and could have some small benefits even for software transcodes.  Plugins like the Roku thumbnail (bif) generator or server tasks like chapter image extraction could even use it to reduce system load for their operations if desired.
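Just as a rough sketch of the kind of lightweight job I mean (the file name, timestamp, and output name here are placeholders, not anything Emby actually runs), a single chapter/thumbnail grab with hardware decoding would look something like:

ffmpeg -hwaccel auto -ss 00:05:00 -i FILE -frames:v 1 -q:v 2 thumb.jpg

For BIF-style tasks that pull frames from all over a file, that decode work adds up fast, so letting the GPU handle it keeps the CPU largely free for other things.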

Edited by Waldonnis
Posted

The NVENC limit is annoying and seems rather arbitrary. I'm sure these modern GPUs are capable of more.

 

Sent from my STV100-3 using Tapatalk

Posted

I ran the tests again, and the first two didn't seem to impact CPU usage very much. With the last two tests, though, I noticed a much larger jump in CPU usage.

Waldonnis
Posted (edited)

The NVENC limit is annoying and seems rather arbitrary. I'm sure these modern GPUs are capable of more.

 

Sent from my STV100-3 using Tapatalk

 

It is arbitrary...and it isn't.  It's really complicated.  Practically, the driver enforces what they seem to like calling a "license limit" of 2 streams, but current gen cards are capable of handling more simultaneous threads.  Where things start to get complex is when you start looking at memory bandwidth and other hardware factors, and how the encoder uses the card's memory.  In my earlier searches, there were some really good explanations of how a 980Ti compares to a similar-generation Quadro memory bandwidth-wise and how it affects hardware video encoding, but I can't find them right now.  I recall that it was hard to sift through some of nV's own replies/docs about this because they were always so vague about it...not sure why, but they're pathologically vague about everything (look up even developer info on 3D Vision one day...ugh; the only thing they document well is CUDA, thankfully).

 

While you could disable the limit by hacking the driver (at least in the past, but I'm sure this is still true), in practice you could crush the performance quite quickly if you start piling a bunch of streams onto the card.  Given the increased capabilities of Pascal and even later-gen Maxwell (GTX 960), they really should've increased the concurrent h.264 stream limit above 2 (the encoder on those goes up to 8k HEVC, which is many times as much data to move per GOP).  I don't think they wanted to make a pseudo-scheduler just to dynamically assign/reject encoding tasks based on available resources or encoding parameters, though, and just stuck with 2 since it was a safe estimate for the worst-case scenario of 8k HEVC HDR stream encoding.

 

Their goal in setting the limits seems to be to maintain good performance without burdening the memory bandwidth so much that it affects the primary function of the card or causes bad things to happen on the video side (frame drops, mostly) in live streaming scenarios.  For good or ill, the two-stream limit does make some sense from a hardware perspective, but it's also frustrating to deal with when you're doing relatively lightweight 720/1080p h.264 encodes and know you could comfortably be doing ~16 streams or more, since you're not lobbing giant 8k HEVC hand grenades at your card that would stress it above 2 streams...

 

...stupid double reply...edited out

Edited by Waldonnis
Posted

I did the tests with some larger 10-bit HEVC files, and while CPU usage occasionally spiked a little, it was generally under 20%.

Waldonnis
Posted

I did the tests with some larger 10-bit HEVC files, and while CPU usage occasionally spiked a little, it was generally under 20%.

 

Nice!  I just remembered that Skylake can decode HEVC Main10 with a hybrid solution (I must've been thinking of encoding earlier), which explains that d3d oddity I mentioned earlier in the decoding test report.

Posted (edited)

First, I did not change anything in my config setup, but since one of the last beta server updates I've been getting this line in all of my transcode logs, which was not there before:

[AVHWDeviceContext @ 03e511c0] Failed to create Direct3D device
.
.
.
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_nvenc))
  Stream #0:1 -> #0:1 (dts (dca) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
[AVHWDeviceContext @ 03e511c0] Failed to create Direct3D device
frame=   55 fps=0.0 q=14.0 size=       1kB time=00:00:02.47 bitrate=   3.1kbits/s speed=4.93x    
frame=  116 fps=116 q=14.0 size=       1kB time=00:00:05.01 bitrate=   1.5kbits/s speed=   5x
.
.
.

Does this have anything to do with the NVENC improvements @@Luke mentioned in some of the past changelogs, and is it something I need to worry about?

 

 

Edit: I reviewed my transcode logs, and in the logs from April 19th the error line was not present. Between that date and today, unfortunately, no titles were transcoded, so any server version in between could have introduced the error. Currently on the latest beta.

 

Edit 2: @@Luke Commit https://github.com/MediaBrowser/Emby/pull/2593 in 3.2.13.2 beta release is causing the line.

Edited by shorty1483
Waldonnis
Posted

First, I did not change anything in my config setup, but since one of the last beta server updates I've been getting this line in all of my transcode logs, which was not there before:

[AVHWDeviceContext @ 03e511c0] Failed to create Direct3D device
.
.
.
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (h264_nvenc))
  Stream #0:1 -> #0:1 (dts (dca) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
[AVHWDeviceContext @ 03e511c0] Failed to create Direct3D device
frame=   55 fps=0.0 q=14.0 size=       1kB time=00:00:02.47 bitrate=   3.1kbits/s speed=4.93x    
frame=  116 fps=116 q=14.0 size=       1kB time=00:00:05.01 bitrate=   1.5kbits/s speed=   5x
.
.
.

Does this have anything to do with the NVENC improvements @@Luke mentioned in some of the past changelogs, and is it something I need to worry about?

 

 

Edit: I reviewed my transcode logs, and in the logs from April 19th the error line was not present. Between that date and today, unfortunately, no titles were transcoded, so any server version in between could have introduced the error. Currently on the latest beta.

 

Edit 2: @@Luke Commit https://github.com/MediaBrowser/Emby/pull/2593 in 3.2.13.2 beta release is causing the line.

 

Can I see a complete transcoding log, and is the transcode completing or failing?

 

ffmpeg's trying and failing to create the device context that the decoder will use and I'm looking at the code to see what may be going wrong there (it actually tries two methods, so this message is showing up after both have failed).  I can see a few reasons in the code why it could, but not sure if those situations would apply here.  Is the server headless by any chance?

Posted (edited)

Can I see a complete transcoding log, and is the transcode completing or failing?

 

ffmpeg's trying and failing to create the device context that the decoder will use and I'm looking at the code to see what may be going wrong there (it actually tries two methods, so this message is showing up after both have failed).  I can see a few reasons in the code why it could, but not sure if those situations would apply here.  Is the server headless by any chance?

 

Thanks for your answer. Server is not headless. Windows 7 Enterprise x86. Specs in my signature. Log attached...

 

Edit: It's transcoding without failing, and the GPU is used (checked with GPU-Z). But I haven't watched a complete movie since the update.

ffmpeg-transcode-8c469f7f-a959-4afe-a284-c9a1a42c20d0.txt

Edited by shorty1483
Waldonnis
Posted

Thanks for your answer. Server is not headless. Windows 7 Enterprise x86. Specs in my signature. Log attached...

 

Edit: It's transcoding without failing, and the GPU is used (checked with GPU-Z). But I haven't watched a complete movie since the update.

 

Hmm.  Looking over the code and as I mentioned before, the dxva2 acceleration tries twice to create D3D devices: once with D3D9ex functions and, if that fails, again with D3D9 functions (which is where the error message is coming from).  There aren't many things that could cause this.  I'm inclined to think it's a DirectX or driver installation problem, since the same device creation functions are used for games and I've seen similar messages before in that context, but I'm still looking into it.  Practically, the message is somewhat irrelevant to the overall transcode, but it does indicate that it's not using dxva2 for decoding (leading me to wonder just what is being used; I'd guess CPU off-hand, but the use of the nvenc encoder may force the attempted use of cuvid).

 

Any chance you can PM a report from dxdiag?  DXVA Checker may reveal interesting info as well, but that would take a little explaining for it to make much sense if you aren't familiar with it (very handy tool for some types of DXVA testing/troubleshooting).
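If it helps narrow things down in the meantime, one thing you could try (just a suggestion - FILE is a placeholder for any local video) is forcing dxva2 by itself with verbose logging, outside of Emby:

ffmpeg -loglevel verbose -hwaccel dxva2 -i FILE -hide_banner -f null NUL

If the same "Failed to create Direct3D device" line shows up there, then it's a DirectX/driver-level issue rather than anything Emby is doing with its command line.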

mediacowboy
Posted (edited)

Just want to make sure I am understanding what I am seeing.

I built a new server with an i5-6500. I set Emby up on it and enabled QSV. Now when I stream live TV, it direct plays most of my live TV, where on the old one it would transcode. So is Emby passing the stream to QSV, and QSV is giving Emby a signal it understands and passing it on?

Please understand, this is exactly what I bought this server for. Just want to make sure there isn't something else I'm missing.

Edited by mediacowboy
Guest asrequested
Posted

Just want to make sure I am understanding what I am seeing.

 

I built a new server with an i5-6500. I set Emby up on it and enabled QSV. Now when I stream live TV, it direct plays most of my live TV, where on the old one it would transcode. So is Emby passing the stream to QSV, and QSV is giving Emby a signal it understands and passing it on?

 

Please understand, this is exactly what I bought this server for. Just want to make sure there isn't something else I'm missing.

 

Direct playing has no server involvement. It's straight to your playing device. It will only transcode if needed or asked to.

mediacowboy
Posted (edited)

Direct playing has no server involvement. It's straight to your playing device. It will only transcode if needed or asked to.

That's what I thought too, but using the blue neon app on a Roku 3 and streaming the same channel both times, one will direct play while the other transcodes. The only other thing I can think of, if it is not QSV, would be the OS: Windows 10 (direct stream) vs Windows 7 (transcode). Either way, I don't care - just a nice, surprising find.

Edited by mediacowboy
Guest asrequested
Posted

If the Roku doesn't support something in the stream, the server will transcode it, and then QS (if enabled) will be used.

mediacowboy
Posted (edited)

If the Roku doesn't support something in the stream, the server will transcode it, and then QS (if enabled) will be used.

Yeah, I don't know. I will have to run more tests.

 

The only difference between the tests I ran last night on multiple channels was the server. I only have a single HDHomeRun Prime hooked up for TV, and the dashboard would show direct play on the new server (Windows 10 and QS capable) and transcoding on the old server (Windows 7 and no QS).

 

@@Luke, could the way the server handles tuner sharing be a factor?

Edited by mediacowboy
Guest asrequested
Posted

Do you have max streaming bitrate set to auto? It may have set it low at that time.
