
HW Acceleration maxes out nVidia RTX 4000 when playing back 4k movie on 1080p display


MBSki


rbjtech
18 minutes ago, mbarylski said:

Initial tests look really good. I'm getting about 300 fps transcoding to 1080p 40 Mb. After about 4 minutes, I had already transcoded a 30 minute buffer! Still testing out the settings, but it seems I can even get past the errors in my files with HW transcoding only. I'll keep testing though as what I saw so far might have been a fluke.

Colors look good too. On par with actual 1080p versions. Looks just about ready for stable! 😁 

I believe you will still be limited by the actual memory on the RTX (max 8-9 simultaneous transcodes) - but @ 300 fps that gives you a massive buffer of nearly 40 fps each. Nice! 👍
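To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the 300 fps is split evenly across 8 streams and that the source runs at 24 fps; the source frame rate is an assumption, it is not stated in the thread.

```python
def per_stream_fps(total_fps: float, streams: int) -> float:
    """Evenly split the encoder's total throughput across simultaneous streams."""
    return total_fps / streams


def realtime_factor(stream_fps: float, source_fps: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return stream_fps / source_fps


share = per_stream_fps(300, 8)         # figures quoted in the post above
factor = realtime_factor(share, 24.0)  # 24 fps source is an assumption
print(f"{share:.1f} fps per stream, {factor:.2f}x real time")
```

Anything above 1x real time means each stream is transcoded faster than it is played back.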


MBSki
Just now, rbjtech said:

I believe you will still be limited by the actual memory on the RTX (max 8-9 simultaneous transcodes) - but @ 300 fps that gives you a massive buffer of nearly 40 fps each. Nice! 👍

Yep, agreed! After about 15 minutes I could have close to a full 2 hour movie in buffer!

It's amazing how well this works. I've got it set to HW transcoding only and even my error prone files don't skip a beat. Incredible!


 

1 hour ago, mbarylski said:

It's amazing how well this works. I've got it set to HW transcoding only and even my error prone files don't skip a beat. Incredible!

😎

1 hour ago, mbarylski said:

Yep, agreed! After about 15 minutes I could have close to a full 2 hour movie in buffer!

The movie is not buffered in the GPU. The GPU never holds more than about 20-100 frames in memory.

Also, the limit is most likely not the GPU's memory being "full", but rather the memory bandwidth for getting the data in and out of the GPU (not to mention storage IO).

The processing itself is trivial for Nvidia GPUs - that's just about nothing.


MBSki
4 minutes ago, softworkz said:

The movie is not buffered in the GPU. The GPU never holds more than about 20-100 frames in memory.

Really? Interesting. So what's responsible for the buffering?


5 minutes ago, mbarylski said:

Really? Interesting. So what's responsible for the buffering?

It's buffered as HLS segments in the transcoding-temp folder.


arrbee99

Just as a matter of interest (and to demonstrate my complete lack of understanding of such matters), would one expect a noticeable increase in performance from, say, having the transcoding-temp folder on an NVMe drive as opposed to a spinning HDD (which I do), or from storing the actual movies on NVMe drives (no doubt an expensive proposition)?


8 minutes ago, arrbee99 said:

Just as a matter of interest (and to demonstrate my complete lack of understanding of such matters), would one expect a noticeable increase in performance from, say, having the transcoding-temp folder on an NVMe drive as opposed to a spinning HDD (which I do), or from storing the actual movies on NVMe drives (no doubt an expensive proposition)?

That's difficult to answer without having a clearly defined usage profile. But for getting the maximum performance I would do the following (in that order):

  1. Add a dedicated SSD for the transcoding-temp folder (and nothing else)
    => This is where most of the IO happens
  2. Add a dedicated SSD for the Emby Cache folder (and nothing else)
    => This is for images and other metadata
  3. Add a dedicated SSD to which Emby is installed (instead of installing it onto the OS HD)
    => The Emby Database and logs are stored here

For the library content itself, I would never use SSDs. When you have a huge library you'll have many HDs, and the IO will naturally distribute between these (simultaneous users are unlikely to all watch something that is on the same HD).


For 1, 2 and 3, you don't need large capacity. Really small drives will do; it's much more important that they are separate.

PS: I'm not exactly sure what you're asking for, so just to make this clear: I'm not talking about a family + 2 friends setup...


arrbee99

Well, yes, I'm asking basically for general performance hints, especially re transcoding and tone mapping, and of course your answer might reveal something that's easy(ish) and cheap(ish) to do. And while it does sound easy and cheap, I am indeed in the family + 2 friends setup (well, just slightly scattered family actually), so I think I'm good to go.

Besides, I'm saving my two remaining drive letters for big HDDs...


MBSki
40 minutes ago, softworkz said:

PS: I'm not exactly sure what you're asking for, so just to make this clear: I'm not talking about a family + 2 friends setup...

I'm in a family + 2 friends setup as well, in which case I think what you're saying is there isn't going to be much benefit in splitting up the drives. I can't imagine I'm coming anywhere close to saturating the available bandwidth on my OS nvme that also has Emby and Plex. I would think the bottleneck is going to come from the HDD's that store the media wouldn't it?

1 hour ago, softworkz said:

It's buffered as HLS segments in the transcoding-temp folder.

 Clarification question...doesn't the GPU create the segments and drop them into the transcoding-temp folder? So while the GPU might not STORE the temp files, it's responsible for generating them. Am I understanding that correctly?


28 minutes ago, mbarylski said:

Clarification question...doesn't the GPU create the segments and drop them into the transcoding-temp folder? So while the GPU might not STORE the temp files, it's responsible for generating them. Am I understanding that correctly?

No, the GPU doesn't do this. It doesn't actually do anything on its own. ffmpeg does all that. It manages the data flow from system to GPU memory and tells the GPU what to do on a frame-by-frame basis. Finally, ffmpeg retrieves the encoded video frames back into system memory, multiplexes the hw-encoded video stream together with the audio and subtitle streams, cuts the result into segments and writes those segments to disk.

Then, the client requests those segments via http, and Emby Server transmits them to the client.
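For illustration only, here is a minimal sketch of what such an ffmpeg invocation could look like, built as an argument list in Python. This is not Emby's actual command line: the file names, the 6-second segment length and the software `scale` filter are assumptions chosen for the example. Only the 1080p / 40 Mbit target comes from this thread.

```python
import shlex


def build_hls_command(source: str, temp_dir: str, segment_seconds: int = 6) -> list[str]:
    """Assemble a hypothetical ffmpeg call: GPU decode/encode, HLS segments on disk."""
    return [
        "ffmpeg",
        "-hwaccel", "cuda",            # decode on the Nvidia GPU
        "-i", source,
        "-vf", "scale=1920:1080",      # downscale 4K to 1080p (software filter here)
        "-c:v", "h264_nvenc",          # encode video on the GPU
        "-b:v", "40M",                 # target bitrate mentioned in the thread
        "-c:a", "aac",                 # audio conversion runs on the CPU
        "-f", "hls",                   # mux into HLS: playlist plus segments
        "-hls_time", str(segment_seconds),
        "-hls_segment_filename", f"{temp_dir}/seg_%05d.ts",
        f"{temp_dir}/stream.m3u8",
    ]


cmd = build_hls_command("movie-4k.mkv", "transcoding-temp")
print(shlex.join(cmd))
```

The point of the sketch is the division of labour: ffmpeg drives everything, the GPU only decodes and encodes frames, and the segments end up as ordinary files in the temp folder.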

28 minutes ago, mbarylski said:

I'm in a family + 2 friends setup as well, in which case I think what you're saying is there isn't going to be much benefit in splitting up the drives. I can't imagine I'm coming anywhere close to saturating the available bandwidth on my OS nvme that also has Emby and Plex. I would think the bottleneck is going to come from the HDDs that store the media, wouldn't it?

Well - if all content is on a single HD and more than about 6 users are simultaneously watching high-bitrate videos, then (and only then) it might become a bottleneck, if the drive has really low performance.

But consider the following: all you need (per user) is to load the data from the HD at playback speed. For example:

  • When you have a 90-minute movie of 30 GB, it is sufficient if the HD can read that movie within 90 minutes.
  • Now let's assume those 6 simultaneous users
  • That makes 6 * 30 = 180 GB
    => When your HD is able to copy (read!) 180 GB within 90 min, it's sufficient for those 6 simultaneous users

(an HD that can't do this would be so old that you'd have to wonder how it's still running 🙂 )
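The same calculation as a small sketch, converting the example's figures into a sustained read rate. It assumes decimal units (1 GB = 10^9 bytes) and ignores seek overhead from interleaved reads, so treat the result as a lower bound on what the drive has to deliver.

```python
def required_read_rate_mb_s(movie_gb: float, runtime_min: float, users: int) -> float:
    """Sustained read rate (MB/s) needed to feed `users` streams at playback speed."""
    total_bytes = movie_gb * 1e9 * users   # decimal gigabytes (assumption)
    seconds = runtime_min * 60
    return total_bytes / seconds / 1e6


rate = required_read_rate_mb_s(movie_gb=30, runtime_min=90, users=6)
print(f"{rate:.1f} MB/s for 6 users")
```

For the example above this comes out at roughly 33 MB/s in total, or under 6 MB/s per user.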

The bottlenecks are where I said above. Usually not at the storage side.

Edited by softworkz

Unless, of course, you're transcoding; then you're adding a write/read step, if that too is on the same drive. Then things start to slow down much faster, as the access times and head positions matter.

This is especially bad for DVR or Live TV all using the same disk, as it becomes IO-bound quickly.


For live TV - in the case of plain viewing - the timeshift buffer lives under transcoding-temp, which means that segments are written by the receiver and read by the server to send them to the clients. In the case of transcoding, both happen: the original stream is written as segments, and the transcoded segments are written and then read when sending to clients.

In sum, that is 2 x writing + 1 x reading on the transcoding-temp drive. For a TV recording, it's just 1 x writing to the recording-folder-drive.

So, again - the hotspot is 'transcoding-temp'.
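A small sketch that mirrors this tally for the transcoding-temp drive. It counts only the operations listed in this post and makes no claim about what else Emby does internally; the function name is made up for the example.

```python
from collections import Counter


def live_tv_temp_io(transcoding: bool) -> Counter:
    """Count per-stream IO operations hitting the transcoding-temp drive."""
    ops = Counter()
    ops["write"] += 1      # receiver writes the original stream as segments
    if transcoding:
        ops["write"] += 1  # transcoded segments are written as well
    ops["read"] += 1       # server reads segments to send them to clients
    return ops


print(live_tv_temp_io(transcoding=False))
print(live_tv_temp_io(transcoding=True))
```

A recording, by contrast, is a single write to the recording folder's drive, which is why transcoding-temp is the place where the IO piles up.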


MBSki
10 minutes ago, softworkz said:

In sum, that is 2 x writing + 1 x reading on the transcoding-temp drive. For a TV recording, it's just 1 x writing to the recording-folder-drive.

So, again - the hotspot is 'transcoding-temp'.

Hmmm, I may have to add an nvme drive for transcoding-temp. Right now I've got OS, Emby, and transcoding-temp on nvme mirrored drives. The mirror would double the writing. I wonder if that contributed to the issue I had with seeking during a live TV recording (DVR Recording in progress freezes while watching and using live seeking - Android TV / Fire TV - Emby Community). Still kind of surprising though given how much bandwidth nvme's are supposed to be able to handle. Does PCIe 4.0/5.0 alleviate this bottleneck at all?


4 minutes ago, mbarylski said:

Hmmm, I may have to add an nvme drive for transcoding-temp. Right now I've got OS, Emby, and transcoding-temp on nvme mirrored drives. The mirror would double the writing. I wonder if that contributed to the issue I had with seeking during a live TV recording (DVR Recording in progress freezes while watching and using live seeking - Android TV / Fire TV - Emby Community). 

I'm not that familiar with the current TV feature in Emby, but my initial impression would be: no. That sounds like a different issue.

 

4 minutes ago, mbarylski said:

Still kind of surprising though given how much bandwidth nvme's are supposed to be able to handle. Does PCIe 4.0/5.0 alleviate this bottleneck at all?

I didn't follow this in detail, but the last time I looked into it, most nvme's were kind of SATA SSDs with a different interface. The interface speeds are often used in marketing, but they don't reflect the actual "drive" read/write performance. To my knowledge, nvme makes sense only for "drives" that can provide higher transfer rates than SATA.

Besides that, when it comes to expectations and estimation of hardware performance, I have noted that very often, the way people think about it is somewhat inaccurate.

Simple example:

  • You have that super-fast SSD, having transfer rates that are way above your requirements
  • You do a video hw transcoding
  • But the processing is trivial, and the GPU hardware works much faster than your SSD can supply and accept the video data
  • What happens is - the SSD is getting maxed out
  • At the same time you have an Emby server running
  • But the UI hangs and all actions and navigation are getting slow
  • This happens, because the Emby database and image cache are on the same SSD (which is maxed out).
  • "Maxed Out" doesn't mean that it can't keep up. It might process the video at 10x or 50x speed
    => you would never need it to run that fast, but it does. And you'll have a bottleneck even though, by calculation, your resources would be more than sufficient.
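A rough sketch of why this happens: the disk load scales with the transcoding speed, not with the playback speed. The 40 Mbit/s output matches the target mentioned earlier in this thread; the 80 Mbit/s source bitrate is an assumption for a 4K file.

```python
def disk_load_mb_s(source_mbit: float, output_mbit: float, speed_factor: float) -> tuple[float, float]:
    """Return (read, write) load in MB/s when transcoding at `speed_factor` x real time."""
    read = source_mbit * speed_factor / 8    # source file is read this fast
    write = output_mbit * speed_factor / 8   # segments are written this fast
    return read, write


for speed in (1, 10, 50):
    read, write = disk_load_mb_s(source_mbit=80, output_mbit=40, speed_factor=speed)
    print(f"{speed:>2}x: read {read:.0f} MB/s, write {write:.0f} MB/s")
```

At 1x the load is negligible, but at 50x the very same job asks for hundreds of MB/s, which is what starves the database and image cache sharing that SSD.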
     

 


MBSki
5 minutes ago, softworkz said:
  • "Maxed Out" doesn't mean that it can't keep up. It might process the video at 10x or 50x speed
    => you would never need it to run that fast, but it does. And you'll have a bottleneck even though, by calculation, your resources would be more than sufficient.

That's so messed up. Don't understand why there would be a bottleneck if resources are sufficient. So strange. 🥴

Really appreciate all your explanations though. I'll probably try another SSD and put transcoding-temp on it just out of curiosity. I think I have an extra SSD laying around somewhere.


59 minutes ago, mbarylski said:

That's so messed up. Don't understand why there would be a bottleneck if resources are sufficient. So strange. 🥴

Yes it is. Sometimes it's mind-bending...

I have told this story a number of times already: at some point, I had a mid-range GeForce which had already served its duty, but I was still getting acceptable transcoding results with it.

Finally I got a new GPU which was really powerful. I started Emby, ran a few transcodes and saw my CPU usage going up to 100% sometimes. This had never happened with the previous GPU.

The reason was: the new GPU could do the video transcoding so fast that my CPU couldn't even keep up with just doing the audio conversion. That's impressive for the GPU, but on the other hand, my system was running at its limit and became slow and less responsive - after installing a better GPU...


MBSki
23 minutes ago, softworkz said:

The reason was: the new GPU could do the video transcoding so fast that my CPU couldn't even keep up with just doing the audio conversion. That's impressive for the GPU, but on the other hand, my system was running at its limit and became slow and less responsive - after installing a better GPU...

LOL, yea, I'm hoping the same thing doesn't happen to me. I'm not ready for a new CPU! 😅


  • 1 year later...
iiiJoe
On 2/24/2021 at 3:56 PM, softworkz said:

That's difficult to answer without having a clearly defined usage profile. But for getting the maximum performance I would do the following (in that order):

  1. Add a dedicated SSD for the transcoding-temp folder (and nothing else)
    => This is where most of the IO happens
  2. Add a dedicated SSD for the Emby Cache folder (and nothing else)
    => This is for images and other metadata
  3. Add a dedicated SSD to which Emby is installed (instead of installing it onto the OS HD)
    => The Emby Database and logs are stored here

For the library content itself, I would never use SSDs. When you have a huge library you'll have many HDs, and the IO will naturally distribute between these (simultaneous users are unlikely to all watch something that is on the same HD).

Hi, guys. Is this still the best procedure for enhanced performance? 
