
M3U IPTV CPU usage: best path for upgrading?



voodoomurphy

@cayars

So I am running a repurposed HP Z840 workstation with 128 GB of RAM and 2 M2000 GPUs (hardware acceleration is not enabled at this time). Both the OS drive and the cache drive are SSDs. The entire purpose of this system is to send out internally sourced IPTV streams via an M3U file. The streams are MPEG2, so they have to be transcoded before being sent out. I can hit about 14 streams before the CPU hits 100% and everything starts to stutter and freeze. Ideally I'd like to hit about 20-25 streams for what I have in mind.

Here are my questions:

1) What optimizations can I make to improve performance on this system?

2) If this system is "maxed", then what are my best options for a system that can support more output streams? Which CPU stat matters more: cores or GHz?

3) What does the Apple Silicon path look like moving forward? Has anyone tested the M1 via Rosetta with IPTV transcoding?


Hello voodoomurphy,

** This is an auto reply **

Please wait for someone from staff support or our members to reply to you.

It's recommended to provide more info, as explained in this thread:


Thank you.

Emby Team


On 8/11/2022 at 8:07 AM, voodoomurphy said:

…

Your HP Z840 likely has 2 Intel Xeon E5-2630v3 at 2.4GHz or close to that. It's probably in the 17K passmark range with 16 cores. I've got a couple T5500's from Dell with similar specs.

If you don't have the SSDs yet, make sure to purchase NVMe drives with speeds around 3,000 MB/s read/write, which you can get on a PCIe Gen 3 bus. If you think you'll be upgrading the PC to a PCIe Gen 4 system, spend a couple bucks more per NVMe and get 6,000+ MB/s sticks.

I'm assuming the GPUs are Nvidia M2000 with 4GB memory. In theory an M2000 should be able to transcode about 13 streams from 1080p (10 Mb/s) MPEG2 to 1080p (8 Mb/s) H.264. For transcoding, the M2000 is just about the same as a GTX 1050 (4GB) GPU. At present Emby will only use one GPU. I would turn on hardware encoding and test this out on your own equipment to see how close you get to 13 streams. You could have other bottlenecks keeping you from reaching full GPU potential. Emby should then be able to fall back to CPU for the remaining streams, which should get you into the 20-25 stream range with the hardware you already have, at no cost.
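That "test it yourself" step can be scripted outside Emby. A minimal sketch, assuming ffmpeg is built with NVENC/NVDEC support; the multicast URL and bitrate below are placeholders, not values from this thread:

```python
# Build an ffmpeg command for one MPEG2 -> H.264 hardware transcode with the
# output discarded, so you can launch N copies and see where the GPU tops out.
# Assumptions: ffmpeg with NVENC/NVDEC; the source URL is a placeholder.

def nvenc_test_cmd(source_url, bitrate="8M"):
    """Return an ffmpeg argv list for a single GPU decode+encode pass."""
    return [
        "ffmpeg",
        "-c:v", "mpeg2_cuvid",   # NVDEC hardware MPEG2 decoder
        "-i", source_url,
        "-c:v", "h264_nvenc",    # NVENC hardware H.264 encoder
        "-b:v", bitrate,
        "-f", "null", "-",       # throw the output away; we only want load
    ]

# Run several of these in parallel and watch nvidia-smi's enc/dec utilization.
print(" ".join(nvenc_test_cmd("udp://239.1.1.1:5000")))
```

Launching copies of this until frames start dropping gives a stream count to compare against the ~13-stream estimate above.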

An M2000 with 4GB memory will transcode the same number of streams (13) going to 720p H.264 @ 4 Mb/s. A Quadro RTX 5000 (16GB) or Titan RTX (16GB) would both increase the number of streams handled in hardware. These are both two generations newer as well, which will help a great deal with quality and stream size. Both should be able to handle 26 streams 1080 to 1080 (as above) and 46 streams 1080 to 720 (as above). There are other options to explore GPU-wise as well, including used cards on eBay.

Let's take a step back and get a fuller picture of how this will work and if there is another way to go about this.

Are these source streams created in-house or do you get these on-demand from a 3rd party?
If in house, is it possible to have them created using H.264 video codec vs Mpeg2?
Are these time sensitive, as in produced at 1pm and need to be viewed by 2pm?

What I would ask is how many unique input sources you have. If we're talking a dozen or fewer streams, you may be better off setting up the streams to first transcode on the fly to H.264 video (outside of Emby) and feeding that to Emby. Done correctly, Emby will be able to send these out and they will be direct played by most clients. You could then stream to many more clients.
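One way to sketch that "transcode once, outside Emby" idea. Everything here is a hypothetical placeholder (channel names, URLs, ports), not the OP's actual setup:

```python
# Generate one ffmpeg re-stream command per source: MPEG2 in, H.264 out as
# MPEG-TS over UDP. Emby's M3U tuner then points at the local H.264 copies,
# so each channel is transcoded exactly once no matter how many clients watch.
# Assumptions: all channel names, URLs and ports below are placeholders.

SOURCES = {
    "channel-01": "udp://239.1.1.1:5001",
    "channel-02": "udp://239.1.1.2:5002",
}

def restream_cmd(input_url, out_port):
    """One software MPEG2 -> H.264 re-stream, served locally over UDP."""
    return [
        "ffmpeg", "-i", input_url,
        "-c:v", "libx264", "-preset", "veryfast", "-b:v", "8M",
        "-c:a", "aac", "-b:a", "192k",
        "-f", "mpegts", f"udp://127.0.0.1:{out_port}",
    ]

commands = [
    restream_cmd(url, port)
    for port, url in enumerate(SOURCES.values(), start=6001)
]
for cmd in commands:
    print(" ".join(cmd))
```

The M3U handed to Emby would then list the `udp://127.0.0.1:600x` copies instead of the raw MPEG2 sources.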

Is this server bare metal or running a hypervisor environment? What OS are you running?
What else are you running on this server? If not running other jobs we can put that memory to better use as well.

Carlo

What you have to keep in mind is that if you fix one bottleneck in the system, it will become faster until it hits the next bottleneck, and so on. What I would check is disk bandwidth. When your CPU gets to roughly 80% use, what's the disk throughput and what's its utilization? Same with networking.

You'll likely want 10Gb networking if you don't have this already.  If disk bandwidth becomes a problem you'll likely want to stripe multiple NVMe disks together to "bond" their bandwidth together.
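Rough arithmetic on the network side, using the bitrates discussed above (10 Mb/s MPEG2 in, 8 Mb/s H.264 out; one viewer per stream is an assumption):

```python
# Back-of-envelope NIC budget for the 25-stream target.
streams = 25
inbound_mbps = streams * 10      # MPEG2 sources coming in
outbound_mbps = streams * 8      # H.264 going out, one viewer per stream
total_mbps = inbound_mbps + outbound_mbps

print(total_mbps)                # 450 Mb/s -- under half of a 1 GbE link,
                                 # but several viewers per stream, plus disk
                                 # and OS traffic, pushes this toward 10 GbE
```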


voodoomurphy
3 minutes ago, cayars said:

…

Are these source streams created in-house or do you get these on-demand from a 3rd party? Created in house. They come from slightly older hardware we can't upgrade at this time.
If in house, is it possible to have them created using H.264 video codec vs MPEG2? No. These devices are limited to MPEG2, unfortunately.
Are these time sensitive, as in produced at 1pm and need to be viewed by 2pm? Everything needs to be viewed as live as possible.

What I would ask is how many unique input sources do you have? About 40, 20 of which are critical.

Is this server bare metal or running a hypervisor environment? What OS are you running? Bare metal. They are running Windows 11 (not my call on that one).
What else are you running on this server? If not running other jobs we can put that memory to better use as well. Hauppauge WinTV software to bring in some content via RF. These are single purpose systems.

Video cards are M4000 8GB models. With hardware encoding on, we got about 8 streams before the system started to stutter.

The system does have SSDs for the main and caching drives, but not in the 3,000 or 6,000 MB/s class.


rbjtech

As per the hardware stream - if you can provide a sample of the TS - then it's probably possible to create a proof of concept using a modern CPU and iGPU which will put your current hardware to shame at a fraction of the cost and power ... ;)

iGPUs are not VRAM limited - on a 64GB system, 32GB is allocated to the GPU ... lol

[screenshot: Windows showing 32GB of shared GPU memory on a 64GB system]

 

Edited by rbjtech


This is a 64GB system right? And you only have an integrated video card (QuickSync) correct?
Windows will assign 1/2 your memory by default to GPU Shared memory.

Windows shared memory isn't the same thing as what you would find on a dedicated GPU such as Nvidia.  Nvidia uses VRAM which is much faster than your PC memory.  On the other hand integrated GPUs normally have no memory of their own so they require shared memory in order to function. 

You take a performance hit when you run out of VRAM and have to use virtual GPU memory (shared memory) as the latency is greatly increased.

CPUs need low latency, but can’t use a lot of bandwidth, because CPUs are serial (yes, I know there are multi-core processor machines - they are multiple serial cores).

GPUs need a lot of bandwidth, but latency isn’t as important.

Dual-channel DDR4 is 128-bit wide. Modern GDDR5 GPUs are 256 to 512-bit wide and getting wider.

The other obstacle is GPU interface bandwidth. Graphics cards are traditionally add-on cards, so they are constrained by the motherboard bus bandwidth (from PCI, AGP, to now PCI-E).

An i7-9700K has main memory bandwidth of roughly 80 gigabytes per second. An older Nvidia GeForce GTX 1050 Ti has a memory bandwidth of about 112 GB/second.

Now to show why "shared memory" is a bad idea. If you plug the GTX 1050 Ti into a Gen2 PCI Express x16 bus, you have a maximum of 8 GB/second, as that's all that bus can handle. A typical modern Gen4 x16 slot will give you about 32 GB/second of throughput. Now imagine trying to use shared memory with an RTX 3080 that has roughly 760 GB/sec of memory bandwidth! It's a major bottleneck for PCIe cards and why they don't use shared memory other than as a "swap file", just as Windows uses disk by swapping out memory. NOTE: An integrated GPU doesn't go through the bus to access memory, so it can use shared memory much more easily and faster.
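Those bus-ceiling figures can be sanity-checked with the commonly cited per-lane PCIe throughputs (approximate usable rates, one direction):

```python
# Approximate usable PCIe bandwidth per lane, in GB/s, by generation.
PER_LANE_GBPS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}

def pcie_bandwidth(gen, lanes=16):
    """Total one-direction bandwidth for a slot of the given gen/width."""
    return PER_LANE_GBPS[gen] * lanes

print(pcie_bandwidth(2))          # 8.0 GB/s -- the Gen2 x16 case above
print(round(pcie_bandwidth(4)))   # ~32 GB/s -- the modern Gen4 x16 case
# Either way, far below the ~760 GB/s an RTX 3080 gets from its local
# VRAM -- which is the whole point about shared memory over the bus.
```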

So if purchasing an Nvidia card for transcoding the amount of memory on the card is very important.

If you compare an RTX 3080 with 8GB, 10GB & 16GB and do a transcode from 1080p 10Mb to 1080p 8Mb, these would all do about 23 transcodes, as the additional memory isn't needed. If you did a 4K 64Mb (SDR) to 1080p 8Mb, you would get 6, 8 & 11 transcodes respectively, as the additional memory is now needed.

 

 


rbjtech

Hi Carlo,

I hope that explanation wasn't specifically for me - as respectfully, I'm well aware of all of that.

You may not be aware that the ONLY parts of a GPU used during media decoding and encoding are the DEDICATED enc/dec modules within the silicon - these have NOTHING to do with the 3D aspect of the card. It is the 3D rendering that has the huge bandwidth demands, but we are not talking about 3D, so I'm not sure where you are going with all the above.

If you look in the link below - you will see that the number of enc/dec chips remain consistent across the Nvidia card technologies - so a 'low end 3D card' has exactly the same enc/dec performance as the high end card (within that card 'series') - the only difference is the amount of VRAM.

https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

This is also why my 7-year-old GTX 1070 (8GB) can transcode just as many 4K streams as an RTX 3090 (8GB)!

For an iGPU - not only is this enc/dec silicon on the same physical chip as the cpu - it has dedicated buses (ie doesn't have to use ANY PCIe lanes..) to all the other aspects of the CPU and channels to the shared memory.  Moving data from memory to VRAM is a huge part of transcoding - but with an iGPU it can be a lot more efficient.  Yes the memory is slower than dedicated VRAM, but it doesn't matter - because the memory is not the bottleneck.

The current integrated UHD770 is on par with the best Nvidia has to offer for mainstream cards (the RTX 3090) in terms of transcoding performance - and if the transcodes needed more memory than the RTX 3090 physically has, then given enough shared memory (easily expandable) the UHD770 would outperform it. The extremely expensive Quadro cards with large memory have multiple enc/dec units, so those will no doubt outperform the iGPU.

I expect the Nvidia 4000 series will up the game and no doubt include AV1 encoders (as will the 13th Gen iGPUs), so it will be interesting to see what's around the corner.

But in summary, I think people need to understand that the most expensive GPU is not going to be any more capable at transcoding than a cheaper GPU with the same amount of memory, and a modern Intel iGPU (10th gen onwards) will probably wipe the floor with it for a fraction of the cost and power consumption....

Edited by rbjtech

20 hours ago, voodoomurphy said:

Video cards are M4000 8GB models. With hardware encoding on, we got about 8 streams before the system started to stutter.

The system does have SSDs for the main and caching drives, but not in the 3,000 or 6,000 MB/s class.

Is this the Quadro M4000 with 8GB memory?  In an ideal world that GPU could handle roughly 13 transcodes but the older hardware may be holding it back a bit.

What OS are you running?

Is the 128GB memory available to use at will?

Are your SSDs 2.5" models that connect like HDDs? If so, they top out around 500MB/s in bandwidth, though with much faster access (latency) than spinning disks.

What would help a lot with transcoding is having a fast NVMe drive (or two, striped) that you move Emby's transcode folder to.

Besides Optane, which is expensive, the very best "pro-consumer" NVMe drives that I've come across (reading and testing) are these:
$127 SK hynix Platinum P41 PCIe NVMe Gen4
READ up to 7,000 MB/s, WRITE up to 6,500 MB/s
READ up to 1,400K IOPS, WRITE up to 1,300K IOPS
4K random read/write IOPS (Max.): 750K/750K
1TB: 750TBW, 2TB: 1200TBW
Aries SSD controller with 176-Layer TLC flash and over 200MB DRAM delivering up to 1.4 million IOPS
Single-sided board, making it great for laptops.

Uses its own specialized DDR DRAM designed for mobile applications with lower power usage. DRAM has far lower latency than NAND, so it is particularly useful for many small and random I/O operations. This is especially true for writes, as they require an update of the look-up table. The in-house Aries controller runs the I/O interface at 1,600 MT/s.

$135 Kingston KC3000 1TB
READ up to 7,000 MB/s, WRITE up to 7,000 MB/s
READ up to 900K IOPS, WRITE up to 1,000K IOPS
1TB: 800TBW
Uses the popular Phison PS5018-E18 controller paired with Micron's 176L TLC NAND flash operating at 1,600 MT/s. Only SK hynix and Kingston run at 1,600 MT/s.

Those are currently the best two NVMe drives for pro-consumer use. They are both faster in many ways than Optane drives.

$109 ADATA PREMIUM SSD FOR PS5 Internal SSD 1TB PCIe Gen4x4
Sabrent Rocket 4 Plus
7,400/6,800MB/s read/write
4K random read/write IOPS (Max.): 750K/750K
2TB: 1480TBW
While this drive is fast, it won't do extended writes at a steady rate, which makes it hard to use in any type of RAID or stripe.

For PCIe Gen 3 these are ideal:

$85 SK hynix Gold P31 Internal SSD
Read 3500MBps; Write 3200MBps

$90 XPG GAMMIX S7 Series Internal SSD
Read 3500MBps; Write 3000MBps


rbjtech

I'm not sure why you keep listing these numbers Carlo - they are all meaningless to the OP as they are using an Intel C612 chipset on their 8 year old workstation - it has PCIe Gen 2 !

There is also nothing particularly special about the numbers above - it's the PCIe revision that makes them fast.

This is from my own NVME drive - on a standard 980 Pro - these are actual benchmark figures, not some marketing 'MAX' BS.

[benchmark screenshot: 980 Pro NVMe read/write results]

In my original hardware post - I suggested a RAM disk as that is likely the only real way to get very fast I/O on this old hardware - but in all honesty, until the OP does some testing and identifies the bottleneck, throwing 'the fastest' hardware at an issue and hoping is just the wrong way to go about this.

I suspect running parallel operations/processes for horizontal scalability is the only real solution to this.

As this is a specialist transcoding operation, @softworkz may have some pointers on the best way to process MPEG2/TS to h264  

