
Server Plugin: Transcoding Tests


softworkz
Message added by softworkz,

The plugin requires Emby Server 4.8.0.50 or later

Recommended Posts

Happy2Play
6 minutes ago, SeekingWisdom said:

I am running DSM 7.1, the most recent version.

I mean the server version number, but you shouldn't be able to install it on systems that aren't on the 4.8 beta.

@FrostByte @cayars what ffmpeg version do you see on Synology? Is @SeekingWisdom's path showing old data from the change to DSM 7? I haven't seen that ffmpeg version since the 4.6.x.x days.

cross posting


FrostByte
47 minutes ago, Happy2Play said:

I mean the server version number, but you shouldn't be able to install it on systems that aren't on the 4.8 beta.

@FrostByte @cayars what ffmpeg version do you see on Synology? Is @SeekingWisdom's path showing old data from the change to DSM 7? I haven't seen that ffmpeg version since the 4.6.x.x days.

cross posting

SW is correct.  Synology users are on 4.1.8

ffmpeg version 4.1.8 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 8.5.0 (GCC)

 


41 minutes ago, FrostByte said:

SW is correct.  Synology users are on 4.1.8

ffmpeg version 4.1.8 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 8.5.0 (GCC)

Which doesn't matter in any way, because Emby is not using it.


rbjtech
9 hours ago, softworkz said:

This is surely inspired by some talks we had earlier about performance testing, but the initial focus is on functional testing. Performance plays a role as well, but more in the sense of a relative comparison of different configurations and feature usage.

For performance testing and comparison between various user setups, there will be some extra work needed, like:

  • Adding a new "Test Area" for "Performance Testing"
  • As we want comparable results, this should not allow as many selections as we have now for functional testing
    • Maybe just two test files: One H264 1080 with text and graphic subs and one HEVC 4k HDR 
    • For HWA, no combinations, just SW-SW, HW1-HW1, HW2-HW2, etc.
    • For subtitle processing mode, always "Subtitle Filtering" (the future default)
    • For subtitle overlay: Software for SW-SW and Hardware for HWx-HWx
    • For processing: Unscaled and Half-Size (or maybe just unscaled)
    • For tone mapping: None, SW for SW-SW and HW for HWx-HWx
  • An additional selection will be needed: Parallel Executions, maybe with 1, 2 and 4
    (estimations are too likely to be wrong, that's not really useful IMO)
  • The individual ffmpeg logs already contain all the information about hardware, software and driver versions, but that information would need to be included in the *.etrd files

There's no personal information in the *.etrd files, so no special anonymization would be required. But I don't think that the results should be published anonymously - rather, they should be associated with the forum user account. I think this makes the whole thing more interesting and useful. 
We can list by CPU and GPU models, but there can still be huge differences between two systems having the same values for these, which means that a result set is always specific to a certain user's system. Having such differing results in a list would be pointless and confusing unless you can contact a user to find out why your own results are so different even though the rough parameters are the same.
Also, this brings a little bit of a competitive aspect to the game 🙂 

The forum allows OAuth authentication, which would make it possible to do this without giving your credentials to the server (or the plugin), but I need to think about this.
It doesn't make much sense to have the data files in posts of a forum topic. It rather needs to be processed and aggregated in some way.

So - there's still a way to go, but at least there's a basis now.

Thanks @softworkz - I believe we are on the same page on all aspects here.

A recent issue on 'performance' of the UHD770 (where my system fps was considerably faster than on a technically identical system) highlighted that there is indeed a need to have a 'metric' associated with a given CPU or GPU - but I agree, it needs to be for a strict set of fixed scenarios.

An interesting comment about the max transcode estimations - I have used simple extrapolation based on either the main limiting factor (GPU memory) or CPU utilisation if that is not a factor (in the case of the 'unlimited' memory of the iGPUs) and found it's a useful metric. I guess break testing to get the real figure is perfectly possible - a lot easier than trying to fire off 10 or so transcodes on various clients .. 👍

Yes, I wasn't meaning to store the data in forum posts; I was thinking of storing it on the backend somehow for proper analysis, as you say.

Great stuff - I'm looking forward to where this is heading. 😎


2 hours ago, rbjtech said:

An interesting comment about the max transcode estimations - I have used simple extrapolation based on either the main limiting factor (GPU memory) or CPU utilisation if that is not a factor (in the case of the 'unlimited' memory of the iGPUs) and found it's a useful metric. I guess break testing to get the real figure is perfectly possible - a lot easier than trying to fire off 10 or so transcodes on various clients .. 👍

Once it's working, we'll be able to see how those extrapolations hold up. What I would expect is that:

  • Extrapolating from a run with 2 or 4 parallel executions will lead to better estimations than from a solo run
  • Those (linear) extrapolations are valid for a certain range only. At some point there will be saturation, and the individual run performance will degrade more sharply than the linear prediction suggests, because in that range the competition over resources leads to very frequent context switching (to satisfy each process equally), which in turn eliminates many caching effects (memory, disk, GPU memory)

I'm sure those effects exist and will be visible, but I'm curious to see to what extent and beyond which ranges they start to kick in.
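
Purely as an illustration of what I mean by extrapolation (made-up numbers, not plugin code): take the per-run speed factor from a few parallel test runs and assume the aggregate throughput stays constant:

# Illustration only: estimate how many real-time transcodes a box could
# sustain, from the per-run ffmpeg speed factor measured at 1, 2 and 4
# parallel executions (the numbers below are made up).
measured = {1: 6.2, 2: 3.4, 4: 1.9}   # parallel runs -> speed factor per run

def estimate_max_streams(measured):
    # Aggregate throughput = parallelism * per-run speed.
    # A naive linear estimate assumes that aggregate stays constant, so the
    # maximum number of >= 1.0x streams is roughly the best aggregate observed.
    aggregate = {n: n * s for n, s in measured.items()}
    return int(max(aggregate.values()))

print(estimate_max_streams(measured))   # -> 7 with the numbers above

In reality the aggregate stops scaling linearly once saturation kicks in, which is exactly what a 2- or 4-way run can reveal and a solo run cannot.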

Another point to consider is that we must not cheat (ourselves) when running parallel executions and/or trying to extrapolate: of course you want to run exactly identical tests in parallel, but if we use a single source file for all the parallel runs, that would not accurately resemble reality - as in "I can serve X 4k streams to X users in parallel" - because those X users would not be watching the same source video.
To make this realistic, if we want to run, say, 8 transcodes in parallel, we need to make 7 copies of the source file before starting, so each instance has its own source file.
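
A rough sketch of that approach (hypothetical file names, and a plain software transcode just as a stand-in for whatever the test actually runs):

import shutil
import subprocess
from pathlib import Path

SOURCE = Path("test_4k_hdr.mkv")   # hypothetical test file name
PARALLEL = 8

# Give every instance its own physical copy of the source, so OS file
# caching doesn't make the parallel run unrealistically cheap.
sources = [SOURCE]
for i in range(1, PARALLEL):
    copy = SOURCE.with_name(f"{SOURCE.stem}_{i}{SOURCE.suffix}")
    shutil.copyfile(SOURCE, copy)
    sources.append(copy)

# Start all transcodes at once and wait for them to finish.
procs = [
    subprocess.Popen([
        "ffmpeg", "-i", str(src),
        "-c:v", "libx264", "-preset", "veryfast",
        "-f", "null", "-",        # discard the output, we only care about speed
    ])
    for src in sources
]
for p in procs:
    p.wait()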

2 hours ago, rbjtech said:

is indeed a need to have a 'metric' associated with a given CPU or GPU

Calculating such an index is probably the hardest part of all. I have thought through quite a number of ways of doing this, but for each one I end up with reasons why it's not fair or not accurate. It might need a bit of looking at how other benchmarks are handling it.

2 hours ago, rbjtech said:

Great stuff - I'm looking forward to where this is heading. 😎

With all the enthusiasm, I also need to set expectations straight: what we're talking about here is actually a NON-GOAL (😭).
The actual focus is on quite different things, like improving and preserving software quality through regular testing (internal, automated and by users). A second goal is streamlining user support by reducing the time and effort needed on both sides for diagnosing and identifying issues.

The next addition will be the ability to specify a custom file to run the tests on - alongside some of the plugin's known files as a reference. Maybe with automatic posting to the forums (that's why I was talking about it above).

Those are the ideas. As you all know, these things often take much more time than one's (including my own) initial estimates, as other and more important things get in the way. And that's not even counting the benchmarking. The benchmarking part is more of a "surfer" that can ride on the waves of other work for this plugin - slowly, but almost for free. 🙂 


21 hours ago, rbjtech said:

As per the first post, this gives us a constant 'source' and 'parameters' to work with - which is good, but I would LOVE to see the results anonymized and, with permission via an option, have the raw performance stats uploaded along with the hardware used. And then present that in a table from within the plugin vs other users. An estimated max number of simultaneous transcodes (of the same type) could also be calculated.

As you can probably imagine, this kind of info is right up my alley. You can calculate the number of theoretical transcodes each Nvidia GPU can support based on specific factors such as scaling, the bitrate of the input media and the bitrate of the output media. With that info, plus info about the specific GPU such as the NVDEC & NVENC counts, bandwidth, VRAM, etc., you can calculate the number of decode and encode streams that can be handled for both H.264 and H.265 and see where the bottleneck of each GPU is, which determines the max transcodes. A particular GPU might be able to support 20 specific transcodes, but that assumes there are no other system bottlenecks in play. Similar calculations can be done for QuickSync or other vendor GPUs as well.
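
A back-of-the-envelope version of that calculation could look like the sketch below (all throughput and VRAM figures are illustrative placeholders, not official specs; you'd measure them or take them from a benchmark for your specific GPU):

# Back-of-the-envelope max-transcode estimate for a single GPU.
# Every figure below is an illustrative placeholder, not a vendor spec.
nvdec_fps_4k_hevc  = 180.0   # measured decode throughput (frames/s, 4k HEVC)
nvenc_fps_4k_h264  = 140.0   # measured encode throughput (frames/s, 4k H.264)
vram_gb            = 8.0
vram_per_stream_gb = 0.6     # observed VRAM use per transcode session
target_fps         = 30.0    # each stream must sustain real-time playback

limits = {
    "decode": nvdec_fps_4k_hevc / target_fps,
    "encode": nvenc_fps_4k_h264 / target_fps,
    "vram":   vram_gb / vram_per_stream_gb,
}
bottleneck = min(limits, key=limits.get)
print(f"max streams ~ {int(limits[bottleneck])}, limited by {bottleneck}")
# -> "max streams ~ 4, limited by encode" with the placeholder numbers above,
#    and that still assumes no other system bottleneck (disk, PCIe, CPU)
#    gets in the way first.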

Over the years I've found these numbers to be really accurate, up to the point where other system bottlenecks come into play. That, of course, is the wild card of system performance. It's one thing to read a single input file from a RAM disk and process it 20 times in parallel vs reading 20 different input files from disk while processing them. And of course there are major differences between a single 5400 RPM disk, a RAID or ZFS array, and a set of mirrored NVMe disks!

@softworkz I know this is outside the original scope of work you were interested in, but I mention it because a utility like this in "simulated GPU mode" could be really useful in addition to the current functionality. Having the ability to test maximum stream handling would be the precursor to upgrading GPUs. Taking the GPU out of the picture, how many streams could a Synology 920+ handle using 4 disks spinning at 7200 RPM vs 5400 RPM? What about using 1 or 2 NVMes set up as a typical cache? What about a non-standard setup where you have one or two NVMe disks formatted for use as transcode storage, or even as main storage for your latest 25 to 50 popular movies or TV show episodes?

3 hours ago, softworkz said:

Once it's working, we'll be able to see how those extrapolations hold up. What I would expect is that:

  • Extrapolating from a run with 2 or 4 parallel executions will lead to better estimations than from a solo run
  • Those (linear) extrapolations are valid for a certain range only. At some point there will be saturation, and the individual run performance will degrade more sharply than the linear prediction suggests, because in that range the competition over resources leads to very frequent context switching (to satisfy each process equally), which in turn eliminates many caching effects (memory, disk, GPU memory)

I'm sure those effects exist and will be visible, but I'm curious to see to what extent and beyond which ranges they start to kick in.

Another point to consider is that we must not cheat (ourselves) when running parallel executions and/or trying to extrapolate: of course you want to run exactly identical tests in parallel, but if we use a single source file for all the parallel runs, that would not accurately resemble reality - as in "I can serve X 4k streams to X users in parallel" - because those X users would not be watching the same source video.
To make this realistic, if we want to run, say, 8 transcodes in parallel, we need to make 7 copies of the source file before starting, so each instance has its own source file.

CPU caching, pinned core use, RAM speed, PCIe lane speed, system interrupts, NUMA, chipset - all come into play at the hardware level, and short of ffmpeg gaining support for direct memory access, these will always be real bottlenecks limiting the upper bound of what's truly possible.

PS: except for testing, you would not want to use separate copies of the same file for transcoding, as that would always require more resources across the board. Instead, consolidation and shared memory use, especially with compression, is where many DPU libraries are making significant performance gains. ZFS file systems, for example, gain significant performance from compression, as more bits can be held in memory pages, reducing IO. Spending more compute time (CPU or GPU) easily outpaces IO time, as it's usually a factor faster to decompress data already loaded than to transfer uncompressed data across the system bus. This is exactly how technologies like GRAID increase NVMe RAID performance 5x while otherwise hitting the system bandwidth limits mentioned above. Have a look: https://www.graidtech.com/


10 minutes ago, cayars said:

ZFS file systems, for example, gain significant performance from compression, as more bits can be held in memory pages, reducing IO. Spending more compute time (CPU or GPU) easily outpaces IO time, as it's usually a factor faster to decompress data already loaded than to transfer uncompressed data across the system bus. This is exactly how technologies like GRAID increase NVMe RAID performance 5x while otherwise hitting the system bandwidth limits mentioned above. Have a look: https://www.graidtech.com/

There's just that little detail of a problem that there's nothing left to compress when you have video files encoded with state-of-the-art codecs. Attempting to do so is just a waste of resources. Your point is valid for other kinds of data, of course.


rbjtech

I think we also need to keep our feet firmly on the ground here and not get carried away .. ;)

This is not an enterprise streaming service such as Netflix - it is a home media streaming solution, and its typical use case is probably fewer than 5 or so simultaneous users.

Now I know from forum 'chatter' that it has other extremes where maximum system performance really does matter - but my personal view is to start 'somewhere', to give users just a 'ballpark' figure of what is technically possible with their unoptimized, typical day-to-day hardware.

Something similar to the P**x transcoding charts - but with a little more depth to include tone mapping - would, I think, be a great start.

We can always add functionality.


This isn't talking about compression in the sense of H.264 or H.265 as seen in a file, but as data would be stored and transferred across a system bus or PCIe lane, through a chipset, etc., which has far more overhead that can be reduced. Take a look at the link for GRAID, which is used for increasing NVMe RAID performance (compressed or non-compressed files), or at other DPU alternatives that specialize in moving data across PCIe lanes faster than conventional means would allow.

As rbjetch mentioned this is a "home server", so only so much is warranted. I mentioned the "GPU simulation" in the other post as that could be real handy in general for system optimization of IO, if it could be added with minimal work to what you already have in place. It could/might be as simple as copying the data over to the GPU and then copying it back ASIS with no actual processing done. While not perfect could be semi-easy to do showing an upper limit of what's possible using the current system. Having to invest more dev time then something like this probably isn't worth it.


1 minute ago, cayars said:

I mentioned the "GPU simulation"

I don't understand this - what do you want to simulate?
A discrete GPU on a system which doesn't have one?
An iGPU on a system with a CPU which doesn't have an iGPU?


1 hour ago, cayars said:

short of ffmpeg gaining support for direct memory access

What gives you the idea that ffmpeg would not have direct access to memory?
DMA means that PCI components can write directly into memory without requiring the CPU to copy the data.


2 hours ago, softworkz said:

What gives you the idea that ffmpeg would not have direct access to memory?
DMA means that PCI components can write directly into memory without requiring the CPU to copy the data.

Correct. DMA allows access without CPU or chipset involvement, which ffmpeg currently doesn't support.


2 hours ago, softworkz said:

I don't understand this - what do you want to simulate?
A discrete GPU on a system which doesn't have one?
An iGPU on a system with a CPU which doesn't have an iGPU?

Think of it as a way of benchmarking everything except the actual GPU transcode itself. It would show bottlenecks in the system that aren't directly transcode-related. For example, if you can presently transcode 7 4K streams on a 2070 8GB GPU, would you be able to support 13 4K streams by upgrading to a Titan RTX 16GB, or would other bottlenecks come into play first? You could, for example, already be transcoding the optimal number of streams the system can otherwise support.

If you're able to process 20 or 21 streams with only the simulation (data copied to the GPU and back with no actual processing), you know the system itself is able to scale higher with a more powerful GPU. But if you can only "simulate" 8 or 9 streams before hitting a bottleneck, the upgrade to the Titan GPU would be rather foolish and expensive just to gain an additional stream or two.

Again, it's really outside the original scope, but it would be quite useful information to have available if it could be dropped in without much additional work.


1 hour ago, cayars said:

Correct. DMA allows access without CPU or chipset involvement, which ffmpeg currently doesn't support.

This is something which needs to be managed by a kernel-mode driver. Applications cannot do such things.


1 hour ago, cayars said:

(data copied to GPU and back with no actual processing)

A newer GPU can have more and faster RAM and may make use of newer PCIe features, etc.
I'm having a hard time seeing much value in this - not to speak of the gigantic amount of work it would take.

For now, I'd say let's focus on testing existing systems and configurations rather than on hypothetical ones... 🙂 


1 hour ago, softworkz said:

This is something which needs to be managed by a kernel-mode driver. Applications cannot do such things.

Not exactly true. There needs to be a driver that sets up memory access to the device in question (i.e. GPU, NIC, DPU), such as the Nvidia drivers, which do support DMA access. Once this is set up by the driver(s), software can then access it through the DMA API, which allows direct mapping of the virtual address space to the actual bus address space, bypassing the CPU & chipset. This is pretty much a fundamental requirement for InfiniBand and high-speed 40 & 100 GbE Ethernet. Sometimes this is done with the aid of an IOMMU for the mapping, but other times the IOMMU can be avoided as well. This even allows DMA access from containers and virtual machines, as long as the hypervisor sets things up correctly. Some manufacturers have drop-in replacement libraries that can be used to achieve acceleration like this. Nvidia, for example, has drop-in libs that can be used this way and will automatically take advantage of DMA access, GPUs and DPUs to use their hardware to its fullest.

Nvidia has a whole suite of drop-in replacement libs (math libraries, parallel algorithms, deep learning, image and video libraries, communication libraries) as well as many frameworks that allow development from the ground up with complete support for many new technologies. Two of the most interesting (to me) are GPUDirect for Video (part of Magnum IO), which enhances data movement and access for NVIDIA GPUs, and GPUDirect Storage, which does much of what was mentioned here. GPUDirect Storage creates a direct data path between local or remote storage, such as NVMe or NVMe over Fabrics (NVMe-oF), and GPU memory by enabling (multi-node) direct memory access (DMA), with the ability to move data across NICs, storage and GPU memory without burdening the CPU.


1 hour ago, softworkz said:

A newer GPU can have more and faster RAM and may make use of newer PCIe features, etc.
I'm having a hard time seeing much value in this - not to speak of the gigantic amount of work it would take.

For now, I'd say let's focus on testing existing systems and configurations rather than on hypothetical ones... 🙂 

I absolutely agree a new GPU can have faster and/or more RAM, a wider bus, etc., which could theoretically allow for more transcodes to take place. However, if the system bottleneck isn't the GPU, then replacing a currently working GPU with a faster one may be nothing but a waste of money. Hence the "simulation": the simplest operation possible using the current GPU that does no processing except copying data to the GPU and back.

You still copy to and from the current GPU but do nothing else with the data. That covers the complete IO path sans actual GPU processing. Nvidia GPUs can almost always transfer much more data than they can process through NVENC or NVDEC (and VRAM, by virtue of using NVENC/NVDEC), which is where almost all the current bottlenecks in GPU transcoding come from.

In a way, it's a similar test to copying data to a RAID caching controller and reading it back without actually writing the data to disk. It simply shows the maximum potential transfer rate, and it would show whether a bottleneck is happening before the RAID controller or after it, such as slow drives. In a similar fashion, if the controller is already maxed out, then replacing spinning disks with SSDs would gain very little in overall throughput, even if access time is reduced.

It's not important to the plugin right now, so let's not worry about it.


55 minutes ago, cayars said:

Not exactly true.  There needs to be a driver that sets up memory access t

What I said is true. Applications cannot do those things themselves. Only kernel-mode drivers can do that, and then there are user-mode APIs that communicate with the driver and which applications can use. ffmpeg actually does that for transferring data to and from GPU memory.

But we have arrived at a quite different topic now. This is about accessing non-system memory - not quite where we started above.

BTW, I know (a few of) those APIs from the inside - not from a product sheet. There are good things, but a significant share is rarely useful, for multiple reasons. Iterating through the latest buzz topics is interesting from time to time, but in the end you need to ask which of these, if any, would realistically provide a benefit for what you want or need to do - and then, usually, all that remains is some shade of diminishing vapor 🙂 


1 hour ago, cayars said:

I absolutely agree a new GPU can have faster and/or more RAM, a wider bus, etc., which could theoretically allow for more transcodes to take place. However, if the system bottleneck isn't the GPU, then replacing a currently working GPU with a faster one may be nothing but a waste of money. Hence the "simulation": the simplest operation possible using the current GPU that does no processing except copying data to the GPU and back.

OK, now I understand what you're up to after all.

I think that when it comes to finding bottlenecks, there's no need to implement any simulations. You can never get this right without gigantic effort, and it will always remain a simulation that might not match reality for a zillion reasons.

So I think it makes much more sense to orchestrate this by just doing it "for real". We actually do have ways to craft "real" tests so that they max out very specific parts of the overall process while other parts are configured in a way that they don't play any role.

For example, we could have an Nvidia GPU decode a black video, which means that the transfer of the source video is negligible, and the same goes for the decoding work. Then we can hwdownload the frames and hwupload them again, maybe overlaying just a single pixel (or none). Finally, the encoder can be set to 1 fps with all features switched off (or maybe there's even some no-op mode).

Something like that would allow us to get a fairly focused indication of memory transfer performance, for example.
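
A very rough sketch of such a focused test on an Nvidia GPU, driving ffmpeg from Python (the exact filters and encoder options would need tuning, and black_4k.mp4 is just a placeholder input):

import subprocess

# Decode a (nearly free) black 4k video on the GPU, bounce every frame
# through system memory via hwdownload/hwupload_cuda, and throw the result
# away. -benchmark makes ffmpeg print timing stats at the end. A real test
# would also dial the encoder's work down as far as possible.
cmd = [
    "ffmpeg", "-benchmark",
    "-hwaccel", "cuda", "-hwaccel_output_format", "cuda",
    "-i", "black_4k.mp4",                        # placeholder source file
    "-vf", "hwdownload,format=nv12,hwupload_cuda",
    "-c:v", "h264_nvenc",
    "-f", "null", "-",
]
subprocess.run(cmd, check=True)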

1 hour ago, cayars said:

In a way, it's a similar test to copying data to a RAID caching controller and reading it back without actually writing the data to disk. It simply shows the maximum potential transfer rate, and it would show whether a bottleneck is happening before the RAID controller or after it, such as slow drives. In a similar fashion, if the controller is already maxed out, then replacing spinning disks with SSDs would gain very little in overall throughput, even if access time is reduced.

Yeah, that's all correct. It's just that, in the case of transcoding, another problem is that things are not as easily predictable as disk performance.

It depends on the exact specifics of the kind of transcoding you perform. Since we've been talking about memory transfer, let's look at the following example:

  • We have a high quality source video @4k
  • We want to decode this in hw
  • Then we want to overlay subtitles and we have to do that in software
  • Then, we want to re-encode the video @4k again

That whole procedure means that we need to

  1. upload the source video to the GPU, let it decode and
  2. then we download the decoded video to sys mem and
  3. let the CPU do the subtitle burn-in.
  4. Afterwards, we need to re-upload the "burnt" video frames
  5. and let the GPU encode it.
  6. Finally we download the encoded result from the GPU to sys mem.

It's quite obvious that this is not ideal - but it still doesn't sound dramatic. One might think: ideal would be one upload to the GPU and one download at the end, but hey, "that's just double the effort - I can live with that". At that point, you couldn't be more wrong. What gets in between is math.

When we have a really high-quality 4k source video, it might have a bitrate of 80 Mbps, and let's assume that we're encoding to similar quality.
80 Mbps means 10 MB per second that we need to upload, and we get another 10 MB/s for the final download - 20 MB/s in total - which is not a serious task for any GPU.

But then we need to download the frames for subtitle burn-in, and that's a different story - because those frames are uncompressed. A 4k frame (24-bit) is about 25 MB. At a framerate of 30 fps, we need to transfer 30 * 25 MB = 750 MB per second, and the same again for re-uploading, which makes 1.5 GB/s of memory transfer.

This is 75 times more than up-/downloading of the encoded video.
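
The arithmetic spelled out, with the same numbers (a quick sanity check, not plugin code):

# Compressed stream: ~80 Mbps in, ~80 Mbps out
mbps = 80
compressed_each_way = mbps / 8                  # 10 MB/s
compressed_total = 2 * compressed_each_way      # 20 MB/s up + down

# Uncompressed 4k frames for the software subtitle burn-in
width, height, bytes_per_pixel, fps = 3840, 2160, 3, 30
frame_mb = width * height * bytes_per_pixel / 1e6    # ~24.9 MB per frame
uncompressed_total = 2 * frame_mb * fps              # download + re-upload

print(f"{compressed_total:.0f} MB/s vs {uncompressed_total:.0f} MB/s "
      f"(~{uncompressed_total / compressed_total:.0f}x)")
# -> 20 MB/s vs 1493 MB/s (~75x)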

After all, I think you can calculate, extrapolate, simulate, estimate and whatever - but as long as you don't know your users, your content and your existing hardware, chances are much higher that you end up wrong than right. But if you do know all this and have a record of experience, then you will likely be able to give a better estimate from experience than any simulation I could develop 😉 

2 hours ago, cayars said:

It's not important to the plugin right now, so let's not worry about it.

Yeah, let's look at what we've got before talking about what we'd like to have.


  • 3 weeks later...
GWTPqZp6b

It loads from my local browser

It loads from the docker container

# wget https://mediabrowser.github.io/genericedit_dx/genericedit_dx.js
Connecting to mediabrowser.github.io (185.199.108.153:443)
wget: note: TLS certificate validation not implemented
saving to 'genericedit_dx.js'
genericedit_dx.js    100% |**************************************************************************| 68209  0:00:00 ETA
'genericedit_dx.js' saved

 


GWTPqZp6b

Oh... it's probably my reverse proxy header configuration... let me check on it after work today for you.

[Screenshot: Transcoding Tests, 2022-10-25]


GWTPqZp6b

Fixed, @softworkz, with the following adjustments, thanks.

font-src 'self' https://mediabrowser.github.io;
script-src 'self' https://www.gstatic.com https://mediabrowser.github.io;
style-src 'unsafe-inline' 'self' https://mediabrowser.github.io;

 

