Multiple Thumbnail Extract Processes


runtimesandbox


ryzen5000

"Faster IO" How do I accomplish this?

"Multiple codecs" Yes, I see that. I left it at the default with just NVDEC selected. I was looking at it yesterday and read online somewhere that CUVID is deprecated; should I have them both selected? Can I add more than one GPU to the Unraid docker template, or do I just say "all" and more will show up in Emby? I haven't tried this yet; I will test it.

NVDEC NVIDIA GeForce RTX 3050 - MPEG-2
CUVID NVIDIA GeForce RTX 3050 - MPEG-2

ryzen5000
8 minutes ago, softworkz said:

Very good! 🙂 

For scanning please see my reply to your other post: https://emby.media/community/index.php?/topic/113885-chapter-images-not-displayed-in-tv-series/&do=findComment&comment=1201660

For extraction: This is a single-threaded operation. Decoding a key frame is not a task that benefits from multiple cores, nor does it benefit from GPU decoding. I think I have explained that already.
Faster IO is the only thing you could do to accelerate thumbnail extraction.

@Luke will need to respond to that.

 

Yes, this has always been planned for, and the architecture is there. You can also see that you can already select multiple GPUs for each codec in a priority order.
The only bit that is missing is some logic to determine under which conditions the other GPU should be chosen.
We're not far away from making that possible, technically.

I can assure you that it wouldn't accelerate this. Not even a tiny bit.

It would accelerate the old way of extraction. But the old way with GPU is still much slower than the new way (where CPU vs. GPU doesn't differ).

Got it. I have these two GPUs I spent so much money on, and I had all sorts of ideas on ways to use them, but they are mostly just doing transcoding for Tdarr: one on the controller and one on a node. I want to put them to work in Emby.


3 minutes ago, ryzen5000 said:

Got it. I have these two GPUs I spent so much money on, and I had all sorts of ideas on ways to use them, but they are mostly just doing transcoding for Tdarr: one on the controller and one on a node. I want to put them to work in Emby.

At the moment, it doesn't make sense to have two GPU boards for Emby. Hopefully soon..


Happy2Play

With the right script/command line you could run as many processes outside of Emby as you like.  From a one-and-done standpoint, there is no need for Emby to go 100 miles an hour all the time.  Once the process is complete against existing media, Emby can generate thumbs/bif as fast as you add media.  For a backlog you just have to wait, or create them outside of Emby.


17 hours ago, softworkz said:

key frame is not a task that benefits from multiple cores, nor does it benefit from GPU decoding. I think I have explained that already.
Faster IO is the only thing you could do to accelerate thumbnail extraction.

16 hours ago, Happy2Play said:

With the right script/command line you could run as many processes outside of Emby as you like.  From a one-and-done standpoint, there is no need for Emby to go 100 miles an hour all the time.  Once the process is complete against existing media, Emby can generate thumbs/bif as fast as you add media.  For a backlog you just have to wait, or create them outside of Emby.

While both of these are true, Emby Server could process multiple media files at a time, up to some load threshold (CPU or storage) or a configurable maximum number of threads. With consumer PCs offering more and more threads, common functions like this could be made to scale out to use the hardware available to them. $500 or so these days can get you a used dual-Xeon server with 22 cores per CPU, giving you 88 threads or vCPUs. Emby at present doesn't scale out well to make use of these resources. Doing multiple things in parallel, even if each is a single-threaded operation, would make better use of resources and allow processes to complete faster with little impact on other ongoing operations, assuming you don't cause an IO bottleneck.
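The bounded scale-out described above could be sketched like this. This is a hypothetical illustration, not Emby code; the extract callable is a stand-in that in practice would invoke ffmpeg via subprocess:

```python
# Hypothetical sketch of bounded parallelism: run up to max_workers
# extraction jobs at once, each job itself single-threaded.
from concurrent.futures import ThreadPoolExecutor

def run_batch(files, extract, max_workers=4):
    """Process many media files concurrently, never more than max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order while jobs run concurrently
        return list(pool.map(extract, files))

# Example with a dummy job standing in for the real extraction:
results = run_batch(["a.mkv", "b.mkv", "c.mkv"], lambda f: f + ".bif")
print(results)  # ['a.mkv.bif', 'b.mkv.bif', 'c.mkv.bif']
```

The cap on workers is the point here: it is what would keep a backlog job from saturating the whole machine.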

17 hours ago, ryzen5000 said:

Got it, I have these two GPU I spent so much money on I had all sorts of ideas on ways to use them and they are mostly just doing transcoding for tdarr, One on the controller and one is on a node. I want to put them to work in EMBY

At present, using an external program like Tdarr is probably the best use of multiple GPUs, making use of any unused GPU resources. You should be able to set the priority higher for Emby than for Tdarr and have both using the GPUs at the same time (assuming Emby adds multi-GPU support), for maximum use of PC resources.

I use Tdarr this way over my LAN with Tdarr having access to more than a dozen GPUs. Same with CPU cores/threads running other background tasks.

 


rbjtech
18 hours ago, Happy2Play said:

@rbjtech didn't you have a script/writeup somewhere?

Not specifically for multiple extract processes I'm afraid.

The ffmpeg command to do the extraction is actually listed in the 'quick-image-series' logs, but these days it's dynamic based on content (HDR, for example, now gets tonemapping), so the best (and probably quickest) way forward is to just spin up multiple copies of Emby, point them at different libraries and let them run.

For a script: if you start adding any sort of intelligence it will get complex quickly, but a 'dumb' script should be fairly easy: just iterate through the folders (libraries), each one using a separate ffmpeg process.
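A minimal sketch of such a 'dumb' script might look like this. Assumptions for illustration only: the quick-extraction arguments quoted later in this thread, .mkv sources, one output folder per video, and one sequential "lane" per library folder:

```python
# Sketch of a "dumb" per-library extractor. Not a tested tool; paths,
# the .mkv filter and the 10-second interval are assumptions.
import subprocess
from pathlib import Path

def build_cmd(video: Path, out_dir: Path, interval: int = 10) -> list[str]:
    """Quick-extraction command line for a single video."""
    return ["ffmpeg", "-loglevel", "error", "-threads", "1",
            "-skip_interval", str(interval), "-copyts",
            "-i", str(video), str(out_dir / "img_%05d.jpg")]

def extract_library(folder: Path) -> None:
    """Process this library's videos one after another (a single lane)."""
    for video in sorted(folder.rglob("*.mkv")):
        out_dir = video.with_suffix("")      # Movie.mkv -> Movie/
        out_dir.mkdir(exist_ok=True)
        subprocess.run(build_cmd(video, out_dir), check=True)

# One lane per library, all lanes running concurrently:
# from threading import Thread
# lanes = [Thread(target=extract_library, args=(p,))
#          for p in Path("/media").iterdir() if p.is_dir()]
# for t in lanes: t.start()
# for t in lanes: t.join()
```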

@Cheesegeezer and I added the BIF creator for HDR sources in the mediainfo plugin before it was added to the core, so technically this could probably be adapted to create an SDR BIF as required (if it does not already exist) and then spool multiple ffmpeg processes based on a user-selectable number in a config file or in the GUI.  It could be library-based or ItemId-based, so it would have that advantage.

 


 

3 hours ago, cayars said:

Dual XEON server with 22 cores per CPU giving you 88 threads or vCPUs. Emby at present doesn't scale out well to make use of these resources.

Once again: Image extraction is NOT a matter of CPU or GPU resources.

3 hours ago, cayars said:

assuming you don't cause an IO bottleneck.

This is an assumption that Emby cannot make.

Playback issues caused by such situations (e.g. stuttering playback) are hard to identify, and typically users won't even make the effort to identify such cases; they will rather conclude that Emby is not working properly and has playback problems. But those same users would have activated all kinds of "parallel image extraction" options beforehand. That's why we don't offer any. Better a very small number of users dissatisfied about non-parallel image extraction than a large number of users who think that Emby doesn't work well.


4 minutes ago, softworkz said:

Once again: Image extraction is NOT a matter of CPU or GPU resources.

It is when the overall process takes days to complete on a large system, while running in parallel could cut the time by an order of magnitude. What we're talking about is processing multiple files at the same time, regardless of how efficiently the individual process runs.

Some processes don't lend themselves to multiple threads, don't need them, or have a limit to how many threads are useful. ffmpeg doing AVC encoding is an example where 8 to 10 threads is usually the maximum useful for performance reasons.  But if you have a 32-, 64-, 88- or 128-thread machine with threads available, you could have up to a dozen of these going at the same time, each running about as fast as if it were the only process. In this case you could reprocess a library roughly 12 times faster.  The efficiency of the individual process might already be optimal, so no gain can be had there, but you can run multiple instances of it processing different media. Ideally, since it's a background task, it gets run at a lower priority than other processes, so it only uses resources not needed by processes running at normal priority.


2 hours ago, rbjtech said:

but these days it's dynamic based on content - HDR for example now does tonemapping 

Correct. There are also certain cases where the commands are different or even the classic extraction is being used.

2 hours ago, rbjtech said:

 so the best (and probably quickest) way forward is to just spin up multiple copies of emby, point to different libraries and let it run.

Yup, that's definitely the best option. Also, Emby might move away from bif in the future (Roku, the "inventors" of the format, have officially deprecated it), so I wouldn't spend much effort on this. Even more so considering that it's probably not just about image extraction but the whole scanning process.

Emby is a "Personal Media Server". A professional media server would never combine content preparation, data management and content serving in a single product.

One or more separate Emby instances for content preparation is the best you can do in case of Emby, when you want to go into such directions.


3 minutes ago, cayars said:

ffmpeg doing AVC encoding is an example where 8 to 10 threads is usually the max for performance reasons.

This conversation is not about AVC encoding. It's about image extraction.


Happy2Play

In the end this just makes the INITIAL import slow for Emby, but day-to-day operations are unaffected, as the processes are as fast as you add the media.

But doing all of these processes while expecting normal operations is quite a reach, in my opinion, and users will blame Emby instead of their parallel-import choices.


I think the problem is that many do not understand what exactly we are talking about here, and how on earth we could come to claim such "nonsense" as that more CPU or GPU power wouldn't be able to make image extraction faster.

So, let's run a test that is reproducible and that everybody can do on their own system.

Step 1 - Download

Download the test file: http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_2160p_30fps_normal.mp4
(to make sure we all get comparable results)

Step 2 - Prepare

  • Open a command line
  • Create or go to an empty temporary folder
  • Copy the downloaded video into that folder
  • Determine the ffmpeg path
    • Windows default:  C:\Users\admin\AppData\Roaming\Emby-Server\system\ffmpeg.exe
    • Debian/Ubuntu: /opt/emby-server/bin/ffmpeg-emby
    • Or copy from an ffmpeg log
  • Create a sub-folder named out

Step 3 - Run Tests

3.1 - Classic Extraction with SW Decoding - 1 Thread

ffmpeg -loglevel +timing -threads 1 -i bbb_sunflower_2160p_30fps_normal.mp4 -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 out\img_%05d.jpg

3.2 - Classic Extraction with SW Decoding - Use all Threads

ffmpeg -loglevel +timing -threads 0 -i bbb_sunflower_2160p_30fps_normal.mp4 -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 out\img_%05d.jpg

3.3 - Classic Extraction with HW Decoding - CUVID

ffmpeg -loglevel +timing -c:v h264_cuvid -i bbb_sunflower_2160p_30fps_normal.mp4 -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 out\img_%05d.jpg

(NVDEC is the same, so I'll leave that out)

3.4 - Classic Extraction with HW Decoding - QSV

ffmpeg -loglevel +timing -init_hw_device qsv="dev1:hw2,child_device=1,qsv_use_dx11=1" -hwaccel qsv -hwaccel_device dev1 -c:v h264_qsv -i bbb_sunflower_2160p_30fps_normal.mp4 -an -sn -threads 0 -vf "hwdownload,format=nv12,fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 out\img_%05d.jpg

(this command is for Windows; you might need to change hw2 to hw and child_device to 0)

3.5 - Quick Extraction - 1 Thread

ffmpeg -loglevel +timing -threads 1 -skip_interval 10 -copyts -i bbb_sunflower_2160p_30fps_normal.mp4 out\img_%05d.jpg

3.6 - Quick Extraction - All Threads

ffmpeg -loglevel +timing -threads 0 -skip_interval 10 -copyts -i bbb_sunflower_2160p_30fps_normal.mp4 out\img_%05d.jpg

 

Step 4 - Compare and Discuss Results

Let's see which results you get.

Edited by softworkz
Updated QSV command

Here are my results:

#    Test                                   Duration
3.1  Classic with SW Decoding, 1 Thread     257s
3.2  Classic with SW Decoding, All Threads  43s
3.3  Classic with HW Decoding, CUVID        91s
3.4  Classic with HW Decoding, QSV          285s
3.5  Quick Extraction, 1 Thread             6.5s
3.6  Quick Extraction, All Threads          6.7s

 


Happy2Play

Here are my results:

#    Test                                   Duration
3.1  Classic with SW Decoding, 1 Thread     306s
3.2  Classic with SW Decoding, All Threads  67s
3.4  Classic with HW Decoding, QSV          297s
3.5  Quick Extraction, 1 Thread             6.9s
3.6  Quick Extraction, All Threads          6.87s

 


Step 4 - Compare and Discuss Results

Now, as I'm sure you will all get similar results, let's take a look at what they tell us.

 

Classic Extraction with SW Decoding

#    Test                                   Duration
3.1  Classic with SW Decoding, 1 Thread     257s
3.2  Classic with SW Decoding, All Threads  43s

Classic extraction means that the whole video is decoded.
Of course this can be accelerated by using more threads.
This is what you all know and have in mind when thinking about ways to accelerate image extraction.

 

Classic Extraction with HW Accelerated Decoding

The HW decoding results are a bit unfair to compare, because they would be faster when scaling down in hardware rather than in software, so I'll skip over them.
But what's certain is that they could never be as fast as quick extraction.

 

Quick Extraction

Note for occasional readers:
Quick Extraction is an Exclusive Emby Feature.
Nobody else has this - at the time of writing.

#    Test                            Duration
3.5  Quick Extraction, 1 Thread     6.5s
3.6  Quick Extraction, All Threads  6.7s

Let's look at 3.5 first:

6.5s means it is 6.6 times faster than 3.2.

But not just that: in case of 3.2, your system is under 80-100% load for 43s!
In case of 3.5, a single core is under 100% load for 6.5s, so we actually need to compare that to 3.1, and that means about 40 times faster at the same CPU load.

How about 3.6?

@Happy2Play might be able to confirm this: in case of 3.6, you'll see a higher CPU load from the ffmpeg process.

But it does not end up any faster. Why is that?

The reason is that enabling more threads makes ffmpeg decode frames in parallel. But what we actually need is a single frame every 10 seconds.
Before getting the next frame, it always needs to stop decoding and seek forward in the file, then decode the next frame.
By enabling multi-threaded decoding, ffmpeg will decode more frames in parallel, but we don't need any of them. They are just thrown away, at the cost of a multiple of the CPU load.
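The frame-count arithmetic behind this, using the test file from Step 1 (roughly 634 s at 30 fps; figures approximate), shows how little decoding work quick extraction actually needs, which is also why extra cores or GPUs have almost nothing left to speed up:

```python
# Rough frame-count arithmetic for the Big Buck Bunny test file
# (~634 s at 30 fps; approximate, for illustration only).
duration_s = 634
fps = 30
interval_s = 10                                  # one thumbnail every 10 seconds

frames_decoded_classic = duration_s * fps        # classic: decode every frame
frames_needed_quick = duration_s // interval_s   # quick: one key frame per interval

print(frames_decoded_classic, frames_needed_quick)  # 19020 63
```

The measured ratio (about 40x between 3.1 and 3.5) is smaller than the raw frame ratio because seeking and key-frame decoding aren't free, but the point stands: there is almost nothing left to parallelize.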

That's why I always said that the quick extraction cannot be accelerated. Not with GPUs and not with CPUs. 

But the most important bottom line is:

Emby's Quick Extraction is multiple times faster than anybody else can do it, no matter how much more hardware you throw at it.

So please stop complaining and thinking you could make image extraction any faster. This is already the best that's possible.
And we do not allow parallel extraction for the reasons explained above.


45 minutes ago, Happy2Play said:
#    Test                                   Duration
3.1  Classic with SW Decoding, 1 Thread     306s
3.2  Classic with SW Decoding, All Threads  67s
3.4  Classic with HW Decoding, QSV          297s
3.5  Quick Extraction, 1 Thread             6.9s
3.6  Quick Extraction, All Threads          6.87s

Thanks a lot for doing those tests!

One more interesting point when comparing to my results is that my CPU is about 1.2x/1.55x as fast as yours (comparing 3.1/3.2).

But in case of Quick Extraction (3.5) we got almost the same result.

That means that if @Happy2Play upgraded to my CPU, it wouldn't improve the image extraction.

q.e.d.


rbjtech

I'll add my results later, but the other factor you have mentioned, @softworkz, is the other 'bottlenecks': where is your source coming from? At one end of the performance scale it could be a local nvme drive; next is local directly connected storage (sata/sas); it could be network connected storage (nas); it may even be in the cloud.  What about the destination for the temp files? Again, it could be a fast nvme drive, or a 'slow', badly fragmented HDD.  If you look at the process in detail, you will note a huge amount of I/O: 720 'temp' files per 2hrs of video at a 10-second interval, so on 1000 movies that's 720K files written (and that doesn't include the initial read overhead either).  I don't know what this is in terms of cpu %, but I would guess it's high enough to vary the results, especially if using cpu-intensive 'storage' such as nas/cloud.
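That file-count estimate works out as follows (same assumptions as above: 2-hour movies, one image every 10 seconds, a 1000-movie library):

```python
# File-count arithmetic for the IO overhead described above.
movie_len_s = 2 * 60 * 60                     # a 2-hour movie, in seconds
interval_s = 10                               # one image every 10 seconds
images_per_movie = movie_len_s // interval_s
library_size = 1000

print(images_per_movie, images_per_movie * library_size)  # 720 720000
```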

Edited by rbjtech

rbjtech

Interestingly, on my system single thread is actually faster lol.   I repeated the test a few times and got the same results (within a few hundredths of a second).

Source was one nvme drive and destination was another, for reference.  CPU is an i7-12700K.

#    Test                            Duration
3.5  Quick Extraction, 1 Thread     5.73s
3.6  Quick Extraction, All Threads  7.05s

So yeah, in summary: it's orders of magnitude quicker than the classic extraction regardless of what CPU/IO you have .. ;)

Thanks @softworkz - an interesting exercise.

edit - same nvme source/destination - fractionally faster, but maybe because it was in cache? ..

#    Test                            Duration
3.5  Quick Extraction, 1 Thread     5.68s
3.6  Quick Extraction, All Threads  6.97s
Edited by rbjtech

7 minutes ago, rbjtech said:

Source was one nvme drive and destination was another for reference.  cpu an i7-12700K 

I have an i7-11700. The first run had the source on a spinning HD and was 2s slower, but subsequent runs from that HD gave the same results as later runs from SSD. So, to really assess the IO vs. CPU relation, you'd need to run on some 50GB files to be sure that there are no caching effects involved.

BTW, the destination shouldn't really matter for this. I mean, this is creating 63 jpg files with a size of only 1 MB in total.

15 minutes ago, rbjtech said:

Interestingly - on my system - single thread is actually faster lol. 

I have an explanation for this:

Your Alder Lake CPU has two kinds of cores (P and E). When ffmpeg parallelizes decoding using "per-frame threading", it may employ each core for one frame. IDR frames (key frames) always need to be decoded first, so these will probably get a P core. Then there are 7 P cores left for 7 frames, and another 4 frames might get E cores. The E cores are slower, so when the 8 frames are done, the other 4 frames are still being processed, and even if those were switched to P cores to finish, it's already too late.

I would predict that if you specify -threads 8, you will get equal times for 3.5 and 3.6.


It should use P cores unless you're out of them, and reserve the E cores for Windows and other low-use or low-priority background tasks. This works much better on Windows 11 than on 10 or Linux. The Linux kernel at present decides when to use P or E cores via the ITMT/Turbo Boost Max 3.0 driver, which relies on information exposed by the firmware but ends up favoring P cores for most things. Most people would probably find that binding the OS and background services to the E cores is all the CPU they need, while giving all the P cores to apps and user services. Running CPU-based RAID or ZFS will likely need a manual adjustment.  If you're not maxing out the E cores and don't have any bottlenecks reading/writing files or feeding/retrieving data from the GPU, this isn't a bad way to make use of P & E cores on systems not yet tuned for them.
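On Linux, that kind of pinning can be sketched with CPU affinity. This is a hypothetical illustration; which CPU numbers map to P-core threads is machine-specific (check lscpu), and 0-15 on a 12700K is an assumption:

```python
# Hypothetical sketch: restrict the current process (and any children it
# spawns, e.g. ffmpeg) to a chosen CPU set. Linux only.
import os

def pin_to_cpus(cpus) -> None:
    """Restrict this process to the given CPU numbers."""
    os.sched_setaffinity(0, set(cpus))   # pid 0 = the current process

# pin_to_cpus(range(16))                 # e.g. the P-core hardware threads
# subprocess.run(...)                    # children inherit the affinity
```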

Carlo


24 minutes ago, cayars said:

It should use P cores unless you're out of them and reserve E cores for Windows and other low use or low priority background tasks.

FFmpeg determines the number of cores to decide how many threads to use. It doesn't care about what kinds of cores there might be.
So, all cores will be used. The E cores are slower, and due to the per-frame quantization of the workload, this slows down the whole process.


rbjtech
14 hours ago, softworkz said:

FFmpeg determines the number of cores to decide how many threads to use. It doesn't care about what kinds of cores there might be.
So, all cores will be used. The E cores are slower, and due to the per-frame quantization of the workload, this slows down the whole process.

I ran it again (the system was 3-4% loaded at the time):

with 1 thread   - 5.92s
with 4 threads  - 5.10s
with 8 threads  - 5.63s
with 16 threads - 7.72s
with 20 threads - 8.62s

 

