
Multiple Thumbnail Extract Processes


runtimesandbox


14 hours ago, softworkz said:

FFmpeg determines the number of cores to decide how many threads to use. It doesn't care about what kind of cores there might be.
So, all cores will be used. The E cores are slower, and due to the quantization of the workload per frame this slows down the whole process.

Keep in mind the OS determines what kind and number of cores are shown or available to a process. You can already choose not to use E cores for a given process, or bind a process to a specific core, just as you always could. OSes will keep getting smarter about making the right choice of P vs E core. A well-tuned OS could, for example, use 2 E cores for a set of threads it notices doesn't need P cores, while other threads in the same app get P cores when the OS sees they can use as much performance as possible.

We are getting closer and closer to OSes that auto-optimize a lot of things, such as running something on a GPU even if it wasn't compiled specifically to do so, or using DPUs, additional instruction sets on CPUs, etc. A lot of functionality compiled to run on general CPUs is already offloaded by most OSes when dedicated hardware or instruction sets are available. This includes memory access, caching, cryptography, CRC checksums (every packet sent or received), RAID calculations, etc.


31 minutes ago, rbjtech said:

I ran it again (system was 3-4% loaded at the time)

with 1 thread - time was 5.92

with 4 threads - time was 5.10

with 8 threads - time was 5.63

with 16 threads - time was 7.72

with 20 threads - time was 8.62

Hey, that perfectly matches my prediction 🙂 

One needs to count physical cores here, not Intel's HT blabla - that's why I had written "8". As soon as E-cores come into play, it gets slower.

Thanks for trying.
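For anyone who wants to reproduce this kind of measurement, here is a minimal sketch (assuming ffmpeg is on PATH; the file name is a placeholder, and decoding to the null muxer is used as a stand-in for the extraction decode load):

import subprocess
import time

INPUT = "sample.mkv"  # hypothetical test file

for threads in (1, 4, 8, 16, 20):
    start = time.perf_counter()
    # Decode the video with N decoder threads and discard the output.
    subprocess.run(
        ["ffmpeg", "-threads", str(threads), "-i", INPUT, "-an", "-f", "null", "-"],
        check=True, capture_output=True,
    )
    print(f"{threads:>2} threads: {time.perf_counter() - start:.2f}s")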


On 11/17/2022 at 5:41 PM, softworkz said:

 

Once again: Image extraction is NOT a matter of CPU or GPU resources.

This is an assumption that Emby cannot make.

Playback issues caused by such situations (e.g. stuttering playback) are hard to identify, and typically users won't even make the effort to identify such cases - they will rather conclude that Emby is not working properly and has playback problems. But the same users would have activated all kinds of "parallel image extraction" options beforehand. That's why we don't offer any. Better a very small number of users dissatisfied about non-parallel image extraction than a large number of users who think that Emby doesn't work well.

I don't think you're understanding what we are saying. It doesn't matter how efficient or not the process is on the CPU or GPU (but we know it is efficient). Processing a media file takes roughly 6 seconds (on a hypothetical PC). That's 10 media files a minute, or 600 per hour. Since there are still plenty of resources available, instead of processing 1 media file at a time we start on two different files. Now we are processing 20 per minute, or 1200 per hour. Double that to 4 processes and you get 40 per minute, or 2400 per hour. With 8 processes you have 80 per minute, or 4800 per hour. If it would have taken 4 days to process the files using a single instance, then with 8 processes running you're looking at roughly half a day.

In the example I showed earlier, I let Kubernetes manage creating these processes, and it created enough of them that the 55 days' worth of work (at 1 instance) was complete in less than a day. It started in the late afternoon and was finished before noon the next day, when I remembered to check on it.

Note: I'm using the term affinity sort of generically to mean a way of controlling how something runs compared to other processes. There are actually two ways to do this: one is "priority" and the other is "affinity". Technically, adjusting affinity means adjusting which cores the process can run on, while setting the priority controls the advantage or disadvantage of the task vs other tasks running. Priority is what would be used to make the background tasks only run when system resources are available, so the main processes do not slow down. Affinity is useful to lock programs/processes to specific CPU cores. An example use of this is setting the affinity of ffmpeg so it only uses P cores and not E cores, or specifying an affinity of cores 0 to 7 vs all 32 cores. That way the CPU cache will be far more useful than if threads are rescheduled all over the cores, which is the usual behavior, since the OS will normally assign work to the least busy core.
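As an illustration of those two knobs, here is a minimal sketch using the cross-platform psutil package (an assumption - Emby itself is .NET, and cpu_affinity is not available on macOS; the command line and core numbers are placeholders):

import subprocess
import psutil

# Launch an extraction as a child process (hypothetical ffmpeg invocation).
proc = subprocess.Popen(
    ["ffmpeg", "-i", "movie.mkv", "-vf", "fps=1/10", "out%08d.jpg"])
p = psutil.Process(proc.pid)

# Affinity: pin the process to specific cores (e.g. P-cores 0-7).
p.cpu_affinity([0, 1, 2, 3, 4, 5, 6, 7])

# Priority: let it run on any core, but only when nothing else needs the CPU.
if psutil.WINDOWS:
    p.nice(psutil.IDLE_PRIORITY_CLASS)
else:
    p.nice(19)  # weakest scheduling priority on Unix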

As I've repeated, every one of these background processes would be set to run at a low priority setting. This way they do not interfere with other processes that have a higher priority and need for CPU, GPU or IO. You can effectively keep adding additional background processes until you're at roughly 80% CPU use and under 0.5% system wait states. These are conservative numbers, as you could keep adding processes until you start to build up wait states and hit 2 or 3%, but I prefer to keep them under 1%, which keeps the system snappy. Under 1%, and especially under 0.5%, guarantees that you don't have a system bottleneck causing a performance issue. If resources get low or bottlenecked, the wait states will rise.

Running multiple processes while wait states are under 0.5% just makes better use of resources otherwise going to waste. The additional CPU and IO use is a better use of all resources, up to the tipping point, which will definitely show up as processes needing to wait for availability, and the wait-state metric will rise. Having a GUI setting that an admin can specify as the upper limit puts the admin in control, so they could limit use to, say, 4 instances, keeping lots of resources free if they often need to start up and use other programs. If another process were to start, the worst case is that it would hit swap space or need more IO, so the background processes would sit mostly dormant, allowing things with a higher priority to have the majority of resources, while still running and processing. As these processes finish, they free up more resources pretty quickly. If the wait states are over 0.5%, no new instances get started, so the number of background processes drops off.
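A minimal sketch of that admission check (assuming Linux, where psutil exposes an iowait percentage; the thresholds are the ones from the post):

import psutil

MAX_CPU_PERCENT = 80.0    # don't launch new workers above this
MAX_IOWAIT_PERCENT = 0.5  # ...or when IO wait states exceed this

def can_start_another_worker() -> bool:
    cpu = psutil.cpu_percent(interval=1.0)                   # averaged over 1s
    iowait = psutil.cpu_times_percent(interval=1.0).iowait   # Linux only
    return cpu < MAX_CPU_PERCENT and iowait < MAX_IOWAIT_PERCENT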

So, getting back on point: this isn't about making the extraction process more efficient, but about cutting down the overall time to do the extraction. We know running multiple instances works, because a lot of people with large media libraries will set up multiple instances of Emby Server, typically one instance per library, to divide and conquer. We know this works well, but it is time-consuming to set up. If Emby Server could itself fire up additional background processes, there would be no need to run multiple Emby Servers at the same time, and it would be far better, as the processes would be using affinity properly, which the multiple-Emby setup likely isn't doing unless users set it themselves.

Edited by cayars

4 minutes ago, cayars said:

I don't think you're understanding what we are saying.

I don't think you're understanding what I'm saying. The primary concern has always been about storage IO, not about process resources.
(always = always since we have quick extraction)

5 minutes ago, cayars said:

It doesn't matter how efficient or not the process is on the CPU or GPU (but we know it is efficient)

Yes, and it's so efficient that it reads the files so fast that the issue is primarily about IO, not CPU.

CPU resources can be controlled and managed; there exist APIs and methods for this. But not for storage. Every setup is different, and it's impossible to determine the utilization - whether it's at 100% or 10%. You don't know which paths are associated with which physical storage devices, etc. This is endless and not doable in a universal way, not even for a specific target platform, not to speak of all Emby target platforms.

And that brings me back to the core statement: better very few users dissatisfied with the lack of parallel extraction than a large share of users who are dissatisfied with Emby not working smoothly (after having enabled parallel extraction, which could make their IO unusable for playback).


rbjtech

We hit this exact same scenario when writing the Introskip plugin - we gave users the option to use as many parallel threads as they wanted to get parallel detection workloads done. CPU for pure detection work was never an issue, but I/O was, and depending on the strength of the user's system I/O, sometimes even the default of running 4 sets of detection stressed their systems (using a NAS as source storage, for example). I did a lot of work and testing on this to get the balance to what I thought was 'safe' - but as predicted, we had users trying to use Pis that crumbled on the defaults, and people with 32-core CPUs maxing things out, totally bottlenecked on I/O, which then took longer than if they had left it at the default, as they simply couldn't 'stream' the data into ffmpeg efficiently.

So I think we all agree - parallel is good - but giving users manual options to do so in the Core will simply result in support calls about why things are not going faster, blah blah.

My personal view is that if the user is savvy enough to understand what is going on under the hood, see that the CPU utilisation is low, etc. - then they are savvy enough to spin up multiple instances of Emby and split the workload that way.

The alternative is to do this via a plugin, because a) the user consciously needs to install it and accept the consequences, and b) once complete, it actually has no practical use anymore - so the user just removes the plugin.

@Cheesegeezer is always up for a quick Plugin development ... 🤣

 


1 hour ago, rbjtech said:

We hit this exact same scenario when writing the Introskip plugin - we gave users the option to use as many parallel threads as they wanted to get parallel detection workloads done. CPU for pure detection work was never an issue, but I/O was, and depending on the strength of the user's system I/O, sometimes even the default of running 4 sets of detection stressed their systems (using a NAS as source storage, for example). I did a lot of work and testing on this to get the balance to what I thought was 'safe' - but as predicted, we had users trying to use Pis that crumbled on the defaults, and people with 32-core CPUs maxing things out, totally bottlenecked on I/O, which then took longer than if they had left it at the default, as they simply couldn't 'stream' the data into ffmpeg efficiently.

I wonder why some keep responding as if I didn't know what I'm talking about (sigh).

Thanks for sharing your experience! That's the perfect example of what's gonna happen if we allow this. Even @cayars still appears to assume that sufficient CPU is all it takes to parallelize this easily, just like the OP was supposing, and all those users would instantly set parallel extraction to the maximum, thinking it's fine because they have a powerful system.
And eventually they'll wonder why playback doesn't work properly; they'll look at the (low) CPU usage and conclude that the extraction is not the culprit because the CPU usage is fairly low - so Emby must have a problem...
It would be like driving a long way out into the forest in order to shoot yourself in the foot, even though - 1. you could have done this at home right away and 2. sooner or later it will happen anyway by accident... 😜

1 hour ago, rbjtech said:

The alternative is to do this via a plugin, because a) the user consciously needs to install it and accept the consequences, and b) once complete, it actually has no practical use anymore - so the user just removes the plugin.

@Cheesegeezer is always up for a quick Plugin development ... 🤣

Sure, that would be possible...
Unfortunately - while a plugin could use IImageExtractionManager.ExtractVideoImagesOnInterval or ILibraryManager.RefreshThumbnailImages - these APIs will both end up hitting the resource limit (only one parallel extraction).

But I still wonder whether there's really such an urgent need for this. When I look at @cayars's calculation:

4 hours ago, cayars said:

I don't think you're understanding what we are saying. It doesn't matter how efficient or not the process is on the CPU or GPU (but we know it is efficient). Processing a media file takes roughly 6 seconds (on a hypothetical PC). That's 10 media files a minute, or 600 per hour.

When we assume an average of 10 GB per media file and we do 600 media files per hour, that means 6 TB/h or 144 TB/day.

The largest HDs at the moment are 18 TB and cost around $300. For 144 TB you would need 8 HDs, which makes $2400.
Seriously - who would still need parallelization, and for what purpose? For a library of that size, one cannot wait 24h and wants to run 4 extractions in parallel instead? That's $9600 of cost just for the HDs while still running for 24h.

And where does the data come from - that quickly? In the $9600 case, we're talking about more than half a petabyte! That's 24 TB/h, 400 GB/min, 6.6 GB/s or 53 Gbps.
Even if somebody had a connection that fast, there would be no other side able to deliver content at that speed.
And even if the content already existed: a single (normal) server will definitely not be able to process data at such bandwidths, so this once again ends up meaning you would need multiple servers - not a single Emby server with parallel extraction.
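Spelled out, the arithmetic behind those figures (using the 10 GB/file and 6 s/file assumptions from above):

\begin{align*}
3600\,\mathrm{s/h} \div 6\,\mathrm{s/file} &= 600\ \mathrm{files/h}\\
600\ \mathrm{files/h} \times 10\,\mathrm{GB/file} &= 6\,\mathrm{TB/h} = 144\,\mathrm{TB/day}\\
4 \times 144\,\mathrm{TB/day} = 576\,\mathrm{TB/day} &\Rightarrow \lceil 576/18 \rceil = 32\ \mathrm{HDs} \times \$300 = \$9600\\
4 \times 6\,\mathrm{TB/h} = 24\,\mathrm{TB/h} &\approx 6.6\,\mathrm{GB/s} \approx 53\,\mathrm{Gbps}
\end{align*}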

No matter what the exact figures are: somebody who may really need this would be someone who has already spent many, many thousands of dollars on the storage components alone. And AFAIC, such a person can easily pay somebody to create a custom solution (like a set of scripts and batch jobs) tailored to their needs, and should not expect a $120 software or a volunteer developer to do that job - from which just a handful of users would benefit, at best.

Edited by softworkz

rbjtech
51 minutes ago, softworkz said:

Sure, that would be possible...
Unfortunately - while a plugin could use IImageExtractionManager.ExtractVideoImagesOnInterval or ILibraryManager.RefreshThumbnailImages - these APIs will both end up hitting the resource limit (only one parallel extraction).

Ah right - yes, if we are using the API to do it, then that's a great point.

If it was done, then we would probably do it outside the API, essentially at a file level - using a direct external ffmpeg call, passing the relevant syntax.

This is the approach we have taken for the majority of the plugins, such as Introskip, mediainfo, the bif generator, etc. - but it does have its own support issues, as you've witnessed.

Agree 100% with your conclusion - CPUs these days are no longer the bottleneck in any modern system from the last ~5 years or so - it's mass storage that remains on SATA or behind a NAS that is the problem. It will no doubt catch up, and when we have affordable 100TB NVMe drives .. then we can have this conversation again .. haha :)

 

 


34 minutes ago, rbjtech said:

If it was done, then we would probably do it outside the API, essentially at a file level - using a direct external ffmpeg call, passing the relevant syntax.

This is the approach we have taken for the majority of the plugins, such as Introskip, mediainfo, the bif generator, etc. - but it does have its own support issues, as you've witnessed.

We have made preparations to make Emby's process runner implementation accessible to plugins. It works similarly to the .NET implementation, but it frees you from all the quirks and issues that you can (and did) stumble upon, making it work in a way where you can hardly do anything wrong 🙂
(I don't mean the ld path on Linux, but the other things)

This will make life a bit easier for process execution.


On 11/22/2022 at 5:41 AM, softworkz said:

I wonder why some keep responding as if I didn't know what I'm talking about (sigh).

Thanks for sharing your experience! That's the perfect example of what's gonna happen if we allow this. Even @cayars still appears to assume that sufficient CPU is all it takes to parallelize this easily, just like the OP was supposing, and all those users would instantly set parallel extraction to the maximum, thinking it's fine because they have a powerful system.
And eventually they'll wonder why playback doesn't work properly; they'll look at the (low) CPU usage and conclude that the extraction is not the culprit because the CPU usage is fairly low - so Emby must have a problem...
It would be like driving a long way out into the forest in order to shoot yourself in the foot, even though - 1. you could have done this at home right away and 2. sooner or later it will happen anyway by accident... 😜

What you describe should never happen if you're setting up background threads correctly. Did you not see me mention setting affinity? I also mentioned checking the CPU percentage used and making sure wait states are under 0.5%. Done like this, you could have 1 thread or 100 threads all working on different files, and your system should not slow down or have an issue. The affinity setting makes sure they only run when there are spare resources available and always take a back seat to the normal processes and threads running on the whole machine. If a transcode job was started using only CPU and there were 10 background threads running, they would slow way down, because the transcode job running at normal affinity gets use of resources first, before the background tasks. As the background tasks finish (slower than normal), they wouldn't get restarted, because the CPU percent in use is higher than our limit.

You check the wait states, CPU and memory before starting up a new background process, to keep from starting something new if system resources are already over a limit. You could, if needed or wanted, also check NIC/LAN percent used/available and specific disk utilization, as well as any other potential bottleneck.

On 11/22/2022 at 5:41 AM, softworkz said:

Sure, that would be possible...
Unfortunately - while a plugin could use IImageExtractionManager.ExtractVideoImagesOnInterval or ILibraryManager.RefreshThumbnailImages - these APIs will both end up hitting the resource limit (only one parallel extraction).

But I still wonder whether there's really such an urgent need for this. When I look at @cayars's calculation:

When we assume an average of 10 GB per media file and we do 600 media files per hour, that means 6 TB/h or 144 TB/day.

The largest HDs at the moment are 18 TB and cost around $300. For 144 TB you would need 8 HDs, which makes $2400.
Seriously - who would still need parallelization, and for what purpose? For a library of that size, one cannot wait 24h and wants to run 4 extractions in parallel instead? That's $9600 of cost just for the HDs while still running for 24h.

And where does the data come from - that quickly? In the $9600 case, we're talking about more than half a petabyte! That's 24 TB/h, 400 GB/min, 6.6 GB/s or 53 Gbps.
Even if somebody had a connection that fast, there would be no other side able to deliver content at that speed.
And even if the content already existed: a single (normal) server will definitely not be able to process data at such bandwidths, so this once again ends up meaning you would need multiple servers - not a single Emby server with parallel extraction.

No matter what the exact figures are: somebody who may really need this would be someone who has already spent many, many thousands of dollars on the storage components alone. And AFAIC, such a person can easily pay somebody to create a custom solution (like a set of scripts and batch jobs) tailored to their needs, and should not expect a $120 software or a volunteer developer to do that job - from which just a handful of users would benefit, at best.

Not sure how you got 6 TB/hour or 144 TB/day from my numbers. If the streams average 3 Mbit/s, that's 3/8 = 0.375 MB/s (bits/8 = bytes), or 22.5 MB/min, or 1.35 GB/hour. If the average length per media file is 40 minutes, then each media file is 1.35/60*40 = 0.9 GB in size.

Our numbers are quite different from each other: $189 vs $9600. Even if you doubled the average file bitrate from 3 Mbit to 6 Mbit, I'd still only be at $378 in storage costs for one day's processing.

That's using 6 seconds per file. It could be that lower-bitrate files process faster and might require only 4 seconds, so there are variables. What isn't a variable, however, is that each person could test to see how many MB they can process per minute. They can also look up the total size of all the stored media they have and calculate how long it would take. Total storage and total number of minutes could be calculated from the database pretty easily for all local media. Stub/strm files won't be counted, but they also aren't processed, so that works out. I thought there was a plugin that gives you both these numbers, but I couldn't find it with a quick look.

With a little SQL run on my system, I have:
average bitrate = 1.666 Mbit/sec, or 208.3 KB/sec
average file size = 941.2 MB, or 0.94 GB

For rough calculations, that's about 1000 files per TB of storage.

950K files would use just about a petabyte of storage. That's fifty 22 TB drives.
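For anyone wanting the same numbers from their own library, a sketch of that kind of query (the database path, table and column names here are hypothetical - check your actual Emby database schema first):

import sqlite3

# Hypothetical schema: a media-items table with size (bytes) and bitrate (bits/s) columns.
con = sqlite3.connect("library.db")  # path to the library database (assumption)
row = con.execute(
    """SELECT AVG(TotalBitrate) / 1e6 AS avg_mbit,
              AVG(Size) / 1e9         AS avg_gb,
              SUM(Size) / 1e12        AS total_tb,
              COUNT(*)                AS files
       FROM MediaItems
       WHERE Size IS NOT NULL"""
).fetchone()
print("avg bitrate: %.3f Mbit/s, avg size: %.2f GB, total: %.1f TB over %d files" % row)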
 


On 11/22/2022 at 3:49 AM, rbjtech said:

So I think we all agree - parallel is good - but giving users manual options to do so in the Core will simply result in support calls about why things are not going faster, blah blah.

Did you happen to do the 3 things I've mentioned?
  1. Set your background processes to the lowest priority/affinity.
  2. Don't start an additional background process if CPU is over X%.
  3. Don't start an additional background process if wait states are over 0.5%.

If not, you may want to try adding these to see how much better it works.


59 minutes ago, cayars said:

Not sure how you got 6 TB/hour or 144 TB/day from my numbers. If the streams average 3 Mbit/s, that's 3/8 = 0.375 MB/s (bits/8 = bytes), or 22.5 MB/min, or 1.35 GB/hour. If the average length per media file is 40 minutes, then each media file is 1.35/60*40 = 0.9 GB in size.

Like I wrote above, I assumed 10GB per file.

A file size of 0.9 GB for 40 min is quite low quality, but okay. The 6 s was your figure. What doesn't change is that an hour has 3600 seconds; divided by 6 s, that means 600 files per hour.

600 * 0.9 GB = 540 GB per hour and that's 13 TB per day. That's the value for non-parallel extraction, but as parallel extraction is asked for, I used 4 parallel extractions as an example above. So we still get 4 * 13 TB = 52 TB per day, for which you will need 3 HDs of 18 TB and that's still 3 * $300 = $900 storage cost for a single day of data processing.

Edited by softworkz

1 hour ago, cayars said:

What you describe should never happen if you're setting up background threads correctly. Did you not see me mention setting affinity? I also mentioned checking the CPU percentage used and making sure wait states are under 0.5%. Done like this, you could have 1 thread or 100 threads all working on different files, and your system should not slow down or have an issue. The affinity setting makes sure they only run when there are spare resources available and always take a back seat to the normal processes and threads running on the whole machine. If a transcode job was started using only CPU and there were 10 background threads running, they would slow way down, because the transcode job running at normal affinity gets use of resources first, before the background tasks

One last time: It's about IO, not about CPU.


rbjtech
8 hours ago, cayars said:

Did you happen to do the 3 things I've mentioned?
  1. Set your background processes to the lowest priority/affinity.
  2. Don't start an additional background process if CPU is over X%.
  3. Don't start an additional background process if wait states are over 0.5%.

If not, you may want to try adding these to see how much better it works.

tbh Carlo, I don't have the time to invest in a detailed investigation into this - nor do I frankly see the need to. 

I believe the OP has an answer to this FR - yes, it's possible, but the reasons why it's not a simple answer have been explained to death above .. ;)

Edited by rbjtech

11 hours ago, softworkz said:

Like I wrote above, I assumed 10GB per file.

A file size of 0.9 GB for 40 min is quite low quality, but okay. The 6 s was your figure. What doesn't change is that an hour has 3600 seconds; divided by 6 s, that means 600 files per hour.

600 * 0.9 GB = 540 GB per hour and that's 13 TB per day. That's the value for non-parallel extraction, but as parallel extraction is asked for, I used 4 parallel extractions as an example above. So we still get 4 * 13 TB = 52 TB per day, for which you will need 3 HDs of 18 TB and that's still 3 * $300 = $900 storage cost for a single day of data processing.

0.9 GB is an average across all media. I do have about 1500 older TV shows in SD, which helps bring the overall average down quite a bit. I would not be too happy with 1080p files at that size.

Not sure why the multiplier was used, or why the bitrate or file size really matters. At the end of the day there are X files the user already has (optimized or not), and that doesn't change. The only thing that changes is how long it takes to process them.

3 hours ago, rbjtech said:

tbh Carlo, I don't have the time to invest in a detailed investigation into this - nor do I frankly see the need to. 

I believe the OP has an answer to this FR - yes, it's possible, but the reasons why it's not a simple answer have been explained to death above .. ;)

It's not complex. Pulling the wait states and CPU utilization is pretty easy, but different on Windows vs Linux. Then you just make sure you're under X percent before launching a new process.

11 hours ago, softworkz said:

One last time: It's about IO, not about CPU.

I'm quite aware of this and have said it myself over and over, but you're not hearing me. :) That is why you check wait states. As long as your IO and other related resources are in good standing, you won't have a queue of threads building up, waiting to execute. When threads do build up from a lack of resources such as IO, the wait states will start climbing.

Monitoring wait states is the easy way to make sure there is no IO problem (or any other). Combined with running the threads at low priority, it stops you from launching them when the machine is approaching a point where more load wouldn't be desirable. The background processes won't be using resources that other, higher-priority threads need, so these threads would only compete against each other. Besides what I have mentioned, another way you could determine the optimal number of background threads for the available IO is tracking the total bytes being processed per second (or per 5 seconds). When a new thread hardly changes the bytes processed, you're at the maximum number and can back off adding more and let 1 or 2 finish. With slight experimenting, you figure out the bytes per second the initial process is able to handle. Then, when you add new tasks, if a task doesn't add at least X (e.g. 50%) of the bytes compared to the first stream, you back off adding more.
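A sketch of that marginal-throughput check (the helper and its inputs are hypothetical; it assumes you can sample the total bytes read by all workers per second):

def should_add_worker(rate_before: float, rate_after: float,
                      single_stream_rate: float, min_gain: float = 0.5) -> bool:
    """Keep adding workers only while the newest one added at least
    min_gain (e.g. 50%) of a single stream's byte rate."""
    gain = rate_after - rate_before
    return gain >= min_gain * single_stream_rate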

Thus, there will not be an IO issue.

Try playing around with slowly increasing resource (like IO) usage while watching the wait states in the OS, and you'll see exactly how this works.


There are no competing threads. There is only one thread that does an extraction.

I also mentioned above that it's possible to find ways to control IO on the one specific system you are working with, but there is no way to create a universal, generic implementation for this that works on all platforms and all kinds of hardware and setups.

Edited by softworkz

What I described was mainly about using multiple threads/processes for the same task, to divide and conquer.
But it would actually be a good idea to do this in several places in Emby where a single process runs.

Many low-end systems right now have issues with daily use when some "background" tasks are running, such as image extraction, detect episode intros, download subtitles, scan media library, scan metadata folder, metadata updates, convert media, transfer media, refresh Internet Channels, or refresh guide. There are other jobs as well, depending on the plugins installed. Much of this can be set up to run only at night from the scheduled tasks menu, which helps, but while the nightly jobs are running the system is too slow to be used effectively.

Running many of the jobs just mentioned with a lower priority, and with affinity set to a single core, would all but prevent this from happening, as the main foreground processes get priority and resource usage first.

You can set the affinity and priority (the 2 items that control how the process runs) via .NET as you create the process, so there isn't anything complex about it.

You shouldn't need to do anything different based on the OS to use affinity and priority classes. In .NET, the processor affinity and priority class for all threads and child processes can be set using the Process.ProcessorAffinity and Process.PriorityClass properties. Using a job object, you also have JOB_OBJECT_LIMIT_AFFINITY and JOB_OBJECT_LIMIT_PRIORITY_CLASS. Note: the big difference is that limits on the job object are just that - limits - while setting the process priority always has an immediate effect. It's not the same to limit the process priority to high (job object) and to set it to a high PriorityClass.

 

 


38 minutes ago, cayars said:

Many low-end systems right now have issues with daily use when some "background" tasks are running, such as image extraction, detect episode intros, download subtitles, scan media library, scan metadata folder, metadata updates, convert media, transfer media, refresh Internet Channels, or refresh guide. There are other jobs as well, depending on the plugins installed. Much of this can be set up to run only at night from the scheduled tasks menu, which helps, but while the nightly jobs are running the system is too slow to be used effectively.

This doesn't fly. Emby is using async patterns everywhere, and these are implemented in .NET with thread pools. There is nothing like a single thread that performs a single task only. There's a pool of threads, and a certain task is typically executed by many different threads - only one at a time, but the actual one can change frequently.
(by task, I mean the .NET async programming concept of a Task)

For any .NET application, fiddling around with process and thread priority and affinity is one of the worst ideas you could have. As soon as you have anything beyond a very simplistic test application, playing around with these things will lead to deadlocks and hangs of the application.

Further, it's anything but safe to assume that those APIs you mentioned have the same effect on all platforms. There may be a unified API to control these things, but Windows has a quite different kind of task scheduler than Linux, so it is guaranteed to work differently.

Finally, I don't understand why we are suddenly talking about slowing down operations. We have no problem doing that, and many ways to do it.

I actually know the things you mentioned quite well, but only on Windows. I have heard and read that it works much better on Linux, but on Windows, those settings can completely destabilize your system. I've been there.
For a complex multi-threaded application, this is a strict no-go, neither at the process nor at the thread level. For simple processes, e.g. a long-running batch job, it may work better, but it depends on the resources that the process accesses. For example, when the process locks a certain resource, and other processes come and want to access the same resource, the deprioritized process might not get any time slices anymore to continue and unlock the resource: the other processes get priority, but they are stuck waiting for the resource, while the limited process doesn't get a chance to proceed and unlock it (at least not quickly).

Another problem is that, despite the affinity setting, a process will still "see" all CPU cores, so it might "optimize" itself for the number of cores in the system, not knowing that it has been restricted to a single core. This false assumption by the app can easily lead to deadlocks, or to bad, ineffective runtime behavior.
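That mismatch is easy to demonstrate (a sketch; on Linux, os.sched_getaffinity reports the cores the process is actually allowed to use, while os.cpu_count reports all of them):

import os

print("cores in the system: ", os.cpu_count())
# On Linux: the set of cores this process is actually allowed to run on.
print("cores we may run on: ", len(os.sched_getaffinity(0)))
# Code that sizes its thread pool from cpu_count() will oversubscribe its
# (possibly single) allowed core whenever the affinity has been restricted.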


  • 1 year later...
hadim

If multi-threading the image extraction process for one media file does not bring that much performance, what about parallelizing the process across multiple media files? Or at least proposing it as an option, since I/O could quickly become the bottleneck here (especially if the media library is on an NFS server).

What do you think?


rbjtech
19 hours ago, hadim said:

If multi-threading the image extraction process for one media file does not bring that much performance, what about parallelizing the process across multiple media files? Or at least proposing it as an option, since I/O could quickly become the bottleneck here (especially if the media library is on an NFS server).

What do you think?

Hi - please read the thread above - it goes into a lot of technical detail as to why 'theory vs actual' is very different, and why the bottleneck in any system is usually the storage.

In summary, if you want to run parallel extract processes on a large initial install - then just spin up 'n' instances of Emby (all running on different ports) and run an extract on individual libraries per instance. Unless you are running SSD/NVMe storage for all your media, you will soon hit a bottleneck on storage as they all start competing for I/O. Any modern CPU from the last 5 years is going to be waiting for I/O - the CPU (and GPU) is not the bottleneck. Will multiple instances be faster? Possibly/probably - but at the expense of making your system I/O-bound for all other uses, i.e. it will no longer be responsive as a media server (its primary purpose).

 

Edited by rbjtech

hadim

Thanks for the answer, and yes, I read most of the posts above xD.

Quote

if you want to run parallel extract processes on a large initial install - then just spin up 'n' instances of Emby

Nice trick! I like the idea, but I also find it a bit hacky. Just to be sure it works: do you have some kind of file lock on a media file being processed, so that two instances cannot process the same file at the same time?

 

Quote

you will soon hit a bottleneck on storage as they all start competing for I/O.   

Yes, I am well aware of that bottleneck, but so far and at least on my machine, IO is not the bottleneck when using 1 single thread.

 

Quote

Will multiple instances be faster? Possibly/probably - but at the expense of making your system I/O-bound for all other uses, i.e. it will no longer be responsive as a media server (its primary purpose)

 

Yes, it will be 100% (unless your machine has really slow IO), at least for a few instances/processes (not threads!). But I got your point about slowing down the whole media server during that processing time, so I think you guys have chosen a good default by setting threads=1.

That being said, maybe a proposal would be to implement multi-instance/multi-process logic (processing multiple media files in parallel) and only enable it either on demand or dynamically, depending on server usage. Just processing two files at the same time instead of one will probably cut the entire processing time almost in half.

 

I also completely understand this task might not be your priority and might not be that easy to implement (I am not familiar with the Emby source code). So I propose to write a script that performs that process in parallel, independently of Emby, and then share it open source (as a Docker image maybe?). I already know how to extract images from a movie, but I am lacking the part that converts the images to a BIF file. If you can show me how that process is done, that would be wonderful.

 

Thanks again for your answer.


rbjtech
4 minutes ago, hadim said:

Thanks for the answer, and yes, I read most of the posts above xD.

Nice trick! I like the idea, but I also find it a bit hacky. Just to be sure it works: do you have some kind of file lock on a media file being processed, so that two instances cannot process the same file at the same time?

This is why you do it on separate libraries. You also obviously need to save the BIF files alongside the media. This option does not work if you choose to save the BIF files locally, for obvious reasons.

6 minutes ago, hadim said:

I also completely understand this task might not be your priority and might not be that easy to implement (I am not familiar with the Emby source code). So I propose to write a script that performs that process in parallel, independently of Emby, and then share it open source (as a Docker image maybe?). I already know how to extract images from a movie, but I am lacking the part that converts the images to a BIF file. If you can show me how that process is done, that would be wonderful.

Attached is a version of my original Windows batch script that converted HDR BIFs into SDR BIFs - you'll obviously need to modify it - but it shows the process:

  1. Extract the JPGs from the source file
  2. Some re-ordering magic to get the JPGs numbered correctly (the ffmpeg numbering does not work with biftool)
  3. Use biftool (see Roku's website to get it) to create the BIF
  4. Move the BIF to the media folder

Run the above multiple times, using multiple source shares/folders, and you have a solution. Note: it's step 1 that will demand the I/O - as long as you write to NVMe in steps 2 and 3, I/O should not be an issue.
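A sketch of those steps in Python (the file names are placeholders, and the exact renumbering scheme and biftool invocation are assumptions - check Roku's biftool documentation):

import glob
import os
import subprocess

SRC = "movie.mkv"   # hypothetical source file
WORK = "frames"     # scratch dir, ideally on NVMe (per the note above)
os.makedirs(WORK, exist_ok=True)

# Step 1: extract one thumbnail-sized JPG every 10 seconds.
subprocess.run(["ffmpeg", "-i", SRC, "-vf", "fps=1/10,scale=320:-1",
                os.path.join(WORK, "tmp%05d.jpg")], check=True)

# Step 2: renumber to zero-based, zero-padded names (ffmpeg starts at 1).
for i, path in enumerate(sorted(glob.glob(os.path.join(WORK, "tmp*.jpg")))):
    os.rename(path, os.path.join(WORK, "%08d.jpg" % i))

# Step 3: create the BIF (flags vary - see Roku's docs).
subprocess.run(["biftool", WORK], check=True)

# Step 4: move the resulting .bif next to the media file (left as an exercise).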

If you create something, please consider sharing it here :)

hdrbif.bat.txt


2 hours ago, hadim said:

at least on my machine, IO is not the bottleneck when using 1 single thread

Because we are not allowing the system to be brought to its knees by this process.  It still needs to function as the actual media server.


hadim

Here is a small Python script that does the job. You must have ffmpeg and biftool installed. It should be easy to put everything in a Docker image, but I haven't done it.

See below for how to use it:

python thumbget.py ~/Documents/Temp_Thumbget/ \
--jobs 4 \
--threads 4 \
--width 320 \
--height 180 \
--interval 10 \
--extension ".mp4" \
--extension ".mkv"

The ffmpeg command is the simplest I have found (you can tweak it to your liking).

I did a simple test on 4 movies:

- threads=1 and jobs=1 -> ~60 minutes
- threads=4 and jobs=4 -> 9 minutes

These tests were done locally, not over NFS, but I would expect a similar performance increase.

I still think it's worth bringing something similar to Emby as long as the default is set to threads=1 and jobs=1, so more advanced users can decide whether they want to speed up the process without bringing their server to its knees (since it also depends on the actual specs of the machine running the server).

In the meantime, I hope this script can be useful to some!

thumbget.py


22 minutes ago, hadim said:

- threads=1 and jobs=1 -> ~60 minutes

And now, measure the time that Emby needs to extract the thumbnails from those 4 videos...

22 minutes ago, hadim said:

I still think it's worth bringing something similar to Emby as long as the default is set to threads=1 and jobs=1

Then you will understand why we're not gonna bring this to Emby.

Edited by softworkz

rbjtech

tbh, no idea why it's taking 60 minutes to do 4 files with your script - even my batch file was not that slow - it was probably 4-5 minutes for a typical 30 GB 4K remux file.

If you turn on debug logging in Emby, you can actually see the details and times for each file - the logs are called 'quick-extract-xxx.txt'.

It has the start time at the top - and I used the file write time to see when it ended (note to @softworkz, an end timestamp in the log would be nice ;) )

Same file using Emby - within 1 second of my batch file's time, primarily because it uses an identical ffmpeg command line .. 🤔

So 4 files of this size = ~20 mins.

This is also a very lengthy process because it's doing tone mapping - given a normal 'typical Emby user' SDR file of 2 GB, it took < 1 second!

Edited by rbjtech
