Library Scanning Performance

September 12, 2024

Now that the connection pools are a stable feature in Jellyfin, after a few hiccups in the initial releases, I’ve set up the same library in fresh Emby & Jellyfin instances to compare the server performance.

As expected, the library scans seem to be a lot faster in JF than Emby, with almost a 4x difference in timings. The initial library creation took ~1 month with Emby and only ~1 week in JF with the parallel scan threads set to 4. The daily library scans show similar numbers, with JF taking ~15 mins per scheduled scan task and Emby taking ~1 hour for the same.

September 12, 2024

25 minutes ago, adminExitium said:

Now that the connection pools are a stable feature in Jellyfin, after a few hiccups in the initial releases, I’ve set up the same library in fresh Emby & Jellyfin instances to compare the server performance.

As expected, the library scans seem to be a lot faster in JF than Emby, with almost a 4x difference in timings. The initial library creation took ~1 month with Emby and only ~1 week in JF with the parallel scan threads set to 4. The daily library scans show similar numbers, with JF taking ~15 mins per scheduled scan task and Emby taking ~1 hour for the same.

Indeed, I created a test server a few weeks ago, and adding content from scratch took 3 days in JF vs 12-14 days in Emby. I would say 3 times faster, but it depends on your HW. I have limited the connections so as not to degrade the server's performance.

Library Scan:

JF 11 min (with an extra library compared to Emby)

Emby 1h 4m

Edited September 12, 2024 by shocker

September 12, 2024

@adminExitium @shocker - I don't doubt this, but this isn't related to database performance in any way.

September 12, 2024

Library scan performance is simply a matter of which actions are performed during a library scan and which are not. It's easy to make it faster by omitting certain actions, but that also means that those need to be done at a later time, when they might delay playback, for example.

But other concepts exist for doing it in a more decoupled or interleaved way. I proposed these years ago, so there is indeed room for improvement.

September 12, 2024

3 hours ago, softworkz said:

this isn't related to database performance in any way

I posted this as one of the benefits of the newly added connection pooling there. Parallel scans have been available as a feature for quite some time now, but they had their problems because DB access was still single-threaded, causing SQLITE_BUSY errors or the DB being locked for other processing. With the pooling in the latest release this is no longer an issue, which allows the scans to scale almost infinitely, limited only by your directory listing speeds (and internet metadata rate limits, if you have any metadata providers enabled).
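To illustrate the idea of pooled connections for parallel scan threads, here is a minimal Python sketch. It is not Jellyfin's implementation (which is not written in Python); the `ConnectionPool` class, the `items` table, and the `demo` helper are all hypothetical names made up for this example. It assumes WAL journal mode and a busy timeout, so that a thread which finds the database locked waits for its turn instead of failing with SQLITE_BUSY. SQLite still allows only one writer at a time, so the pool removes errors and lock-outs rather than making writes truly concurrent.

```python
import os
import queue
import sqlite3
import tempfile
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool of SQLite connections (illustration only)."""

    def __init__(self, path, size=4, busy_timeout_s=5.0):
        self._size = size
        self._pool = queue.Queue()
        for _ in range(size):
            # timeout= makes a blocked connection wait instead of raising at once.
            conn = sqlite3.connect(path, check_same_thread=False, timeout=busy_timeout_s)
            conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
            self._pool.put(conn)

    @contextmanager
    def connection(self):
        conn = self._pool.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._pool.put(conn)

    def close(self):
        for _ in range(self._size):
            self._pool.get().close()


def scan_worker(pool, paths):
    """Simulated scan thread: record each path, borrowing a pooled connection."""
    for path in paths:
        with pool.connection() as conn:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT OR IGNORE INTO items(path) VALUES (?)", (path,))


def demo(num_threads=4, per_thread=25):
    """Run several scan threads against one database and return the row count."""
    with tempfile.TemporaryDirectory() as tmp:
        pool = ConnectionPool(os.path.join(tmp, "library.db"), size=num_threads)
        with pool.connection() as conn:
            conn.execute("CREATE TABLE items (path TEXT PRIMARY KEY)")
        threads = [
            threading.Thread(
                target=scan_worker,
                args=(pool, [f"t{t}-item{i}" for i in range(per_thread)]),
            )
            for t in range(num_threads)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        with pool.connection() as conn:
            count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
        pool.close()
        return count
```

Calling `demo()` starts four threads that each insert 25 rows through the pool and then returns the number of rows stored.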

2 hours ago, softworkz said:

It's easy to make it faster by omitting certain actions

These scans are performed multiple times on a read-only copy of the media, where I’ve ensured that nothing has changed between each scan. So it is basically just listing (and then comparing to the DB items) sequentially vs. in parallel that is causing the slowness.

September 12, 2024

51 minutes ago, adminExitium said:

limited only by your directory listing speeds (and internet metadata rate limits, if you have any metadata providers enabled).

That's one of the things I'm talking about. It takes quite some time until all metadata providers have run (in a typical setup). Those are not just internet requests but also local ffprobe executions, for example. This takes time for each item, and that's why I'm saying that DB performance is irrelevant: it is not the limiting factor.

September 12, 2024

1 hour ago, softworkz said:

local ffprobe execution

Yes, I understand that, which is why these runs were done one after the other after ensuring no change in the media seen by Emby or JF. Or does Emby do ffprobes and run other metadata providers even if nothing changes?

This comes back to something I’ve mentioned before too about the time taken to list directories, since both Emby/JF recurse into the nested directories too rather than checking just the root and the directory timestamps:

In my case, since the library is presented as a combination of multiple drives (which is probably the case for any large library here), doing the listing in parallel will be of significant benefit since they will hit different drives rather than doing each drive sequentially. The same performance benefits will also apply to any networked filesystems like Ceph, NFS or SMB too.
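As a rough illustration of listing several drives in parallel, here is a minimal Python sketch. It is not how Emby or Jellyfin implement their scans; `list_tree` and `list_roots_in_parallel` are hypothetical helpers, and the assumption is that each root lives on a different physical drive so the listings do not compete for the same disk.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_tree(root):
    """Recursively collect the file paths below one library root."""
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)  # descend into nested folders
                else:
                    found.append(entry.path)
    return found


def list_roots_in_parallel(roots, max_workers=4):
    """List every root in its own worker thread; returns {root: [file paths]}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(roots, pool.map(list_tree, roots)))
```

With `max_workers=1` this degrades to the sequential behaviour, which mirrors the idea of a parallel scan setting that defaults to 1.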

Anyway, don't want to derail the thread any further if you think it's unrelated.

September 12, 2024

@adminExitium @shocker - I've split this out into its own topic

September 12, 2024

15 minutes ago, adminExitium said:

Yes, I understand that, which is why these runs were done one after the other after ensuring no change in the media seen by Emby or JF. Or does Emby do ffprobes and run other metadata providers even if nothing changes?

What are we talking about here? I was talking about an initial scan of a new library where everything is new and items are written to the database.

I thought you would be talking about the same, because you were talking about DB access issues and locks etc.

If you are now talking about re-scanning files in an existing library, then this doesn't apply because there are just DB reads, no writes and no locks. When no processing of new files is involved, then yea - it might be possible to optimize this further, but we are really just scratching the surface here and care needs to be taken not to worsen other cases.

26 minutes ago, adminExitium said:

This comes back to something I’ve mentioned before too about the time taken to list directories, since both Emby/JF recurse into the nested directories too rather than checking just the root and the directory timestamps:

This is not a reliable mechanism (I'm afraid, I have no time to read through that other topic right now) because:

- It does not work on all file systems and network drives
- We write things into folders in multiple use cases, which affects those timestamps as well
- Deletions of files may or may not be reflected in the parent folder timestamps

36 minutes ago, adminExitium said:

doing the listing in parallel will be of significant benefit since they will hit different drives rather than doing each drive sequentially

It would - but only when there are no changes - which you cannot know in advance.

37 minutes ago, adminExitium said:

The same performance benefits will also apply to any networked filesystems like Ceph, NFS or SMB too.

Not necessarily. If it's multiple shares on the same remote file system (as is typically the case), then it can make things even slower, because of cache exhaustion/invalidation.

September 12, 2024

The primary cause of inefficiency is the sequential nature of the process:

Get file list item
- Check or add to db
- Run Provider A
- Run Provider B
- Run Provider C
- Run Provider D
Get file list item
- Check or add to db
- Run Provider A
- Run Provider B
- Run Provider C
- Run Provider D

The problem is that you cannot run many of the metadata providers in parallel, as they will error. You even need to obey rate limits for some of them. So, when you parallelize the file system scanning, you would still have to serialize the provider processing (which is what really takes time).
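The sequential flow listed above can be sketched in a few lines of Python. The provider names and per-item costs below are made-up numbers for illustration only, not measurements from Emby or Jellyfin; the point is just that the total time grows as the number of items times the sum of all provider durations.

```python
# Hypothetical per-item cost of each provider in seconds (illustrative values).
PROVIDER_COSTS = {"A": 0.5, "B": 2.0, "C": 1.0, "D": 0.25}


def sequential_scan(items):
    """Process items one at a time; returns (step log, simulated total seconds)."""
    log = []
    total = 0.0
    for item in items:
        log.append(f"{item}: check or add to db")
        for provider, cost in PROVIDER_COSTS.items():
            # The next provider only starts once the previous one has finished.
            log.append(f"{item}: run provider {provider}")
            total += cost
    return log, total
```

With these assumed costs each item takes 3.75 simulated seconds, so three items take 11.25 seconds no matter how quickly the files themselves were listed.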

One possible way would be to interleave processing similar to this:

[Attached image: diagram of the interleaved processing]

Where things run in parallel to some extent, but in a way that each provider gets only a single request at a time. But that's way more complicated than it sounds. Also, not all providers take the same amount of time, and this would equalize the duration for each provider to the max duration of all. So, it needs to be decoupled even further. It's very possible but not an easy change.
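One way to picture the interleaving is a pipeline with one worker per provider, sketched below in Python. This is only an illustration under assumptions, not a design taken from Emby: the stage names, the `run_pipeline` function, and the simulated `work` callback are hypothetical. Each provider handles one item at a time, while different providers work on different items concurrently. The queues between stages are one possible way to loosen the lock-step coupling, although throughput is still bounded by the slowest provider.

```python
import queue
import threading
import time

PROVIDERS = ["A", "B", "C", "D"]  # hypothetical stage names
_DONE = object()  # sentinel marking the end of the item stream


def run_pipeline(items, work=lambda provider, item: time.sleep(0.01)):
    """Pass every item through each provider in order, one thread per provider.

    Returns (finished items, highest number of simultaneous calls per provider).
    """
    queues = [queue.Queue() for _ in range(len(PROVIDERS) + 1)]
    active = {p: 0 for p in PROVIDERS}
    max_active = {p: 0 for p in PROVIDERS}
    lock = threading.Lock()

    def stage(index, provider):
        inbox, outbox = queues[index], queues[index + 1]
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # tell the next stage to stop as well
                return
            with lock:
                active[provider] += 1
                max_active[provider] = max(max_active[provider], active[provider])
            work(provider, item)  # the simulated provider call
            with lock:
                active[provider] -= 1
            outbox.put(item)

    threads = [
        threading.Thread(target=stage, args=(i, p)) for i, p in enumerate(PROVIDERS)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    for thread in threads:
        thread.join()

    finished = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        finished.append(item)
    return finished, max_active
```

Because every stage has exactly one worker, no provider ever sees more than one request at a time, which is the constraint described above.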

September 12, 2024

3 minutes ago, softworkz said:

What are we talking about here?

I have mentioned the timings for both a fresh scan and the average of a few scans when nothing has changed in the media.

I couldn't care less about the time taken for a fresh scan, because no matter how long it takes, it will happen only once. I would think most users would fall in the same category because I doubt anyone is going around repeatedly scanning their libraries from scratch.

Even in the worst case, I doubt I’ve ever modified more than a single-digit percentage of my collection at once (if that; generally it is only a handful of movies or episodes that get added or replaced), which should easily get processed in 10-15 mins and not take up to an hour as was observed.

I understand there are lots of edge cases with optimizing the sequential scan approach, which is why the post was only in support of adding connection pools and just allowing the scans to be processed in parallel, with a default to 1 as before.

I don't think going the other way of optimizing the sequential scan will provide enough of a benefit to most folks to be worth the effort involved.

September 12, 2024

4 minutes ago, adminExitium said:

I don't think going the other way of optimizing the sequential scan will provide enough of a benefit to most folks to be worth the effort involved.

All the talking I've been involved in so far has been about library scan times when new items are added.

TBH, this is the first time I've heard concerns about scan times when nothing has changed.

I understand your standpoint, I'm even closer to your way of viewing it, but most others appear to see it differently.

Edited September 12, 2024 by softworkz

September 12, 2024

1 minute ago, softworkz said:

most others appear to see it differently

Yeah, that's understandable too because the time taken to scan in a fresh library very often determines whether a user will continue with Emby or not. Taking a multiple of the time taken by JF is a good way for the user to just give up and move on altogether without having experienced the vastly superior clients once the library is fully scanned in.

I, however, have been fully invested in the Emby ecosystem for a few years now so it's very rare that I ever need to scan in my library from scratch. This was just an attempt to see how the latest versions of both software fare now.

Hopefully, there is a solution that helps both types of users.
