Movie matching accuracy


MBSki

Thanks @GrimReaper. I was hoping whatever the fix was for O (2001) might be related. Guess we still need some more logic to catch more of these issues.


rbjtech

...and around we go. How many times has this been raised and never solved? Passing the buck to the API results is not good - if the user is bothering to provide the year, and in some cases even the provider id, then Emby should use this to further qualify the API's returned results.

  • Like 2

MBSki
2 hours ago, ebr said:

We could certainly try to apply some sort of additional logic once the results come back, but that will have a potentially large impact on performance, so we have to weigh the value of that against how often there actually are issues like this.

@ebr I think the point is that it needs to be fixed one way or another. "Potential" issues are just that: potential. That doesn't mean they will ever become a reality, and in the meantime users aren't receiving the results that they should. Can you at least put it in the queue to get looked at?


ebr

Well, in other words - is improving the accuracy by 1% (I'd say these errors are on no more than 1% of titles) worth slowing down the library scan 100% of the time? That depends on just how much it would slow down. If it is even 50% slower, that is significant - especially on the initial scan. And then there is the potential that this additional "intelligence" actually screws up some titles that are matched correctly now.

So, it isn't a slam dunk. It would take some careful consideration and testing.


MBSki
7 minutes ago, ebr said:

Well, in other words - is improving the accuracy by 1% (I'd say these errors are on no more than 1% of titles) worth slowing down the library scan 100% of the time? That depends on just how much it would slow down. If it is even 50% slower, that is significant - especially on the initial scan. And then there is the potential that this additional "intelligence" actually screws up some titles that are matched correctly now.

So, it isn't a slam dunk. It would take some careful consideration and testing.

@ebr You're just making numbers up. It shouldn't happen at all when it's such an obvious match. I'm just asking that you put it in the queue. Why are you resisting just putting it in the queue?

 

  • Agree 1

rbjtech

The 'slow' bit is surely getting the API results in the first place - the extra logic, applied to results already in memory, should be minuscule in terms of extra overhead. I doubt it would even be noticed, but people would appreciate the increased accuracy for sure - and it's actually one thing Emby can improve its 'detection' statistics on, to match or better its competition.

If the logic did have an overhead, then again, make it a simple 'option' - on by default; to sacrifice accuracy for speed, blah blah, turn it off.
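
A minimal sketch of that kind of in-memory post-filter, assuming a hypothetical RemoteSearchResult type with Name and ProductionYear properties (not Emby's actual provider types):

    // Minimal sketch of a post-query year filter. RemoteSearchResult is
    // a hypothetical stand-in, not Emby's actual provider type.
    using System.Collections.Generic;
    using System.Linq;

    public class RemoteSearchResult
    {
        public string Name { get; set; }
        public int? ProductionYear { get; set; }
    }

    public static class ResultFilter
    {
        // Narrow the already-fetched results using the year the user
        // supplied (e.g. parsed from "Life (2017)"). Falls back to the
        // original list so a missing or mismatched year never leaves us
        // with fewer candidates than we started with.
        public static List<RemoteSearchResult> FilterByYear(
            List<RemoteSearchResult> results, int? year)
        {
            if (year == null)
            {
                return results;
            }

            var filtered = results.Where(r => r.ProductionYear == year).ToList();
            return filtered.Count > 0 ? filtered : results;
        }
    }

Since the list is already in memory, this is a single linear pass over a handful of results - nothing compared to the network round trip that fetched them.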

 

  • Like 1
  • Agree 1

Happy2Play

If more logic is not going to be added for better matching from the returned items, I would rather see anything without a 100% match put in a failed list for manual intervention, instead of taking the wrong first item.

To me, a failed fetch is better than a wrong fetch, but that would need an interface to show these failed items.
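
As a rough sketch of the idea, reusing the hypothetical RemoteSearchResult type from the earlier sketch (again, not Emby's actual code):

    // Sketch of the "failed list" idea: anything without an exact
    // title (+ year, when known) match gets flagged for manual review
    // instead of silently accepting the provider's first result.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class MatchTriage
    {
        public static (RemoteSearchResult Match, bool NeedsReview) Triage(
            List<RemoteSearchResult> results, string title, int? year)
        {
            var exact = results.FirstOrDefault(r =>
                string.Equals(r.Name, title, StringComparison.OrdinalIgnoreCase)
                && (year == null || r.ProductionYear == year));

            return exact != null
                ? (exact, false)                    // confident match
                : (results.FirstOrDefault(), true); // queue for manual review
        }
    }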

  • Like 2
  • Agree 2

ebr
16 minutes ago, MBSki said:

It shouldn't happen at all when it's such an obvious match

Unfortunately, a computer cannot determine that without running through logic to tell it so.

It would definitely slow down the scan. The only question is how much. It is something that can be looked at, but library scan speed is one of the things new users complain about most, so I would be willing to accept a <1% inaccuracy rather than decrease the scan speed by anything noticeable. The thing is, you have to look at all of the searches, and most of them (the VAST majority) are already correct.


chef

In my experience, the providers always return the correct match within at least the top three results.

String comparisons are taxing.

The fastest way to find the absolute correct match (not to preach to the choir here... not my intent) is regex.

    // Requires: using System.Collections.Generic;
    //           using System.Text.RegularExpressions;

    protected WhateverProviderResult ResolveItem(List<WhateverProviderResult> results, string input)
    {
        // Escape the input so titles containing regex metacharacters
        // (e.g. "(500) Days of Summer") don't break the pattern.
        var pattern = "^" + Regex.Escape(input);
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var bestMatch = -1;
        var bestNameLength = int.MaxValue;

        for (var ctr = 0; ctr < results.Count; ctr++)
        {
            // Anchored at ^, so "Life" matches "Life" and "Life of Pi"
            // but not "A Bug's Life".
            var match = regex.Match(results[ctr].Name);

            if (!match.Success) continue;

            // The pattern is a fixed literal, so every successful match
            // has the same length. Break ties by preferring the candidate
            // whose full name is shortest - i.e. closest to an exact match.
            var nameLength = results[ctr].Name.Length;
            if (nameLength < bestNameLength)
            {
                bestMatch = ctr;
                bestNameLength = nameLength;
            }
        }

        if (bestMatch >= 0)
        {
            return results[bestMatch];
        }

        return null; // <-- or just FirstOrDefault(), which is what is returned anyway.
    }

 

But in that example you are still iterating the results, you are keeping reference types (which means you are allocating memory), and you are going to do this for each item in your library.

I can see Eric's point about why this would get increasingly taxing at scale.

Maybe there are better ways to do it (my example is probably not that great), and maybe you could pull out some magic using Span<T> to keep things off the heap... but it will be taxing, and library scan times will increase. Which means more users posting about why their library scans are stuck.
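
For what it's worth, a minimal sketch of that Span<T> idea - allocation-free, case-insensitive comparisons via System.MemoryExtensions, with no lowercased copies of either string:

    using System;

    public static class TitleCompare
    {
        // Exact, case-insensitive comparison; the spans are compared
        // in place, so nothing is allocated on the heap.
        public static bool ExactMatch(string candidate, string input) =>
            candidate.AsSpan().Equals(input.AsSpan(), StringComparison.OrdinalIgnoreCase);

        // Span equivalent of the anchored regex above: does the
        // candidate's name start with the input?
        public static bool PrefixMatch(string candidate, string input) =>
            candidate.AsSpan().StartsWith(input.AsSpan(), StringComparison.OrdinalIgnoreCase);
    }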

 


ebr

It is definitely something that can be looked at, but I'm just trying to set expectations properly and make the point that 100% accuracy isn't necessarily the best outcome.

The fact that the provider search engines are having issues with these particular titles is evidence that these are edge cases with tough comparisons. And then you also have to realize that ANY manipulation of the results from the provider carries the real potential (I would say probability) of creating new mismatches that the provider search logic already resolved properly.


MBSki
3 minutes ago, ebr said:

It is definitely something that can be looked at, but I'm just trying to set expectations properly and make the point that 100% accuracy isn't necessarily the best outcome.

The fact that the provider search engines are having issues with these particular titles is evidence that these are edge cases with tough comparisons. And then you also have to realize that ANY manipulation of the results from the provider carries the real potential (I would say probability) of creating new mismatches that the provider search logic already resolved properly.

Look at my first example comparing Life and A Bug's Life. How is that a tough comparison? 


Spaceboy
47 minutes ago, Happy2Play said:

If more logic is not going to be added for better matching from the returned items, I would rather see anything without a 100% match put in a failed list for manual intervention, instead of taking the wrong first item.

To me, a failed fetch is better than a wrong fetch, but that would need an interface to show these failed items.

totally agree with this

  • Agree 1

rbjtech
14 hours ago, Happy2Play said:

If more logic is not going to be added for better matching from the returned items, I would rather see anything without a 100% match put in a failed list for manual intervention, instead of taking the wrong first item.

To me, a failed fetch is better than a wrong fetch, but that would need an interface to show these failed items.

Currently it's just taking the top returned result (I believe) - how does it know it has 'failed' without applying logic? And if you are applying that logic, why not attempt to correct the match?

You can apply lean logic to catch the obvious 'fails' - if you have the year, for example. For full string-based comparisons, then yes, that may be a little more taxing - or put those in a 'list'.

  • Agree 1

pwhodges

If the first match is not exact, that should trigger a check to see if an exact match is somewhere in the list, I think.
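
A minimal sketch of that fallback, again using the hypothetical RemoteSearchResult type from earlier (not Emby's actual code):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ExactFallback
    {
        // Keep the provider's own ranking unless its first result is
        // not an exact title match and an exact match exists further
        // down the list.
        public static RemoteSearchResult Pick(List<RemoteSearchResult> results, string title)
        {
            var first = results.FirstOrDefault();
            if (first == null)
            {
                return null;
            }

            if (string.Equals(first.Name, title, StringComparison.OrdinalIgnoreCase))
            {
                return first;
            }

            // Only pay for the extra scan when the first result was
            // not already exact.
            var exact = results.FirstOrDefault(r =>
                string.Equals(r.Name, title, StringComparison.OrdinalIgnoreCase));

            return exact ?? first;
        }
    }

In the common case the first result is already exact, so the extra cost is a single string comparison per item.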

Paul

  • Like 1
  • Agree 2

MBSki

@ebr So, can this be added to the queue and scoped? Seems there are a couple of options that wouldn't kill performance.

  • Thanks 1

ginjaninja

For those who like a project, here is a script which attempts to find better TMDB matches. Unfortunately it is not very accessible, as it requires you to request a TMDB API key, and it has only been tested on English titles.

Emby is correct the vast majority of the time. My frustration is that all of my movie additions (folders and filenames) are matched to TMDB, with the correct image, when they are added to the file system, so it's a bit galling when Emby occasionally chooses wrong. Since all my movies have their TMDB image before Emby sees them, it took a long time to notice when Emby's choice was not optimal - and I like my metadata biography surfing... hence the script.

It's worth noting (although things may have improved in the few years since I last built the library from scratch) that it may not be just TMDB responses that could benefit from a bit of post-processing. There was a time when the artist "Duran Duran" was detected as "Duran Duran Duran".

 

 

  • Like 1
  • Thanks 1
