Audio Fringerprinting (Chromaprint) Segment detection

April 23, 2022

I wanted to continue on some of the ideas we had way back on the Audio fingerprinting segment detection which could be used for things like Intro detection, commercial detection or even areas of interest detection in audio and thus in video streams that have audio.

Chromaprint project : https://github.com/acoustid/chromaprint

The idea is to use a tool like ffmpeg to extract what is called a chromaprint of the audio, this is a stream of data that represents the fingerprint of the audio over time. You can then compare this simpler representation against other chromaprints to find similar sounding audio segments.

We started some of this work over a year ago on this thread:

https://emby.media/community/index.php?/topic/48304-show-intro-skip-option/&do=findComment&comment=952613

And found some interesting ways to use the data. The first being the ability to detect segments of the video using multiple episodes of a tv show the intro theme music that could then be marked to allow auto skipping.

I found this work interesting as it is kind of what I do, look for patterns in very very large data sets on a daily basis. So I went back and had a look at my original python that compared some of the bitstreams and found it was very slow, fine for a POC in small scale but very slow.

The main cost of doing this is in the bitstream comparisons needed to find aligning (similar) audio segments in two potentially large chromaprint bitstreams. The approach I were using to find the bit difference counts was brute force and inefficient. I went on the hunt for a more efficient hamming difference and fond a bunch of bit diff work that pointed to a bit shift approach that is very much faster.

So why all the fuss about this one aspect, well it is super important if you want a performant solution as a LOT of the heavy lifting in finding audio overlaps is comparing a lot of data and if you can do that as fast as possible then it will allows you more options, more exhausting searching etc

I have implemented a simple hamming diff test using my original brute force apporach and the new bit shift approach and it does make a big difference, about 86% or 6.8 time faster.

This bit shifting approach comes from stackoverflow, from a post about fast bit Hamming calcs in Matlab (a data science tool)

https://stackoverflow.com/questions/1024904/calculating-hamming-weight-efficiently-in-matlab

I just adapted it to python for my test but it could be any language that can do bit shifting.

here is the python code:

import time

# simple brute force
def count_bit_diff(left, right):
    count = 0
    for i in range(0, 32):
        if ((left >> i) & 1) != ((right >> i) & 1):
            count = count + 1
    return count

# https://stackoverflow.com/questions/1024904/calculating-hamming-weight-efficiently-in-matlab
# http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
def hamming_diff(left, right):
    # w = bitand(bitshift(w, -1), uint32(1431655765)) + bitand(w, uint32(1431655765));
    # w = bitand(bitshift(w, -2), uint32(858993459)) + bitand(w, uint32(858993459));
    # w = bitand(bitshift(w, -4), uint32(252645135)) + bitand(w, uint32(252645135));
    # w = bitand(bitshift(w, -8), uint32(16711935)) + bitand(w, uint32(16711935));
    # w = bitand(bitshift(w, -16), uint32(65535)) + bitand(w, uint32(65535));

    distance = left ^ right
    distance = ((distance >> 1) & 1431655765) + (distance & 1431655765)
    distance = ((distance >> 2) & 858993459) + (distance & 858993459)
    distance = ((distance >> 4) & 252645135) + (distance & 252645135)
    distance = ((distance >> 8) & 16711935) + (distance & 16711935)
    distance = ((distance >> 16) & 65535) + (distance & 65535)
    return distance

# confirm approaches are the same through the 32 bit address range
points = 20
skip = int(0xFFFFFFFF / points)
loops = 0
while loops <= points:
    right_val = int(loops * skip)
    diff01 = count_bit_diff(0, right_val)
    diff02 = hamming_diff(0, right_val)
    loops += 1
    print("%s %s" % (diff01, diff02))

# do some timing comparisons
# brute force
started = time.perf_counter()
for x in range(0, 2000000):
    diff01 = count_bit_diff(0, x)
finished = time.perf_counter()
print(f"time01 {finished - started:0.4f} seconds")

# faster way
started = time.perf_counter()
for x in range(0, 2000000):
    diff01 = hamming_diff(0, x)
finished = time.perf_counter()
print(f"time02 {finished - started:0.4f} seconds")

Output

time01 19.2995 seconds
time02 2.8164 seconds

That is some nice improvements.

Now that 4.7 is getting close and this version has the new ffmpeg with the chromaprint module compiled in I might get back on this to have a play with what else this sort of thing can be used for. I know the core Emby will soon have intro detection and skipping built in so that might not be as pressing but I am sure there are other areas of investigation. I am not going to write a plugin, this post is just for reference to continue on some of my thoughts on this.

I will probably just add to this thread as I play with this.

Repo with a few of the projects and tools i have built.

https://github.com/faush01/ThemeService

Edited May 13, 2022 by TeamB

April 23, 2022

just for reference the original POC

https://github.com/VictorBitca/matcher

that did some of this work uses a simple xor string count 1 approach

https://github.com/go-fingerprint/fingerprint/blob/29397256b7ff128173743bb28186cf6ee6b33707/fingerprint.go#L128

It would be interesting to see how performant this is.

Edited April 23, 2022 by TeamB

April 23, 2022

Here is a c# version of the above bit shift hamming approach

	// https://stackoverflow.com/questions/1024904/calculating-hamming-weight-efficiently-in-matlab
	// http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
	static uint GetHammingDist(uint x, uint y)
	{
		//w = bitand( bitshift(w, -1), uint32(1431655765)) + bitand(w, uint32(1431655765));
		//w = bitand(bitshift(w, -2), uint32(858993459)) + bitand(w, uint32(858993459));
		//w = bitand(bitshift(w, -4), uint32(252645135)) + bitand(w, uint32(252645135));
		//w = bitand(bitshift(w, -8), uint32(16711935)) + bitand(w, uint32(16711935));
		//w = bitand(bitshift(w, -16), uint32(65535)) + bitand(w, uint32(65535));
		
		uint distance = x ^ y;
		distance = ((distance >> 1) & 1431655765U) + (distance & 1431655765U);
		distance = ((distance >> 2) & 858993459U) + (distance & 858993459U);
		distance = ((distance >> 4) & 252645135U) + (distance & 252645135U);
		distance = ((distance >> 8) & 16711935U) + (distance & 16711935U);
		distance = ((distance >> 16) & 65535U) + (distance & 65535U);
		return distance;
	}

I have not tested this for performance yet, it is literally just a conversion of the original Matlab code.

April 23, 2022

Cool, thanks for sharing.

April 23, 2022

Comparing different hamming approaches, using string compare:

GO Code

func hamming(a, b int32) (dist int) {
	dist = strings.Count(strconv.FormatInt(int64(a^b), 2), "1")
	return
}

Python Code

def count_string_diff(left, right):
    diff = left ^ right
    diff_string = "{0:b}".format(diff)
    return diff_string.count("1")

python comp

String compare : 1.3580 seconds
Brute force : 12.4490 seconds
Bit shift : 1.9091 seconds

Wow I did not expect that

hamming_diff.py

Edited April 23, 2022 by TeamB

April 24, 2022

and some comparisons for the c# version

Bit shift : 0.1335036
String compare : 6.2134211
Brute force : 2.4104883

Bit shift still kicks ass

hamming_distance.cs

April 24, 2022

Congrats on that. Bit shifting through the hamming distance algorithm, can most likely be simplified even further though.

Edited April 24, 2022 by chef

April 24, 2022

April 24, 2022

4 hours ago, chef said:

Congrats on that. Bit shifting through the hamming distance algorithm, can most likely be simplified even further though.

No gratz needed on this, bit shifting bin data to get what you need has been around for longer than i have been doing this sort of work. I think that post i found was from 12 years ago and i remember doing this sort of work in my first year out of uni in 1998 so i did not invent this, i just use it.

Not sure about simplify it further, i had a play but the original 4 byte approach was the fastest from what i could see, any suggestions?

@roaku not sure i understand your post

Edited April 24, 2022 by TeamB

April 24, 2022

13 minutes ago, TeamB said:

No gratz needed on this, bit shifting bin data to get what you need has been around for longer than i have been doing this sort of work. I think that post i found was from 12 years ago and i remember doing this sort of work in my first year out of uni in 1998 so i did not invent this, i just use it.

Not sure about simplify it further, i had a play but the original 4 byte approach was the fastest from what i could see, any suggestions?

@roaku not sure i understand your post

Yes teamB 4 bits is the fastest for hamming distance. Thanks for sharing.

April 24, 2022

@TeamB

It made more sense before other participants deleted and edited their posts.

Edited April 24, 2022 by roaku

April 24, 2022

1 minute ago, roaku said:

@TeamB

It made more sense before other participants deleted and edited their posts.

I love Star Trek.

April 24, 2022

Maybe there is a way to try using the Levenshtein method so that the bin files don't have to equate to the same length.

That way you could host a provider with the chromaprints of just the intro of popular tv shows, and people could use the provider to compare any fingerprints on their system dispite duration.

That would be super cool.

April 24, 2022

I think if you were going to host this as a REST API you would have a curated chromaprint library of TV Show theme music. This way the user would just throw you there first 10 or 15 min of chromaprint data and you would try to "find" the sub-sequence of the themes you know that show season has. This is how my very first example of this approach worked, you gave it a known snip it of chromaprint theme data and it tried to find the best match in a larger stream. I did not follow through with that approach as everyone at the time was sure the stand alone approach that tried to detect similar segments in two full chromaprint extractions was the way to go.

The advantage in trying to find just a smaller sequence in a larger extraction is that you are comparing less data overall.

The disadvantage with REST API service approach is the server would be doing all the heavy lifting of comparing the chromaprint data streams and you would need a very beefy server to handle all the user requests. Not sure that would scale very well.

The alternative is having the service just host all the chromaprints data for the themes and you request the chromaprint/s for the series and season you want to scan. This is the best of both worlds, you have a curated list of theme chromaprints and the client is doing all the heavy data comparing still.

Edited April 24, 2022 by TeamB

April 24, 2022

31 minutes ago, chef said:

I love Star Trek.

image.png.879fd58acf46eb550af87ff0ae6c8503.png

Edited April 24, 2022 by TeamB

April 24, 2022

I think that in time someone or some company will curate a large library of chromaprints for tv shows. They do it for music already...

But I agree, I believe hosting the intro data which could be aquired through Rest API, then compared on the client would be the best way. It's more then what is available out there right now.

But, moving the intro fp across a larger sample to find some kind of contiguous region would be something new, and very cool in my opinion, and sparks interest.

Currently, everything you have mentioned so far in this thread is done... like... exactly... I mean.... exactly .... as you describe. LOL!

So, if you decide to take it to the next level, I think that would be really awesome, and definitely worth the investment of time.

April 25, 2022

@chef

here is my original experiment that tried to "find" a smaller segment i.e. the theme song in a larger longer data set, the idea was you provide the theme chromaprint and then let the client look for the offsets in its local data.

https://emby.media/community/index.php?/topic/48304-show-intro-skip-option/page/8/#comment-952613

I am not sure a company will try to create a curated list of chromaprints of theme songs, its just not worth their while, even the AucusticId project that tools like picard use was populated and maintained by a community of users. i.e. they fingerprinted their music and uploaded the data. Because it is just fingerprint data and not music content it is fine to host.

3 hours ago, chef said:

Currently, everything you have mentioned so far in this thread is done... like... exactly... I mean.... exactly .... as you describe. LOL!

So there is a web service already?

Edited April 25, 2022 by TeamB

April 25, 2022

10 minutes ago, TeamB said:

@chef

So there is a web service already?

no B, I mean the description of distance calculations, and the fp comparisons are almost note for note exactly what it is.

That's why I mentioned something next level would be cool, because switching bits in a fast hamming distance algorithm is already implemented.

I ve always thought it was a cool idea to see some kind of chromaprint host. I wonder what the demand for something like that is?

I also wonder how much faster you could compare and detect similarities?

But then, there is the end credit detection.

From my experience, the chromaprints just weren't cutting it, especially now a days when tv shows are changing the credits every show.

Black frame detection in ffmpeg is a cpu hog. There has to be a better way of finding out where those credits start. Even if the end is just the end of the stream.

Having the client show the "Start Now" button at the exact moment the credits roll that's a cool trick!

April 25, 2022

49 minutes ago, chef said:

no B, I mean the description of distance calculations, and the fp comparisons are almost note for note exactly what it is.

That's why I mentioned something next level would be cool, because switching bits in a fast hamming distance algorithm is already implemented.

I just remember back to when I was playing with this over a year ago now that I was using a really simple brute force approach for the POC, I started this thread again so I might talk about the technical side of what it takes to use Audio chromaprint data, and that the first step was a fast bit difference calculation and that is where I started this discussion off. I did not know there was one already as I did not read every post in this thread. So at least we have the info in the one place now

49 minutes ago, chef said:

I ve always thought it was a cool idea to see some kind of chromaprint host. I wonder what the demand for something like that is?

not sure on demand, if it worked well then I guess people would use it.

49 minutes ago, chef said:

But then, there is the end credit detection.

From my experience, the chromaprints just weren't cutting it, especially now a days when tv shows are changing the credits every show.

Black frame detection in ffmpeg is a cpu hog. There has to be a better way of finding out where those credits start. Even if the end is just the end of the stream.

Having the client show the "Start Now" button at the exact moment the credits roll that's a cool trick!

I have not really thought about end credits, I can see myself using intro skip but end credits I just hit skip to end and the next episode starts playing.

As soon as you step into the video frame processing world the order of magnitude of the work (cpu, disk io) needed goes through the roof. Its something I have played with in the distant past with add/commercial skipping but it is very costly.

Edited April 25, 2022 by TeamB

April 25, 2022

5 hours ago, chef said:

I ve always thought it was a cool idea to see some kind of chromaprint host. I wonder what the demand for something like that is?

I also wonder how much faster you could compare and detect similarities?

As you know @chef, creation of the FP has always been the 'slow' part of the entire process - the detection itself (especially the fast method) has never been particularly taxing on any half decent system - so it's not like a remote lookup is going to be significantly faster for detection - the gains are going to be the fact you don't need to generate the FP's in the first place.

If you recall when discussed, we had a few view opposing views on this - some of us thought there would be so many varieties of file that the chances of getting an exact match were not all that great - others thought a hash match would be relatively easy to do.

I don't think the actual intro detection is the issue, the issue is the timing on where that intro exists in the file. Of course you can have multiple versions -

TV broadcast version (recap and no-recap)
Media versions (dvd, bluray,uhd etc)
distributor X version (ie 'Netflix' intro)
Blah blah..

Every one of these would have a different 'start' and 'end' time for the Intro - this is the problem.

I guess you could hash the file (like what is done for subtitles today) - but for anybody who has created their own version, then a hash is simply not going to match. For the downloaders, then you are going to need a hash for every 'release'..

I think it's a useful exercise - and I'm onboard with testing for sure, but as a user who authors each file (I add compatibility tracks and remove all unnecessary tracks for example) then I'm never going to get a hash match.

Edited April 25, 2022 by rbjtech

April 25, 2022

5 hours ago, TeamB said:

I have not really thought about end credits, I can see myself using intro skip but end credits I just hit skip to end and the next episode starts playing.

As soon as you step into the video frame processing world the order of magnitude of the work (cpu, disk io) needed goes through the roof. Its something I have played with in the distant past with add/commercial skipping but it is very costly.

End Credit skipping is the icing on the cake for the Introskip Plugin - for binge watching it works perfectly without you even needing the remote for the entire session. Many people also watch TV like this as part of daily chores and simply don't have the remote to hand - so in this scenario, skip end credits is actually just as important as IntroSkip imho.

Agreed 100% about the resources though - the BlackFrame detection can be a system killer - we use FP to get the CreditStart and then use BlackFrame to make it accurate - so we are not video scanning more than we need to but vs Audio matching, it is indeed very resource intensive especially as we have no known way to throttle ffmpeg during this process.

April 25, 2022

19 minutes ago, rbjtech said:

but as a user who authors each file (I add compatibility tracks and remove all unnecessary tracks for example) then I'm never going to get a hash match.

what is this hash you keep referring to?

April 25, 2022

16 minutes ago, TeamB said:

what is this hash you keep referring to?

Perhaps I'm missing something here - but just because you can match a Intro to a series - you still have no idea where in the episode that Intro is. Thus you need to match both the Intro itself, the season/episode and the version of the file you have. One way to do that is to hash the file itself - after which, you don't actually need to even detect the Intro, as you have a hash match anyway. But personally, I don't think this is workable.

April 25, 2022

ah - sorry light bulb moment - I think I am talking about a different method - you are referring to keeping the actual FP of just the 'intro' online - and then use that to compare against all the locally written FP files ?

April 25, 2022

40 minutes ago, rbjtech said:

ah - sorry light bulb moment - I think I am talking about a different method - you are referring to keeping the actual FP of just the 'intro' online - and then use that to compare against all the locally written FP files ?

yes that is correct, still need to "detect" the intro but it would use a Chromaprint from an online library of known good theme audio chromaprints to do the search.

Audio Fringerprinting (Chromaprint) Segment detection

Recommended Posts

TeamB 2352

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

Luke 37065

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

roaku 795

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

roaku 795

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

chef 3745

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

rbjtech 4265

Link to comment

Share on other sites

rbjtech 4265

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

rbjtech 4265

Link to comment

Share on other sites

rbjtech 4265

Link to comment

Share on other sites

TeamB 2352

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in