(From LinkedIn post by Sofien Kaabar)
You hold your phone up for 10 seconds. It tells you the exact song, the artist, and even the timestamp you're at in the track. That's not magic.
That's a 20-year-old paper by a guy named Avery Wang (Founder/Chief Scientist at Shazam and currently a Principal Research Scientist at Apple), and the math behind it is one of the cleanest ideas in all of computer science.
Sound is just a list of numbers. Your microphone records how loud the air is, thousands of times per second. Two recordings of the same song look completely different as raw numbers. Background noise, different volumes, different rooms. You can't match them directly.
So Shazam doesn't even try to match the raw sound.
Instead, it chops the audio into tiny 0.3-second slices and runs a mathematical operation called the Fourier Transform on each one.
Any sound is secretly just a bunch of pure tones layered on top of each other. The Fourier Transform pulls those tones apart and tells you exactly which frequencies are present and how loud each one is.
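Here's a toy Python example of what that means. Two pure tones get mixed at different volumes, and the Fourier Transform recovers exactly which two frequencies are in the mix. The 440 Hz and 880 Hz tones are made-up inputs for illustration, nothing Shazam-specific:

import numpy as np

# one second of two pure tones mixed together at different volumes
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
mix = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# the Fourier Transform of the mix shows two spikes, at exactly 440 and 880 Hz
spectrum = np.abs(np.fft.rfft(mix))
freqs = np.fft.rfftfreq(len(mix), d=1 / sample_rate)
print(freqs[np.argsort(spectrum)[-2:]])   # prints [880. 440.]: the two tones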
Shazam runs this on each 0.3-second slice, one after another. The "Short-Time" in Short-Time Fourier Transform just means you do it slice by slice, so you know not just which frequencies exist but when they were loud.
Stack all those slices side by side and you get a grid - time running left to right, frequency running bottom to top, brightness showing how loud each frequency is at each moment. This is called a spectrogram.
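Continuing that sketch, the slice-by-slice version is what scipy's stft gives you in one call. The 0.3-second window and the random stand-in for the recorded audio are illustrative assumptions, not Shazam's actual parameters:

from scipy.signal import stft

# stand-in for a real 10-second recording; in practice, the microphone samples
audio = np.random.randn(10 * sample_rate)

window = int(0.3 * sample_rate)   # roughly 0.3 s per slice
freqs, times, spec = stft(audio, fs=sample_rate, nperseg=window)
spectrogram = np.abs(spec)        # rows: frequency, columns: time, values: loudness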
Then it throws away most of that grid. It keeps only the brightest point in each frequency band, per time window. What you're left with is a sparse scatter of dots - like a constellation map of the song's loudest moments.
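A rough sketch of that peak-picking step, continuing from the spectrogram above. The band edges here are made up; the paper doesn't prescribe these exact splits:

# split the frequency axis into a handful of bands and keep the loudest
# bin in each band, for every time slice
band_edges = [0, 10, 20, 40, 80, 160, spectrogram.shape[0]]

peaks = []   # each "dot" is a (time_slice, frequency_bin) pair
for t_idx in range(spectrogram.shape[1]):
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        f_idx = lo + int(np.argmax(spectrogram[lo:hi, t_idx]))
        peaks.append((t_idx, f_idx))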
For every "anchor" dot, Shazam looks at nearby dots and asks how far away they are in both time and frequency. Each relationship becomes a single code.
A 3-minute song produces roughly 10,000 of these codes. Shazam pre-computes them for every song in its catalogue and stores them all in a lookup table keyed by the code.
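Continuing the sketch, that lookup table is just a dictionary from each code to every place it occurs. The one-song "catalogue" below is a stand-in for the real database, built offline with the same steps shown above:

from collections import defaultdict

catalogue = {"example_song": peaks}   # stand-in: {song_id: peak list}

index = defaultdict(list)
for song_id, song_peaks in catalogue.items():
    for code, t_song in make_codes(song_peaks):
        index[code].append((song_id, t_song))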
When you Shazam something, your phone generates the same codes from your noisy snippet and looks them up. If the same song keeps appearing, and the time offsets keep clustering at the same gap, that's your match.
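And the matching step, continuing the sketch: count votes for (song, time offset) pairs and take the biggest spike. A real hit shows one song with one dominant offset; noise spreads its votes thinly everywhere:

from collections import Counter

def best_match(snippet_peaks, index):
    votes = Counter()
    for code, t_snippet in make_codes(snippet_peaks):
        for song_id, t_song in index.get(code, []):
            # same song at the same offset, over and over, means a real match
            votes[(song_id, t_song - t_snippet)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count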
The whole thing runs in under a second. Shazam doesn't search through 10 million songs. It looks up a hash.
That's an O(1) operation - the lookup time stays flat whether the database has 10 million songs or 100 million.
Source: Arin Verma