Three.js Audio-Reactive Humanlike Stickmen v4.11

What this is

Pick a song, and ten stick figures dance to it. The trick is they'll dance to whatever you give them: a track from the library, an MP3 you just uploaded, or something you found on YouTube.

How it actually works

When the audio plays, the browser runs an FFT (fast Fourier transform) on it. An FFT is basically a way of asking "how much bass is in this sound right now? how much mid? how much treble?" It splits the audio into frequency bands.
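
Here's a rough sketch of the "how much bass, mid, treble" step, not the project's actual code. It takes an array shaped like what the Web Audio AnalyserNode hands back from getByteFrequencyData() (one 0-255 value per frequency bin) and averages it into three bands. The function name and the 250 Hz / 4000 Hz band edges are my assumptions for illustration.

```javascript
// Sketch: collapse FFT magnitude bins into bass / mid / treble averages.
// `bins` holds one 0-255 value per bin; bin i covers roughly
// i * sampleRate / fftSize Hz. Band edges below are assumed, not tuned.
function bandEnergies(bins, sampleRate, fftSize) {
  const hzPerBin = sampleRate / fftSize;
  const bands = {
    bass: [0, 250],
    mid: [250, 4000],
    treble: [4000, sampleRate / 2],
  };
  const out = {};
  for (const [name, [loHz, hiHz]] of Object.entries(bands)) {
    const start = Math.floor(loHz / hzPerBin);
    const end = Math.min(bins.length, Math.floor(hiHz / hzPerBin));
    let sum = 0;
    for (let i = start; i < end; i++) sum += bins[i];
    // Normalise to 0..1 so the bands are comparable.
    out[name] = end > start ? sum / (end - start) / 255 : 0;
  }
  return out;
}
```

In the browser you'd fill `bins` each frame with analyser.getByteFrequencyData(bins), where `bins` is a Uint8Array of length analyser.frequencyBinCount, and pass in the audio context's sampleRate and the analyser's fftSize.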

On every beat, each stickman grabs a new dance move, weighted by the current tempo.
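
One way a tempo-weighted pick could look, as a minimal sketch. The move table, the Gaussian weighting, and the 25 BPM width are all made up for illustration; the real move list and weights aren't shown here.

```javascript
// Hypothetical move table: each move has a tempo (BPM) it suits best.
const MOVES = [
  { name: "sway", bpm: 80 },
  { name: "two-step", bpm: 110 },
  { name: "jump", bpm: 140 },
];

// Weight falls off with distance from the move's preferred tempo.
// The Gaussian shape and the default width are assumptions.
function moveWeight(move, tempo, width = 25) {
  const d = (tempo - move.bpm) / width;
  return Math.exp(-0.5 * d * d);
}

// Weighted random choice. `rand` is injectable so the pick can be tested.
function pickMove(tempo, rand = Math.random) {
  const weights = MOVES.map((m) => moveWeight(m, tempo));
  const total = weights.reduce((a, b) => a + b, 0);
  let r = rand() * total;
  for (let i = 0; i < MOVES.length; i++) {
    r -= weights[i];
    if (r < 0) return MOVES[i];
  }
  return MOVES[MOVES.length - 1]; // guard against floating-point leftovers
}
```

Calling pickMove(currentTempo) once per beat for each stickman would give fast songs mostly fast moves while still letting a slow one slip in now and then.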

The stick figures are Three.js. Each one is a rigged skeleton, and the picked move animates the joints. If a chorus is detected, the stickman breaks out into a special dance scene (if the algorithm works correctly, that is. Turns out it's pretty hard lol).

How the chorus is found

Finding the chorus is its own little puzzle, and I do it two different ways depending on the song.

For library tracks, a Node script measures how strongly each of the 12 notes is sounding at every beat (a chroma vector), then looks for sections that show up more than once with the same harmonic shape. The longest, most repeated, loudest section wins. That's the chorus.

For uploaded MP3s, I can't do that offline pass, so the browser just watches live audio energy. When it crosses a high enough threshold for long enough, the stickmen treat it as chorus and break out the bigger moves.
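
The "high enough for long enough" idea can be sketched as a tiny gate. The threshold and hold time below are assumed values, not the project's tuning, and createChorusGate is a name I made up.

```javascript
// Sketch of a "loud for long enough" gate.
// threshold: energy level (0..1) that counts as loud. Assumed value.
// holdSeconds: how long it must stay loud before we call it a chorus. Assumed value.
function createChorusGate({ threshold = 0.7, holdSeconds = 4 } = {}) {
  let aboveSince = null; // time when energy first crossed the threshold
  return function update(energy, nowSeconds) {
    if (energy < threshold) {
      aboveSince = null; // any dip resets the timer
      return false;
    }
    if (aboveSince === null) aboveSince = nowSeconds;
    return nowSeconds - aboveSince >= holdSeconds;
  };
}
```

You'd call the returned function once per animation frame with the current energy and playback time, and switch to the bigger moves while it returns true.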

The math, if you want it

The offline detector builds a self-similarity matrix: an N×N grid where each cell is the cosine similarity between two beats' chroma vectors. Repeating sections show up as bright diagonal stripes offset from the main diagonal. I sweep across lags, smooth each diagonal with a box filter, threshold at 0.55, and treat any stripe at least 12 beats long (3 bars in 4/4) as a candidate. Overlapping repeats merge into instances, instances connect into sections, and the section with the highest repetition² × loudness × brightness × vocal-proxy score wins. If the top section doesn't beat the runner-up by at least 1.3×, I throw the result out and the runtime falls back to live energy thresholding.
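
A stripped-down sketch of the stripe search and the final scoring, under stated assumptions. It skips the merge-into-instances and connect-into-sections steps entirely. The 0.55 threshold, the 12-beat minimum, and the 1.3× margin come from the description above. The box width of 4, starting the lag sweep at 12 beats, accepting a lone candidate, and the field names on the section objects are my guesses.

```javascript
// Cosine similarity between two chroma vectors (12 numbers each).
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// For each lag, walk the diagonal (i, i + lag) of the self-similarity
// matrix, box-smooth it, and collect runs that stay above the threshold.
function findStripes(chroma, { threshold = 0.55, minLen = 12, box = 4 } = {}) {
  const n = chroma.length;
  const stripes = [];
  for (let lag = minLen; lag <= n - minLen; lag++) {
    const len = n - lag;
    const diag = [];
    for (let i = 0; i < len; i++) diag.push(cosine(chroma[i], chroma[i + lag]));
    // Box filter: plain moving average, clamped at the ends.
    const smooth = diag.map((_, i) => {
      const lo = Math.max(0, i - Math.floor(box / 2));
      const hi = Math.min(len, lo + box);
      let s = 0;
      for (let j = lo; j < hi; j++) s += diag[j];
      return s / (hi - lo);
    });
    let start = -1;
    for (let i = 0; i <= len; i++) {
      const on = i < len && smooth[i] >= threshold;
      if (on && start < 0) start = i;
      if (!on && start >= 0) {
        if (i - start >= minLen) stripes.push({ lag, start, length: i - start });
        start = -1;
      }
    }
  }
  return stripes;
}

// Score = repetitions^2 * loudness * brightness * vocalProxy (hypothetical
// field names). Returns null when the winner is not clear enough, which
// is the cue to fall back to live energy thresholding.
function pickChorus(sections, margin = 1.3) {
  const scored = sections
    .map((s) => ({ ...s, score: s.repetitions ** 2 * s.loudness * s.brightness * s.vocalProxy }))
    .sort((a, b) => b.score - a.score);
  if (scored.length === 0) return null;
  if (scored.length > 1 && scored[0].score < margin * scored[1].score) return null;
  return scored[0];
}
```

Squaring the repetition count means a section that repeats three times outweighs one that repeats twice by 9 to 4 before loudness and the other factors even come into it.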

Why the library is curated

I was worried about the legality of hosting MP3s directly on the site, so I wrote a small YouTube embedder that streams the audio from YouTube instead.

The catch is that running FFT analysis live on a YouTube stream is rough. The audio comes through a sandboxed iframe and the browser doesn't hand you the raw waveform the way it does for an uploaded file. So for every track in the library, I run a separate Node script ahead of time that fetches the audio, runs the FFT, and saves the result as a JSON file. The browser just loads that file and choreographs from it.
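
To give a feel for the "loads that file and choreographs from it" part, here's a minimal lookup sketch. The file shape (a fixed hop in seconds plus an array of per-frame band values) is purely hypothetical, as is the frameAt name. The real JSON layout isn't shown here.

```javascript
// Hypothetical file shape:
//   { hopSeconds: number, frames: [{ bass, mid, treble }, ...] }
// Given the current playback time, return the matching pre-computed frame.
function frameAt(analysis, timeSeconds) {
  const i = Math.floor(timeSeconds / analysis.hopSeconds);
  // Clamp so seeking before the start or past the end still returns a frame.
  const clamped = Math.min(analysis.frames.length - 1, Math.max(0, i));
  return analysis.frames[clamped];
}
```

With something like this, the render loop only needs the embedded player's current playback time, never its raw audio.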

That's why the library is fixed. Every track in it has a pre-rendered FFT file. If you upload your own MP3, the analysis happens live in the browser instead.

Why FFTs are everywhere

The same FFT math shows up in Shazam (matching a song from a few seconds of audio), MP3 compression (deciding what parts of a sound you won't miss), and the background-noise filtering on a lot of phone and video calls. It is quietly everywhere.