Our cleaner tells you it uses "spectral gating" and we'd rather that meant something to you than sound like incense. Here's the whole idea, no math degree required — including the failure sounds, because knowing why a tool breaks is how you use it well.
Any slice of audio — say, 40 milliseconds of you talking over a fan — can be described as a recipe: this much 100 Hz, this much 250 Hz, this much 4 kHz, and so on. Computing that recipe is what a Fourier transform does, and it's cheap enough that your phone does it thousands of times per second without noticing.
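If you'd like to see a recipe come out of the oven, here is a minimal sketch using NumPy. The sample rate and the two-tone test slice are assumptions made up for the illustration; they are not taken from our cleaner.

```python
import numpy as np

SAMPLE_RATE = 16_000            # assumed sample rate, in Hz
N = int(0.040 * SAMPLE_RATE)    # a 40 ms slice is 640 samples at this rate

# Hypothetical test slice: a loud 250 Hz tone plus a quieter 4 kHz tone.
t = np.arange(N) / SAMPLE_RATE
audio_slice = 1.0 * np.sin(2 * np.pi * 250 * t) + 0.3 * np.sin(2 * np.pi * 4000 * t)

# The "recipe": how much of each frequency the slice contains.
# Dividing by N / 2 scales things so a pure tone reads as its own amplitude.
magnitudes = np.abs(np.fft.rfft(audio_slice)) / (N / 2)
freqs = np.fft.rfftfreq(N, d=1 / SAMPLE_RATE)   # frequency of each bin, in Hz

# The two strongest ingredients, loudest first.
top_two = np.argsort(magnitudes)[-2:][::-1]
recipe = [(float(freqs[i]), round(float(magnitudes[i]), 2)) for i in top_two]
print(recipe)   # -> [(250.0, 1.0), (4000.0, 0.3)]
```

The tones were chosen to fit a whole number of cycles into the slice, which is why the numbers come out so clean. Real audio smears across neighbouring frequencies, and that is one reason practical tools apply a window to each frame first.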
The crucial observation: steady noise has a boring recipe. A fan sounds the same this second as last second — its frequency recipe barely changes. Your voice's recipe, meanwhile, dances around constantly. That difference is the entire trick.
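To put a rough number on "boring" versus "dancing", the sketch below measures how much the recipe changes from one frame to the next. Everything in it is an assumption for illustration: a hum with a little hiss stands in for the fan, a gliding tone stands in (crudely) for a voice, and `recipe_change` is a made-up helper, not part of our cleaner.

```python
import numpy as np

SAMPLE_RATE = 16_000     # assumed sample rate, in Hz
FRAME, HOP = 1024, 256   # assumed frame and hop sizes, in samples

def recipe_change(signal):
    """Average frame-to-frame change in the magnitude spectrum (0 = identical recipes)."""
    window = np.hanning(FRAME)
    starts = range(0, len(signal) - FRAME + 1, HOP)
    recipes = np.array([np.abs(np.fft.rfft(signal[s:s + FRAME] * window)) for s in starts])
    step = np.linalg.norm(np.diff(recipes, axis=0), axis=1)
    return float(np.mean(step / np.linalg.norm(recipes[:-1], axis=1)))

rng = np.random.default_rng(0)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # one second of time stamps

# Stand-in for steady noise: a two-tone hum with a little hiss on top.
hum = (np.sin(2 * np.pi * 120 * t)
       + 0.5 * np.sin(2 * np.pi * 240 * t)
       + 0.01 * rng.standard_normal(t.size))

# Crude stand-in for a voice: a tone gliding from 200 Hz up to 2 kHz.
glide = np.sin(2 * np.pi * (200 * t + 900 * t ** 2))

steady = recipe_change(hum)
moving = recipe_change(glide)
print(f"hum: {steady:.3f}   glide: {moving:.3f}")
```

The hum's figure should come out close to zero and the glide's many times larger. One caveat: a real fan is random noise, so any single frame of it wobbles more than this tidy hum does; it is the average recipe over many frames that holds still.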
1 – Fingerprint the noise. Find moments where nobody's talking (or use the sample you marked) and average their frequency recipes. That average is the noise profile: how much energy the fan puts at every frequency.
2 – Slice the audio into overlapping frames — ours are about 46 ms, overlapping 75% — and compute each frame's recipe.
3 – Subtract, per frequency. In each frame, any frequency holding barely more energy than the noise profile predicts is mostly noise: turn it down hard. A frequency towering over the profile is mostly voice: leave it alone. This is the "gate": each of a thousand frequency bands has its own tiny volume fader, adjusted every 11.5 ms or so (a 46 ms frame at 75% overlap), which is more than eighty times a second.
4 – Smooth and rebuild. Raw gating flickers, so the faders are smoothed across neighbouring frequencies and across time, then the frames are woven back together into audio.
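The four steps above can be sketched in a few dozen lines of NumPy. Treat this as an illustration under stated assumptions, not our production code: the 44.1 kHz sample rate, the gain rule (`1 - strength * profile / magnitude`), the `floor`, the 3-wide moving-average smoothing, and every function name here are choices made for the example.

```python
import numpy as np

def stft(x, frame, hop, window):
    """Slice x into overlapping windowed frames and return each frame's spectrum."""
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (n_frames, frame // 2 + 1)

def istft(spec, frame, hop, window, length):
    """Weave the frames back together by windowed overlap-add."""
    out = np.zeros(length)
    weight = np.zeros(length)
    for i, chunk in enumerate(np.fft.irfft(spec, n=frame, axis=1)):
        out[i * hop:i * hop + frame] += chunk * window
        weight[i * hop:i * hop + frame] += window ** 2
    return out / np.maximum(weight, 1e-8)

def smooth(gain, width, axis):
    """Moving average along one axis (the outermost edges are averaged against zeros)."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, gain)

def spectral_gate(audio, noise_sample, frame=2048, hop=512, strength=2.0, floor=0.05):
    """Hypothetical spectral gate; noise_sample must be at least `frame` samples long."""
    window = np.hanning(frame)
    # Step 1: noise profile = average magnitude at each frequency over the noise-only sample.
    profile = np.abs(stft(noise_sample, frame, hop, window)).mean(axis=0)
    # Step 2: overlapping frames. 2048 samples is about 46 ms at 44.1 kHz; hop 512 is 75% overlap.
    padded = np.pad(audio, frame)        # zero-pad both ends so the edges get full overlap
    spec = stft(padded, frame, hop, window)
    mag = np.abs(spec)
    # Step 3: one fader per frequency per frame. Near the profile -> floor; far above it -> about 1.
    gain = np.clip(1.0 - strength * profile / (mag + 1e-12), floor, 1.0)
    # Step 4: smooth the faders across neighbouring frequencies, then across time, and rebuild.
    gain = smooth(smooth(gain, 3, axis=1), 3, axis=0)
    cleaned = istft(spec * gain, frame, hop, window, len(padded))
    return cleaned[frame:frame + len(audio)]

# Usage on made-up audio: a 440 Hz "voice" burst buried in steady hiss.
rng = np.random.default_rng(0)
sr = 44_100
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t) * ((t > 0.4) & (t < 0.6))
noisy = tone + 0.05 * rng.standard_normal(sr)
cleaned = spectral_gate(noisy, noise_sample=noisy[: sr // 4])   # first quarter second is noise only
```

In this toy case the hiss in the silent stretches should drop substantially while the tone passes through nearly untouched. The `strength` and `floor` defaults are guesses that behave sensibly on the toy signal; they say nothing about where our own slider sits.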
"Musical noise" — faint watery twinkling in the quiet parts — is what under-smoothed gating sounds like: random frequency bins winking on and off. Our temporal smoothing exists specifically to tame this; hearing it means strength is set too hot for how variable your noise is.
The underwater voice happens when subtraction gets greedy. Consonants and breath sounds are quiet and noise-like by nature; an aggressive gate eats their edges and speech turns soft and gargly. This is the tool working exactly as designed on the wrong settings — which is why the strength slider and the A/B habit matter more than any algorithm choice.
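A little arithmetic shows how quickly the quiet sounds get eaten. The gain rule below is a hypothetical soft gate invented for this example (our cleaner's exact rule isn't spelled out here), and the two levels are assumed values: a vowel 20 times above the noise profile and a consonant only 1.5 times above it.

```python
def fader(level_over_noise, strength, floor=0.05):
    """Gain for one frequency band, given its level as a multiple of the noise profile.

    Hypothetical rule: 1 - strength / level, kept between `floor` and 1.
    """
    return max(floor, min(1.0, 1.0 - strength / level_over_noise))

VOWEL, CONSONANT = 20.0, 1.5   # assumed levels relative to the noise profile

results = []
for strength in (1.0, 2.0, 4.0):
    row = (strength, round(fader(VOWEL, strength), 2), round(fader(CONSONANT, strength), 2))
    results.append(row)
    print(row)
# -> (1.0, 0.95, 0.33)
# -> (2.0, 0.9, 0.05)
# -> (4.0, 0.8, 0.05)
```

Under these assumptions the vowel barely notices the slider, while the consonant is already down to a third at the gentlest setting and pinned to the floor from the middle setting on. That lopsidedness is the gargle.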
And the honest boundary, one more time: all of this assumes the noise holds still. AI dialogue isolators (the upload-your-file services) attack the moving-noise problem with learned models of what speech is — genuinely different machinery, with its own artifacts and its own privacy bill. For hiss, hum, fans and rooms, the forty-year-old trick, run locally, holds up remarkably well. You now know exactly what the button does.