Speech Spectrogram Simulator: Visualize the Time-Frequency Structure of Sound

simulator intermediate ~10 min
Wideband spectrogram — formant bands visible

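With a 25 ms window, the nominal frequency resolution is Δf = 1/T = 40 Hz, fine enough to show formant bands as dark horizontal bars. At f₀ = 120 Hz the harmonics sit 120 Hz apart, which is wider than this nominal resolution, so individual harmonics may also be partly resolved, depending on the window shape. A clearly wideband view, in which harmonics blur together, needs a shorter window (under about 10 ms).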

Formula

Δf × Δt ≳ 1 (time-frequency uncertainty; the exact constant depends on how Δf and Δt are defined)
Δf = 1 / T_window (frequency resolution)
Harmonic spacing = f₀ (fundamental frequency)
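
As a worked example of the formulas above, the following minimal sketch computes the nominal resolution for a few window lengths and compares it with the harmonic spacing. The function names are hypothetical, and the "resolved" test is a rough rule of thumb that ignores window shape (a tapered window has a wider effective bandwidth than 1/T).

```python
def frequency_resolution_hz(window_s: float) -> float:
    """Nominal frequency resolution, delta_f = 1 / T_window, in Hz."""
    return 1.0 / window_s


def harmonics_resolved(window_s: float, f0_hz: float) -> bool:
    """Rough criterion (an assumption for illustration): harmonics, which are
    spaced f0 apart, are separable when delta_f is finer than that spacing."""
    return frequency_resolution_hz(window_s) < f0_hz


for window_ms in (5, 10, 25, 50):
    window_s = window_ms / 1000.0
    df = frequency_resolution_hz(window_s)
    resolved = harmonics_resolved(window_s, 120.0)
    print(f"{window_ms:>2} ms window: df ~ {df:.0f} Hz, "
          f"harmonics resolved at f0 = 120 Hz: {resolved}")
```

By this criterion a 5 ms window (Δf = 200 Hz) cannot separate 120 Hz harmonics, while a 50 ms window (Δf = 20 Hz) separates them easily.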

Painting Sound in Time and Frequency

The spectrogram, invented at Bell Labs in the 1940s, revolutionized speech science by making the invisible patterns of sound visible. Before spectrograms, phoneticians relied on ear training and crude oscilloscope traces. The spectrograph made it possible to see directly that vowels have characteristic formant patterns, that consonants leave distinct acoustic signatures, and that speech is highly variable from one speaker and context to the next.

The Window Tradeoff

Every spectrogram faces a fundamental tradeoff rooted in the uncertainty principle: you cannot know both the exact time and exact frequency of a sound event simultaneously. A short analysis window (5-10 ms) captures rapid events like stop bursts and glottal pulses with precision, but smears frequency detail. A long window (40-50 ms) reveals individual harmonics as crisp horizontal lines, but blurs temporal events. Most speech analysis uses 20-30 ms as a practical compromise.
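
The tradeoff can be checked numerically. The sketch below is a toy illustration, not the simulator's code: it builds a source with eight equal-amplitude harmonics of a 120 Hz fundamental (no formants), analyzes one Hann-windowed frame centered on t = 0, and counts prominent spectral peaks below 1100 Hz. The sample rate, the peak threshold, and all function names are assumptions chosen for the example.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (illustrative choice)
F0 = 120.0           # Hz, fundamental frequency
N_HARMONICS = 8      # harmonics at 120, 240, ..., 960 Hz


def voiced_sample(t: float) -> float:
    """Toy voiced source: equal-amplitude harmonics of F0."""
    return sum(math.cos(2 * math.pi * k * F0 * t)
               for k in range(1, N_HARMONICS + 1))


def frame_spectrum(window_s: float, freqs_hz):
    """Magnitude spectrum of one Hann-windowed frame centered on t = 0."""
    half = int(round(window_s * SAMPLE_RATE / 2))
    samples = []
    for n in range(-half, half + 1):
        hann = 0.5 * (1 + math.cos(math.pi * n / (half + 1)))
        samples.append((n, hann * voiced_sample(n / SAMPLE_RATE)))
    # Direct evaluation of the transform on the requested frequency grid.
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * f * n / SAMPLE_RATE)
                for n, x in samples))
        for f in freqs_hz
    ]


def count_peaks(mags, rel_threshold=0.5):
    """Count local maxima that reach rel_threshold of the largest value."""
    floor = rel_threshold * max(mags)
    return sum(
        1
        for i in range(1, len(mags) - 1)
        if mags[i] >= floor and mags[i] > mags[i - 1] and mags[i] > mags[i + 1]
    )


freqs = list(range(0, 1101, 10))                      # 0 to 1100 Hz, 10 Hz steps
long_peaks = count_peaks(frame_spectrum(0.050, freqs))   # narrowband-style frame
short_peaks = count_peaks(frame_spectrum(0.005, freqs))  # wideband-style frame
print("50 ms window, prominent peaks:", long_peaks)
print(" 5 ms window, prominent peaks:", short_peaks)
```

With these settings the 50 ms frame should show a separate peak for each of the eight harmonics, while the 5 ms frame merges them into a broad hump. The short frame, in exchange, covers a tenth of the time span, so it localizes events such as stop bursts far more precisely.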

Reading the Patterns

Trained phoneticians can read spectrograms almost like text. Vowels appear as stable formant bands — dark horizontal bars whose frequencies identify the vowel. Stop consonants show gaps (closures) followed by brief bursts. Fricatives appear as high-frequency noise bands. Nasals show low-frequency energy with antiformants. Formant transitions between consonants and vowels reveal the place of articulation — a key cue the brain uses for speech perception.

From Spectrograms to Speech Recognition

Modern automatic speech recognition (ASR) systems can be thought of as sophisticated spectrogram readers. The mel-frequency cepstral coefficients (MFCCs) used in classical ASR are derived from a spectrogram-like representation: a short-time spectrum warped onto the mel frequency scale. Many neural network ASR systems, including those in voice assistants, likewise take spectrogram-like features such as log-mel energies as input. Understanding spectrograms remains essential for debugging and improving speech technology.
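
To make the link to MFCC-style features concrete, here is a minimal sketch of the mel frequency warping that such features apply to the spectrogram's frequency axis. It uses one widely used convention for the mel formula (2595 · log10(1 + f/700)); other variants exist, and a given ASR system may use a different one. The function names and the filter count are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Hz to mel, using the common 2595 * log10(1 + f / 700) convention."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min_hz: float, f_max_hz: float):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Ten illustrative filters between 0 and 4000 Hz.
centers = mel_center_frequencies(10, 0.0, 4000.0)
print([round(c) for c in centers])
```

The centers crowd together at low frequencies, where F1 and F2 live, and spread out at high frequencies, which is why mel-based features keep formant detail while compressing the upper part of the spectrum.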

FAQ

What is a speech spectrogram?

A spectrogram is a visual representation of how the frequency content of a sound changes over time. Time runs along the horizontal axis, frequency on the vertical axis, and intensity is shown by brightness or color. Speech spectrograms reveal formant patterns, voicing, and consonant noise.

What is the difference between narrowband and wideband spectrograms?

Narrowband spectrograms use long analysis windows (>40 ms), resolving individual harmonics as horizontal lines. Wideband spectrograms use short windows (<10 ms), blurring harmonics but showing formant bands and temporal events like stop bursts clearly.

How do you read formants on a spectrogram?

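Formants appear as dark horizontal bands of concentrated energy. For typical adult speech, F1 is the lowest band (roughly 300-800 Hz), F2 is the next (roughly 800-2500 Hz), and F3 is higher (roughly 2000-3500 Hz); the F2 and F3 ranges overlap, so the bands are identified by their order, not by frequency alone. Their trajectories over time reveal vowel transitions and coarticulation.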

What is the time-frequency tradeoff?

The Heisenberg-Gabor uncertainty principle means you cannot simultaneously have perfect time and frequency resolution. Short windows give good time resolution but poor frequency resolution, and vice versa. A 25 ms window is a common compromise for speech.
