Pitch Tracking Simulator: Visualize Speech Intonation Contours

Falling contour — declarative intonation

A base frequency of 120 Hz with a 4-semitone range produces a natural-sounding male declarative contour. The pitch peaks on the stressed syllable and falls to signal sentence finality.

Formula

Semitones = 12 × log₂(f₁ / f₂)
f₀ = 1 / T₀ (fundamental period to frequency)
Jitter (%) = 100 × mean|T_i - T_{i+1}| / mean(T_i)
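The three formulas above can be checked with a short worked example. This is an illustrative sketch, not the simulator's code; the sample period values are made up for the demonstration.

```python
import math

def semitones(f1, f2):
    """Interval in semitones between two frequencies: 12 * log2(f1 / f2)."""
    return 12 * math.log2(f1 / f2)

def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive
    fundamental periods, as a percentage of the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# A 4-semitone rise above a 120 Hz base lands near 151 Hz:
peak = 120 * 2 ** (4 / 12)
print(round(peak, 1))                      # ≈ 151.2
print(round(semitones(peak, 120.0), 2))    # 4.0

# f0 = 1 / T0: a fundamental period of ~8.33 ms gives ~120 Hz
print(round(1 / 0.00833, 1))               # ≈ 120.0

# Jitter on three hypothetical cycle periods (in seconds)
print(round(jitter_percent([0.0083, 0.0084, 0.0083]), 2))
```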

The Melody of Speech

Every utterance carries a melody — the pitch contour created by variations in vocal fold vibration rate. This fundamental frequency (f₀) contour is one of the richest channels of linguistic information, encoding whether you are asking a question or making a statement, which word carries emphasis, and even your emotional state. Pitch tracking — extracting this f₀ contour from the acoustic signal — is one of the most important tasks in speech analysis.

Intonation Contours

Languages use characteristic pitch patterns called intonation contours. In English, declarative statements typically have a falling contour: pitch peaks on the nuclear (most stressed) syllable and falls to the baseline. Yes/no questions end with a rise. Wh-questions often fall. These patterns are language-specific — in Bengali, statements rise, and in Sicilian Italian, questions fall. This simulation lets you compare these contour types.
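The falling and rising patterns described above can be sketched as simple parametric curves. This toy model (not the simulator's implementation) uses the 120 Hz base and 4-semitone range from the preset; the exact contour shapes are illustrative assumptions.

```python
def contour(base_hz, range_semitones, n=20, shape="fall"):
    """Toy intonation contour over n pitch points.

    'fall': peak at the start (nuclear syllable), falling to the baseline,
            as in an English declarative.
    'rise': low start with an accelerating rise, as in a yes/no question.
    """
    peak = base_hz * 2 ** (range_semitones / 12)
    points = []
    for i in range(n):
        t = i / (n - 1)  # normalized time, 0..1
        if shape == "fall":
            f = peak - (peak - base_hz) * t
        else:
            f = base_hz + (peak - base_hz) * t ** 2
        points.append(round(f, 1))
    return points

print(contour(120, 4, n=5, shape="fall"))
print(contour(120, 4, n=5, shape="rise"))
```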

Measuring Pitch Perturbation

No voice is perfectly periodic. Slight cycle-to-cycle variations in the vibration period (jitter) and amplitude (shimmer) give each voice its unique character. Normal voices have jitter below 1%. Professional singers often have remarkably low jitter, while pathological voices (vocal nodules, paralysis) show elevated jitter. Tracking these perturbations is essential for clinical voice assessment and for making synthetic speech sound natural.
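Shimmer is measured exactly like jitter, but on cycle amplitudes rather than periods. A minimal sketch, using randomly perturbed amplitudes as a stand-in for measured cycle peaks:

```python
import random

def shimmer_percent(amplitudes):
    """Local shimmer: mean absolute cycle-to-cycle amplitude difference,
    as a percentage of the mean amplitude."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

random.seed(0)
# Simulated cycle amplitudes with small random perturbation around 1.0
amps = [1.0 + random.uniform(-0.02, 0.02) for _ in range(200)]
print(round(shimmer_percent(amps), 2))
```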

From Acoustics to Meaning

Pitch tracking enables a wide range of applications: clinical voice assessment detects pathology from f₀ perturbation patterns; tonal language processing requires accurate f₀ for word recognition in Mandarin or Thai; emotion recognition systems use pitch range and contour shape to classify affect; and music information retrieval uses pitch tracking to transcribe melodies. The algorithms must handle the challenges of creaky voice, voice breaks, and background noise.

FAQ

What is pitch in speech?

Pitch is the perceptual correlate of fundamental frequency (f₀) — the rate at which the vocal folds vibrate. Adult male f₀ typically ranges from 85-180 Hz, adult female from 165-255 Hz, and children from 250-400 Hz. Pitch variations encode intonation, stress, and tone.

How does pitch tracking work?

Pitch tracking algorithms estimate f₀ by detecting periodicity in the speech waveform. Methods include autocorrelation (finding the period of maximum self-similarity), cepstral analysis (finding the quefrency peak), and neural network approaches. Praat's algorithm uses autocorrelation with dynamic programming.
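The autocorrelation idea can be demonstrated in a few lines: slide the signal against itself and pick the lag with maximum self-similarity within a plausible period range. This is a naive sketch of the concept only; production trackers like Praat's add windowing, normalization, candidate ranking, and dynamic programming across frames.

```python
import math

def estimate_f0(signal, sr, fmin=75.0, fmax=500.0):
    """Estimate f0 by brute-force autocorrelation: the lag (in samples)
    that maximizes self-similarity corresponds to the fundamental period."""
    lo = int(sr / fmax)   # shortest candidate period in samples
    hi = int(sr / fmin)   # longest candidate period in samples
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        r = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag

# Sanity check on a pure 120 Hz tone (0.1 s at 8 kHz)
sr = 8000
tone = [math.sin(2 * math.pi * 120 * n / sr) for n in range(800)]
print(round(estimate_f0(tone, sr), 1))   # close to 120 Hz
```

Real speech is harder than this pure tone: octave errors (picking a lag at twice the true period), voiceless regions, and creak all require the extra machinery that practical algorithms layer on top.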

What is the difference between intonation and tone?

Intonation refers to pitch patterns at the phrase level that convey pragmatic meaning (questions vs. statements) in all languages. Tone refers to pitch patterns at the word/syllable level that change lexical meaning, as in Mandarin, Thai, and Yoruba.

What does jitter tell us about voice quality?

Jitter is the cycle-to-cycle variation in f₀ period. Normal jitter is below 1%. Values of 1-3% may indicate mild voice strain or aging. Above 3% suggests pathological voice conditions. Jitter, along with shimmer (amplitude variation), is a standard clinical voice assessment metric.

Embed

<iframe src="https://homo-deus.com/lab/speech-science/pitch-tracking/embed" width="100%" height="400" frameborder="0"></iframe>