Articulatory Speech Synthesis Simulator: Shape the Vocal Tract

Simulator · Intermediate · ~10 min

Default settings approximate a neutral vocal tract producing schwa /ə/ — the most common vowel in English. All articulators are near their rest position, producing mid-range formant values.

Formula

F1 ≈ k₁ × (jaw_opening + (1 - tongue_height))
F2 ≈ k₂ × (tongue_frontness) - k₃ × lip_rounding
F_n = (2n-1) × c / (4L_eff) for uniform tube approximation
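The uniform-tube formula above can be checked directly. The sketch below uses illustrative constants (c = 35,000 cm/s for the speed of sound in warm, moist air; L_eff = 17 cm for an adult male tract, as mentioned later in the text):

```python
def tube_formants(length_cm: float, n_formants: int = 3, c: float = 35000.0):
    """Quarter-wavelength resonances F_n = (2n-1)*c/(4*L) of a uniform tube
    closed at the glottis and open at the lips, in Hz."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, n_formants + 1)]

print(tube_formants(17.0))  # ≈ [514.7, 1544.1, 2573.5] Hz
```

Note the odd-harmonic spacing: F2 and F3 of the uniform tube sit at exactly 3x and 5x F1, which is close to the measured formants of schwa.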

From Articulation to Sound

Every vowel you speak is the product of a specific vocal tract shape. By raising or lowering the tongue, pushing it forward or back, opening or closing the jaw, and rounding or spreading the lips, you reshape a tube roughly 17 cm long into an acoustic filter that selectively amplifies certain frequencies. This simulation lets you control these articulatory parameters and see how they map to formant frequencies and vowel identity in real time.

The Source-Filter Model

Speech production follows the source-filter model proposed by Gunnar Fant in 1960. The source is the quasi-periodic buzzing of the vocal folds (glottal source), producing a harmonic series. The filter is the vocal tract, which amplifies frequencies near its resonances (formants) and attenuates others. By separating source and filter, we can independently control pitch (source) and vowel quality (filter) — just as the human system does.
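The separation of source and filter can be sketched in a few lines: an impulse train supplies the pitch, and a cascade of two-pole resonators supplies the formants. This is a minimal, Klatt-style toy, not the simulation's actual signal path; all frequencies and bandwidths below are illustrative:

```python
import math

def impulse_train(f0: float, fs: float, n: int):
    """Crude glottal source: a unit impulse every fs/f0 samples."""
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, f_c: float, bw: float, fs: float):
    """One formant: two-pole resonator y[n] = b*x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bw / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * f_c / fs)
    a2 = -r * r
    b = 1.0 - a1 - a2
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = b * s + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

fs = 16000.0
source = impulse_train(120.0, fs, 1600)  # the source fixes pitch (120 Hz here)
voiced = resonator(resonator(source, 500.0, 80.0, fs), 1500.0, 90.0, fs)  # the filter fixes vowel quality
```

Changing `f0` alters pitch without touching vowel identity; changing the resonator frequencies alters the vowel without touching pitch, which is exactly the independence the source-filter model provides.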

Articulatory-Acoustic Mappings

The relationship between articulation and acoustics is nonlinear and many-to-one: different tract shapes can sometimes produce similar formant patterns (motor equivalence). However, the primary mappings are well established. Jaw opening and tongue lowering raise F1. Tongue fronting raises F2. Lip rounding lowers F2 and F3. The simulation computes these mappings using a simplified tube model, letting you discover the acoustic consequences of each articulatory gesture.
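These mappings can be written out using the formulas from the Formula section. The gains below are invented for illustration and merely chosen so that the neutral setting (every parameter at 0.5, no rounding) lands near schwa's typical F1 ≈ 500 Hz and F2 ≈ 1500 Hz:

```python
# Assumed gains (k1, k2, k3 in the Formula section); not the simulation's values.
K1, K2, K3 = 500.0, 3000.0, 600.0

def formants(jaw: float, height: float, frontness: float, rounding: float):
    """Map normalized (0..1) articulatory parameters to toy (F1, F2) in Hz."""
    f1 = K1 * (jaw + (1.0 - height))          # jaw opening and tongue lowering raise F1
    f2 = K2 * frontness - K3 * rounding       # fronting raises F2, rounding lowers it
    return f1, f2

schwa  = formants(0.5, 0.5, 0.5, 0.0)   # (500.0, 1500.0)
i_like = formants(0.2, 0.9, 0.9, 0.0)   # high front: low F1, high F2
u_like = formants(0.2, 0.9, 0.25, 1.0)  # high back rounded: low F1, low F2
```

Only the orderings matter in this sketch: the /i/-like configuration has a lower F1 and higher F2 than schwa, and rounding pulls the /u/-like configuration's F2 well below both.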

Building a Vocal Tract

Articulatory synthesis aims to generate speech by modeling the vocal tract as a series of concatenated tubes with varying cross-sectional areas. Advanced models simulate airflow, tissue compliance, and radiation from the lips. While modern text-to-speech systems primarily use neural networks, articulatory models remain essential for understanding speech motor control, simulating disorders, and teaching phonetics — because they reveal the causal chain from gesture to sound.
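The first step of such a concatenated-tube (Kelly-Lochbaum-style) model is simple enough to show: at each junction between adjacent tube sections, the reflection coefficient follows from the two cross-sectional areas. The area values below are made up for illustration:

```python
def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) at each junction
    between adjacent tube sections."""
    return [(a2 - a1) / (a2 + a1) for a1, a2 in zip(areas, areas[1:])]

uniform = [3.0] * 8  # cm^2; a uniform tract reflects nothing internally
print(reflection_coefficients(uniform))  # all zeros

constricted = [3.0, 3.0, 0.5, 0.5, 3.0, 3.0]  # a mid-tract constriction
print(reflection_coefficients(constricted))
```

A uniform area function yields zero internal reflections, which is why it behaves as the single quarter-wavelength resonator of the Formula section; a constriction introduces reflections that shift the formants away from that neutral pattern.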

FAQ

What is articulatory synthesis?

Articulatory synthesis generates speech by simulating the physics of the vocal tract. Instead of concatenating recorded speech segments, it models the tube-like resonances created by tongue, jaw, and lip positions. This produces highly flexible and natural output but requires accurate acoustic-articulatory models.

How does tongue position affect formants?

Tongue height primarily controls F1 — high tongue = low F1, low tongue = high F1. Tongue frontness primarily controls F2 — front tongue = high F2, back tongue = low F2. These relationships, established by Fant (1960), are the foundation of acoustic phonetics.

What role does lip rounding play?

Lip rounding extends the effective length of the vocal tract by several centimeters, lowering all formant frequencies. It has the strongest effect on F2 and F3. This is why rounded vowels (/u, o, y/) have lower F2 than their unrounded counterparts (/ɯ, ɤ, i/).
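The length effect follows directly from the quarter-wavelength formula: extending the effective tract lowers every resonance by the same ratio. A toy check, again with illustrative constants (c = 35,000 cm/s, a 2 cm extension from rounding):

```python
c = 35000.0  # speed of sound in cm/s (illustrative)

def fn(length_cm: float, n: int) -> float:
    """n-th quarter-wavelength resonance of a uniform tube, in Hz."""
    return (2 * n - 1) * c / (4 * length_cm)

# Ratio of unrounded (17 cm) to rounded (19 cm) formants: 19/17 ≈ 1.118 for each.
ratios = [fn(17.0, n) / fn(19.0, n) for n in (1, 2, 3)]
```

In this idealized uniform tube every formant drops by the same factor; in a real tract the protrusion and constriction at the lips affect F2 and F3 disproportionately, as the answer above notes.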

Can articulatory synthesis sound natural?

Modern articulatory synthesizers like VocalTractLab and DIVA produce intelligible and increasingly natural speech. The challenge is controlling the many degrees of freedom smoothly during connected speech. Articulatory synthesis is particularly valuable for research on speech disorders and second language learning.

Embed

<iframe src="https://homo-deus.com/lab/speech-science/speech-synthesis/embed" width="100%" height="400" frameborder="0"></iframe>
View source on GitHub