The Timing of Voice
When you say 'pat' versus 'bat', the primary acoustic difference is not loudness or pitch; it is timing. Voice onset time (VOT) is the interval between the release of a stop consonant (the burst of air when the closure opens, here at the lips) and the moment the vocal folds begin vibrating. This difference, measured in milliseconds, is a primary cue your brain uses to distinguish voiced from voiceless stops.
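The definition above reduces to a single subtraction. The following is a minimal sketch, assuming the burst and the onset of voicing have already been located by hand in a recording; the function name `vot_ms` and the example times are hypothetical, chosen only for illustration.

```python
def vot_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds: voicing onset time minus burst time.

    Both arguments are times in seconds from the start of the recording.
    A positive result means voicing starts after the burst (lag);
    a negative result means voicing starts before the burst (lead).
    """
    return (voicing_onset_s - burst_time_s) * 1000.0


# Hypothetical hand-labelled times, not measurements from real speech.
aspirated = vot_ms(burst_time_s=0.100, voicing_onset_s=0.160)  # voicing lags the burst
prevoiced = vot_ms(burst_time_s=0.100, voicing_onset_s=0.020)  # voicing leads the burst

print(f"aspirated: {aspirated:.1f} ms, prevoiced: {prevoiced:.1f} ms")
```

The sign convention matters: a stop whose voicing begins during the closure gets a negative VOT, which is why one number can describe both voicing lead and voicing lag.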
Three Voicing Categories
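Across the world's languages, VOT defines three broad categories. Prevoiced stops (VOT < 0 ms) have voicing that begins during the closure, before the burst; this is typical of /b, d, g/ in Romance languages. Short-lag stops (roughly 0 to 30 ms) have voicing that starts at or shortly after the release; English 'voiced' stops in word-initial position usually fall here. Long-lag stops (typically above 50 ms) have a clear aspiration gap between burst and voicing, as in English 'voiceless' stops /p, t, k/ in word-initial position. These ranges are approximate: values between about 30 and 50 ms fall between the typical short-lag and long-lag ranges, and the exact boundaries vary with language and place of articulation.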
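The three categories can be sketched as a simple threshold classifier. This is an illustrative sketch only: the function name `classify_vot`, its parameter names, and the "intermediate" label for values between the short-lag and long-lag ranges are assumptions, and the default cutoffs are just the approximate figures quoted in the text, not universal constants.

```python
def classify_vot(vot_ms: float,
                 short_lag_max_ms: float = 30.0,
                 long_lag_min_ms: float = 50.0) -> str:
    """Assign a VOT value (in ms) to a broad voicing category.

    The cutoffs are the approximate figures from the text and are
    exposed as parameters because real boundaries vary by language.
    """
    if vot_ms < 0:
        return "prevoiced"       # voicing starts before the burst
    if vot_ms <= short_lag_max_ms:
        return "short-lag"       # voicing starts at or just after the burst
    if vot_ms > long_lag_min_ms:
        return "long-lag"        # clear aspiration interval
    return "intermediate"        # between the two typical ranges (assumed label)


# Hypothetical example values, not measurements.
for vot in (-80.0, 10.0, 40.0, 70.0):
    print(f"{vot:6.1f} ms -> {classify_vot(vot)}")
```

Keeping the cutoffs as parameters makes it easy to try language-specific values instead of treating the defaults as fixed.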
Categorical Perception
One of the most remarkable findings in speech science is that VOT is perceived categorically, not continuously. If you create a synthetic continuum from 0 ms to 80 ms VOT, listeners do not hear a gradual change; they hear a sharp switch from /b/ to /p/ at around 25 to 30 ms. This boundary is not fixed: it shifts with speaking rate, phonetic context, and even the listener's native language, which shows that perception depends on linguistic experience as well as on the acoustic signal.
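A common way to picture this sharp switch is a steep S-shaped identification curve. The sketch below uses a logistic function as a stand-in model; the function name `p_voiceless`, the 27.5 ms boundary (the midpoint of the range quoted above), and the 2 ms slope are all assumptions chosen for illustration, not values fitted to listener data.

```python
import math


def p_voiceless(vot_ms: float,
                boundary_ms: float = 27.5,
                slope_ms: float = 2.0) -> float:
    """Modelled probability that a listener labels the stimulus as /p/.

    boundary_ms is the VOT at which /b/ and /p/ responses are equally
    likely; a smaller slope_ms gives a sharper category switch.
    Both defaults are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-(vot_ms - boundary_ms) / slope_ms))


# Step through a synthetic continuum from 0 ms to 80 ms in 10 ms steps.
for vot in range(0, 81, 10):
    print(f"{vot:2d} ms -> P(/p/) = {p_voiceless(vot):.3f}")

# A boundary shift (for example with speaking rate) is modelled here
# simply by passing a different boundary_ms.
print(f"30 ms with a 35 ms boundary: {p_voiceless(30, boundary_ms=35.0):.3f}")
```

In this model the response probability stays close to 0 or 1 over most of the continuum and changes rapidly only near the boundary, which is the signature of categorical perception described above.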
Clinical and Forensic Applications
VOT measurement is clinically important: children with speech disorders often show abnormal VOT distributions, and bilingual speakers show VOT patterns influenced by both languages. In forensic phonetics, VOT patterns help identify speakers and their language backgrounds. Speech synthesis systems must also generate appropriate VOT values to sound natural: if the VOT is too short or too long, the consonant sounds foreign or robotic.