An Introduction to Speech Acoustics | Modeling the Human Vocal Tract

The study of speech acoustics has been a growing and evolving field of research for many years. Imaging the vocal tract to study speech production has progressed from X-ray videos of human subjects to MRI scans and computer simulations. These advances have not only improved the precision and accuracy of our analyses of speech production but have also made data collection significantly safer and more comfortable for human subjects. However, our knowledge of speech production is still limited by the fact that the vocal tract is largely hidden from the naked eye. Modern medical technology lets doctors examine a wide variety of internal organs and cavities using very small cameras attached to tubes or probes. Unfortunately, this advance does little for speech production research: a long tube inserted through the mouth and down the throat interferes with the very articulation being studied, so no accurate data can be gathered this way. Modeling the human vocal tract is therefore both essential to our understanding of how we speak and very difficult.

Speech is produced by forcing air from the lungs up through the trachea and into the vocal tract. For some speech sounds, such as vowels, this airflow causes the vocal folds to vibrate, producing the sound waves that the rest of the vocal tract then shapes into speech.

It is the shape of the vocal tract between the glottis (the opening between the vocal folds) and the lips that determines which speech sound is produced. The vocal tract constricts and expands at crucial places, changing the resonant frequencies associated with a given sound. Each phoneme, a distinct unit of speech defined for a particular language, has characteristic resonant frequencies known as formant frequencies. Moving the articulators (e.g., lips, tongue, teeth, throat) changes these frequencies and results in different sounds.
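As a rough illustration of how the tract's geometry sets its formant frequencies, the simplest textbook model treats the vocal tract as a lossless tube of uniform cross-section, closed at the glottis and open at the lips. Such a tube resonates at odd multiples of a quarter wavelength, F_k = (2k - 1)c / (4L), where c is the speed of sound and L is the tube length. The Python sketch below is a minimal example under those assumptions; the function name `uniform_tube_formants`, the speed of sound of about 35,000 cm/s, and the 17.5 cm tract length are illustrative choices, not values taken from this article.

```python
# Minimal sketch (assumption): the vocal tract idealized as a lossless,
# uniform tube closed at the glottis and open at the lips.

# Approximate speed of sound in warm, humid air, in cm/s (assumed value).
SPEED_OF_SOUND_CM_PER_S = 35000.0


def uniform_tube_formants(length_cm: float, n_formants: int = 3,
                          c: float = SPEED_OF_SOUND_CM_PER_S) -> list[float]:
    """Return the first n resonances (Hz) of a closed-open uniform tube.

    Hypothetical helper for illustration: F_k = (2k - 1) * c / (4 * L).
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, n_formants + 1)]


if __name__ == "__main__":
    # 17.5 cm is a commonly used rough figure for an adult male tract (assumption).
    for k, f in enumerate(uniform_tube_formants(17.5), start=1):
        print(f"F{k} = {f:.0f} Hz")
```

Running it prints F1 = 500 Hz, F2 = 1500 Hz, and F3 = 2500 Hz, values close to those often cited for a neutral, schwa-like vowel. Because every formant in this model scales with 1/L, lengthening the tube (for example, by rounding and protruding the lips) lowers all of them at once; the local constrictions and expansions described above shift individual formants in different directions, which a single uniform tube cannot capture.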

Anatomy Picture 2: Anatomical diagrams of the human vocal tract, from Acoustic Phonetics by Kenneth N. Stevens