Kaushik Varadharajan

Reversing the Pink Trombone

Pink Trombone is a common demonstration tool I first used in my intro to linguistics course. It lets you drag around a cross-section of a cartoon mouth to produce sounds in real time, as opposed to the linguistics student technique of making weird noises with your mouth in public spaces. When it’s reimplemented in an automatic differentiation framework, you can use Pink Trombone’s structure to reconstruct vocal tract shapes from a recording.

The vocal tract model

The source-filter model treats speech production as two stages: the glottis (vocal folds) generates a source signal, and the vocal tract filters that signal into speech sounds. The filter sets the formants, the resonant frequencies of the vocal tract. Phonetics tells us that the first two formants ($F_1$, $F_2$) largely determine which vowel the ears and brain hear.

The source waveform is modeled using the Liljencrants-Fant (LF) model. Each glottal cycle has two phases: an open phase, during which the folds part and airflow accelerates then decelerates along an exponentially scaled sinusoid, and a return phase, during which the folds snap shut and airflow decays exponentially. The shape parameters of these phases control whether the voice sounds breathy or pressed.
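For reference, here is the two-phase shape written out, in the notation I recall from the standard LF formulation rather than from any particular implementation. Note that the standard model specifies the derivative of glottal flow, $E(t)$, not the flow itself. All symbols below are assumptions introduced for this sketch: $E_0$ is a scale factor, $\alpha$ the growth rate, $\omega_g$ the angular frequency of the sinusoid, $t_e$ the instant of maximum excitation with $E_e = -E(t_e)$, $t_a$ the effective duration of the return phase, $\epsilon$ its decay rate, and $t_c$ the instant of full closure.

$$
E(t) =
\begin{cases}
E_0 \, e^{\alpha t} \sin(\omega_g t), & 0 \le t \le t_e \\
-\dfrac{E_e}{\epsilon \, t_a} \left[ e^{-\epsilon (t - t_e)} - e^{-\epsilon (t_c - t_e)} \right], & t_e < t \le t_c
\end{cases}
$$

Roughly speaking, a longer return phase (larger $t_a$) softens the closure and pushes the voice toward breathy, while a short one gives an abrupt closure and a more pressed quality.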

The Kelly-Lochbaum model then represents the vocal tract as a sequence of cylindrical tube segments. Their diameters are calculated from the tongue’s position and the constriction of the throat and lips. At each junction between segments, pressure waves are partly reflected and partly transmitted according to the reflection coefficient

$$r_i = \frac{A_i - A_{i+1}}{A_i + A_{i+1}}$$

where $A_i$ is the cross-sectional area of segment $i$. This is a discrete approximation of the acoustic wave equation in a tube: it tracks right-traveling and left-traveling waves in each segment and updates both directions at each sample. The glottis injects source energy at one end, and audio is read out at the lips. The model acts as a filter on the glottal signal, but it never explicitly constructs that filter or computes its properties; it just generates digital audio sample by sample.
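The sample-by-sample update can be sketched in a few lines of plain Python. This is a minimal illustration, not Pink Trombone's actual code: the function names, the end reflection values (`glottal_refl`, `lip_refl`), and the sign convention of the scattering step are all assumptions made for this example, and losses are ignored.

```python
def reflection_coefficients(areas):
    """r_i = (A_i - A_{i+1}) / (A_i + A_{i+1}) for each interior junction."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def kl_step(right, left, refl, glottal_in, glottal_refl=0.75, lip_refl=-0.85):
    """Advance the traveling waves by one sample.

    right[i] / left[i] hold the right- and left-going wave in segment i.
    The end reflection values are illustrative assumptions.
    Returns (new_right, new_left, lip_output).
    """
    n = len(right)
    new_right = [0.0] * n
    new_left = [0.0] * n

    # Audio is read out from the right-going wave in the last segment.
    lip_out = right[-1]

    # Glottis end: inject the source and reflect the returning wave.
    new_right[0] = glottal_in + glottal_refl * left[0]
    # Lip end: part of the outgoing wave reflects back into the tract.
    new_left[-1] = lip_refl * right[-1]

    # Interior junctions: scatter the two incoming waves (one multiply each).
    for i, r in enumerate(refl):
        w = r * (right[i] + left[i + 1])
        new_right[i + 1] = right[i] - w
        new_left[i] = left[i + 1] + w

    return new_right, new_left, lip_out


def simulate(areas, source):
    """Run the tube on a source signal and collect the lip output."""
    refl = reflection_coefficients(areas)
    right = [0.0] * len(areas)
    left = [0.0] * len(areas)
    out = []
    for sample in source:
        right, left, y = kl_step(right, left, refl, sample)
        out.append(y)
    return out


# An impulse through a uniform 4-segment tube: no junction reflects anything,
# so the impulse simply travels one segment per sample until it reaches the lips.
print(simulate([1.0] * 4, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Nothing here ever computes a frequency response. The filtering emerges from waves bouncing between junctions, which is why the resonances change as soon as the areas do.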

Inversion as gradient descent

VocalTrax and vocal-tract-grad frame inversion as optimization. Pink Trombone’s entire pipeline can be rewritten in JAX or PyTorch so that it is differentiable, which lets standard gradient descent methods adjust the articulatory parameters to minimize a spectral reconstruction loss.
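To make the idea concrete, here is a deliberately tiny stand-in, assuming a "synthesizer" that is just one second-order resonator with a single parameter (its center frequency). It is not the VocalTrax pipeline: every name and constant is hypothetical, a central finite difference stands in for the gradient that JAX or PyTorch would compute automatically, and a simple step-halving rule keeps the loss from increasing.

```python
import math

FREQS_KHZ = [0.05 * k for k in range(1, 81)]  # analysis bins, 0.05 to 4.0 kHz
Q = 3.0  # resonance sharpness, held fixed (hypothetical value)


def log_magnitude(freq, formant):
    """Log magnitude response of a second-order resonator centered at `formant`."""
    ratio = freq / formant
    return -0.5 * math.log((1.0 - ratio ** 2) ** 2 + (ratio / Q) ** 2)


def spectral_loss(formant, target_spectrum):
    """Mean squared difference between model and target log spectra."""
    errs = [(log_magnitude(f, formant) - t) ** 2
            for f, t in zip(FREQS_KHZ, target_spectrum)]
    return sum(errs) / len(errs)


def gradient(formant, target_spectrum, eps=1e-5):
    """Central finite difference; an autodiff framework would supply this."""
    up = spectral_loss(formant + eps, target_spectrum)
    down = spectral_loss(formant - eps, target_spectrum)
    return (up - down) / (2.0 * eps)


def fit_formant(target_spectrum, init=0.5, steps=200, lr=0.05):
    """Gradient descent on the formant, halving the step until the loss drops."""
    x = init
    loss = spectral_loss(x, target_spectrum)
    for _ in range(steps):
        g = gradient(x, target_spectrum)
        step = lr
        while step > 1e-8:
            candidate = x - step * g
            if candidate > 0.0:  # the formant must stay positive
                candidate_loss = spectral_loss(candidate, target_spectrum)
                if candidate_loss < loss:
                    x, loss = candidate, candidate_loss
                    break
            step *= 0.5
    return x, loss


# "Recording": the spectrum of a resonator at 0.7 kHz, fitted from a 0.5 kHz guess.
target = [log_magnitude(f, 0.7) for f in FREQS_KHZ]
estimate, final_loss = fit_formant(target, init=0.5)
print(estimate, final_loss)
```

The real problem has the same structure with far more parameters: the forward model is the full glottis-plus-tube simulation, and the loss compares spectrograms rather than a single log spectrum.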

A target audio clip is divided into frames, each with a set of articulatory parameters to optimize. In theory, vocal pitch is a parameter of the glottal waveform that could be optimized, but in practice tools like CREPE are used to extract pitch, which is then held constant while optimizing the other parameters.
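A minimal sketch of that setup, assuming the pitch track has already been extracted by some external tool and is passed in as a plain list. The function names, the dictionary layout, and the count of four articulators are hypothetical choices for illustration.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def init_frame_params(frames, f0_track, n_articulators=4):
    """One parameter set per frame: free articulators plus a fixed pitch."""
    return [
        {
            "articulators": [0.5] * n_articulators,  # updated by the optimizer
            "f0_hz": f0,                             # from the pitch tracker
            "f0_trainable": False,                   # held constant during fitting
        }
        for _, f0 in zip(frames, f0_track)
    ]


frames = frame_signal(list(range(10)), frame_len=4, hop=2)
params = init_frame_params(frames, f0_track=[110.0] * len(frames))
print(len(frames), params[0])
```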

One issue: the optimization has no sense of time. Each frame’s articulation is optimized independently, so adjacent frames can end up with completely different mouth positions even when the audio is smooth. VocalTrax handles this by periodically smoothing the parameters outside the gradient descent loop.
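One way such a smoothing pass could look is a moving average over neighboring frames, applied to each parameter track between rounds of optimization. The choice of a moving average, the window radius, and the function name are assumptions for this sketch; the article does not specify which smoother VocalTrax uses.

```python
def smooth_tracks(frames, radius=1):
    """Moving average across time for each parameter.

    frames: list of per-frame parameter lists, all the same length.
    Windows are truncated at the start and end of the clip.
    """
    n = len(frames)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        window = frames[lo:hi]
        # zip(*window) groups the values of each parameter across the window.
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed


# A single-parameter track with a one-frame spike gets flattened out.
print(smooth_tracks([[0.0], [3.0], [0.0]]))
```

Because the pass runs outside the gradient loop, it acts as a projection back toward smooth trajectories rather than as a term in the loss.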

Extensions

We made several extensions to VocalTrax for CS 448 Audio Computing Lab: a nasal tract, turbulence noise at constrictions for fricatives, learnable spectral tilt and aspiration on the glottal source, radiation impedance at the lips, and position-dependent wall losses. Evaluation was the hard part: spectral reconstruction quality, articulatory accuracy, and perceptual quality are distinct measures that don’t move together.