3. Voice transformation

Voice transformation research

Friday, June 6th, 2008

MISTIC: Music Intelligence and Sound Technology, UVic
karl [insert at sign] karlnordstrom.ca


My PhD research started in collaboration with IVL Technologies and TC-Helicon, two voice processing companies in Victoria. TC-Helicon produces vocal effects products for the music industry with a focus on pitch correction and automatic harmony creation. IVL produces handheld karaoke products that apply a variety of vocal effects while plugged into a TV. In recent years, vocal effects have become more common, including chorus-based effects and even distortion.

The goal of the research was to enhance a digital effect that adds noise to a voice to simulate breathiness. If a voice already sounds breathy, it is easy to add noise to increase the perception of breathiness. However, it is difficult to add noise to voices that exhibit high effort (i.e., when the voice sounds strained). The added noise does not blend into the voice and instead sounds like a separate stream of noise. In my research, I developed a technique to adaptively manipulate the perception of vocal effort. This enabled the added noise to blend more effectively, thereby improving the breath effect.

An important outcome of this research is that it highlights how Linear Prediction (LP), a common voice modeling technique, does not appropriately model the voice. I presented Adaptive Pre-emphasis Linear Prediction (APLP) as a technique to appropriately compensate for variations in vocal effort.

Academic and Industry Collaborators


APLP for phrases

Monday, June 2nd, 2008

I have now implemented Adaptive Pre-emphasis Linear Prediction (APLP) for phrases. The goal of this implementation of APLP is to transform high-effort voices into breathy voices.

Original voice:
I started with this sound sample: wav. With this kind of voice, it is difficult to add breathiness in a way that sounds natural.

Breath effect with constant pre-emphasis linear prediction (LP):
Constant pre-emphasis LP was carried out, noise was added to the residual, and the voice was resynthesized: wav. This is the technique used in current voice processors to apply a breath effect.
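
As a rough illustration of the steps just described (fixed pre-emphasis, LP analysis, noise added to the residual, resynthesis), here is a minimal single-frame sketch in Python with NumPy. It is not the code behind the sound samples or any commercial processor. The function names, the pre-emphasis coefficient `mu=0.95`, the LP order of 12, and the way the noise is scaled are all assumptions chosen for illustration.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap], the coefficients of A(z)."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += (1e-9 * r[0] + 1e-12) * np.eye(order)  # small regularization for stability
    return np.concatenate(([1.0], -np.linalg.solve(R, r[1:])))

def fir(b, x):
    """FIR filter with zero initial state: y[n] = sum_k b[k] * x[n-k]."""
    return np.convolve(x, b)[:len(x)]

def all_pole(a, x):
    """All-pole filter 1/A(z) with zero initial state (a[0] must be 1)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def breath_effect_constant_lp(frame, order=12, mu=0.95, noise_level=0.5, seed=0):
    """Add noise to the LP residual of one frame, using a fixed pre-emphasis."""
    rng = np.random.default_rng(seed)
    pre = np.array([1.0, -mu])                       # constant pre-emphasis 1 - mu*z^-1
    emphasized = fir(pre, frame)
    a = lpc(emphasized * np.hanning(len(frame)), order)
    residual = fir(a, emphasized)                    # inverse filtering
    noise = rng.standard_normal(len(frame)) * noise_level * np.std(residual)
    breathy = all_pole(a, residual + noise)          # resynthesis through 1/A(z)
    return all_pole(pre, breathy)                    # undo the pre-emphasis
```

With `noise_level=0.0` the function returns the input frame unchanged, which is a quick way to check that analysis and resynthesis are inverses of each other. A real implementation would process overlapping frames and shape the noise; this sketch leaves both out.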

Breath effect with APLP:
APLP can be used to transform the spectral envelope of the voice to match that of a typical breathy voice. This reduces the perceived vocal effort and improves the blending of noise into the voice: wav.

Breath effect with APLP and breath modulation:
The amount of breathiness in a voice varies with the amount of effort. As such, it makes sense to add less breathiness during times of excessive effort. The APLP algorithm was further improved by modulating the quantity of added noise according to the quantity of vocal effort: wav.
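
The idea of adding less noise when effort is high can be sketched as a simple gain rule. The effort measure below is a stand-in and not the one used in APLP: it treats a frame with more high-frequency energy (a lower lag-1 normalized autocorrelation) as higher effort. The names `effort_proxy` and `noise_gain`, the linear mapping, and `max_gain=0.5` are hypothetical.

```python
import numpy as np

def effort_proxy(frame):
    """Hypothetical effort measure in [0, 1] from the lag-1 normalized autocorrelation.

    Values near 0 mean energy is concentrated at low frequencies (low effort);
    values near 1 mean a flatter or brighter spectrum (high effort).
    """
    r0 = np.dot(frame, frame)
    if r0 == 0.0:
        return 0.0                          # silent frame: treat as low effort
    r1 = np.dot(frame[:-1], frame[1:])
    return float(np.clip(1.0 - r1 / r0, 0.0, 1.0))

def noise_gain(effort, max_gain=0.5):
    """Scale the added noise down linearly as the effort measure goes up."""
    return max_gain * (1.0 - effort)

# Example: a dull frame receives more added noise than a bright one.
fs = 16000
t = np.arange(512) / fs
dull = np.sin(2 * np.pi * 100 * t)
bright = np.sin(2 * np.pi * 3000 * t)
gain_dull = noise_gain(effort_proxy(dull))
gain_bright = noise_gain(effort_proxy(bright))
```

Any smoother mapping from effort to gain would serve equally well here; the linear rule is only the simplest choice.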

Now listen to the original voice again.

With the breath effect, some voices work better than others. Here is how the breath effect sounds on another voice:

Original voice: wav.
Breath effect with constant pre-emphasis LP: wav.
Breath effect with APLP: wav.
Breath effect with APLP and breath modulation: wav.

Technical Note:
Previous iterations of the algorithm included glottal closure detection to improve the blending of noise into the voice. Glottal closure detection was eliminated from this iteration of the algorithm. This ensures that the sound samples are representative of what can be achieved on real-time voice signals in a musical context.


Thursday, May 15th, 2008


Wednesday, May 14th, 2008


Journal paper

Friday, April 4th, 2008

K. I. Nordstrom, G. Tzanetakis and P. F. Driessen, “Transforming Perceived Vocal Effort and Breathiness Using Adaptive Pre-Emphasis Linear Prediction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 6, pp. 1087–1096, August 2008.

Why did I choose this research topic?

Tuesday, April 1st, 2008

People often ask me why I chose my particular research topic. Why work on transforming high-effort voices into breathy voices? I typically respond by describing some of the products for voice transformation created by TC-Helicon, IVL Audio (now both competitors), and 3dB Research. I talk about how a breathy effect can be used in musical situations to enhance the voice. That said, there are some other motivations. I’m curious about the physiology and acoustics of the voice. I’m curious about how the acoustic signal from the voice can be manipulated to simulate physiological changes in the voice. I sometimes talk about these motivations. However, there is one motivation that I rarely describe, and that is aesthetics.

In manipulating the voice, I am attentive to the particular sound texture of the voice (also known as voice quality). I want to find ways to manipulate that texture without creating something that sounds unnatural. This is very subjective. The voice is a complex instrument and it can create a diversity of sound textures. For example, voice quality has been described with more than thirty different terms as various researchers have attempted to better understand the voice. In my research, I am attempting to translate the subjective perceptions of breathiness and vocal effort into an engineering context where they can be quantified and manipulated. However, my deeper interest is in creating a pleasing sound that matches a subjective ideal.

This search for a particular sound texture in the voice parallels my search for what guitarists call “tone” in playing the guitar. It’s an aesthetic judgment about how good or bad the instrument sounds. It’s possible to analyze this “tone” on a spectrogram and artificially mimic the “tone” of a great guitarist, but ultimately the judgment about “tone” is an artistic judgment made by the listener. In theory, it may be possible to quantify “tone” but the point of creating a “tone” is not to be quantified and controlled. The point in the guitarist having “tone” is to create something aesthetically pleasing. It’s the same with my interest in the sound texture of the voice. I want to create voices that sound aesthetically pleasing.

While I find myself primarily in an engineering context, my motivation is to create sounds that are artistically appreciated. However, it’s not easy to do large transformations that sound natural. The complexity of the voice makes transformation difficult. After all, I’m working with an instrument that has no fixed dimensions. It is easy to make small changes, but most large changes sound bad. The techniques that I have developed are more effective than the prior technology, but there is still a long way to go before large voice transformations sound good. As a result, the artistic side of me is often disappointed. Yes, I find the work interesting, but there is still so far to go.

Transforming perceived vocal effort and breathiness using adaptive pre-emphasis linear prediction

Tuesday, January 8th, 2008

I carried out some listening experiments to evaluate the voice transformation algorithm. The results of these experiments have not yet been published, but the algorithm is described in this paper that was presented at DAFX. The goal is to transform a high-effort voice into a target breathy voice.

Here is the original high-effort voice.

I want to transform that high-effort voice into this target breathy voice.

If I attempt to transform the high-effort voice by adding artificial aspiration noise while using constant pre-emphasis linear prediction (LP), the transformed voice sounds like this. The voice exhibits more aspiration noise but still retains the perception of high vocal effort.

A breathy effect from a commercial voice processor was also used to add breathiness to a high-effort voice. This LP-based effect uses some additional filtering to shape the voice spectrum and to shape the added aspiration noise. However, the core of the algorithm operates as a constant pre-emphasis LP algorithm.

The presence or absence of aspiration noise is only one of the differences between high-effort and breathy voices. When a voice changes between high effort and breathiness, the spectral envelope of the voice also changes. High-effort voices contain more high-frequency content than the corresponding breathy voices. Constant pre-emphasis LP does not modify the spectral envelope of the voice.

Adaptive pre-emphasis linear prediction (APLP) can be used to change the perceived vocal effort by modifying the spectral envelope of the voice to match the spectral envelope of the breathy voice. After the spectral envelope has been re-shaped, artificial aspiration noise is added. The results vary slightly depending on the order of the spectral emphasis filter that is used to shape the spectral envelope: first-order spectral emphasis filter or third-order spectral emphasis filter.
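
The envelope-matching step can be illustrated with a heavily simplified sketch: estimate a low-order spectral emphasis filter from the high-effort frame and another from the breathy target, remove the first from the source, and impose the second. This is only an approximation of the idea under stated assumptions, not the published APLP algorithm. The function name `swap_emphasis`, the use of low-order autocorrelation LPC for both filters, and single-frame processing are assumptions; noise addition is left out.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap], the coefficients of A(z)."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += (1e-9 * r[0] + 1e-12) * np.eye(order)  # small regularization for stability
    return np.concatenate(([1.0], -np.linalg.solve(R, r[1:])))

def fir(b, x):
    """FIR filter with zero initial state: y[n] = sum_k b[k] * x[n-k]."""
    return np.convolve(x, b)[:len(x)]

def all_pole(a, x):
    """All-pole filter 1/A(z) with zero initial state (a[0] must be 1)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def swap_emphasis(source, target, order=1):
    """Replace the gross spectral shape of `source` with that of `target`.

    `order` is the order of the spectral emphasis filter (for example 1 or 3).
    """
    e_src = lpc(source * np.hanning(len(source)), order)  # source emphasis estimate
    e_tgt = lpc(target * np.hanning(len(target)), order)  # target emphasis estimate
    flattened = fir(e_src, source)       # remove the source's tilt
    return all_pole(e_tgt, flattened)    # impose the target's tilt
```

For example, `swap_emphasis(high_effort_frame, breathy_frame, order=3)` would use a third-order emphasis filter, where both frame variables are placeholders for audio the reader supplies. Aspiration noise would then be added after this re-shaping, as described above.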

The transformation is not perfect but APLP results in a transformed high-effort voice that more closely matches the target breathy voice.

Glottal wave analysis with adaptive inverse filtering

Wednesday, June 13th, 2007

I have been developing a way to measure variations in the spectral envelope of the glottal source. Not long ago, I came across an old paper that describes much of what I had discovered on my own. Adaptive inverse filtering can effectively capture large variations in voice quality between breathy and pressed voices where more standard techniques of closed-phase, covariance LPC break down. Closed-phase techniques don’t work well for high-pitched or breathy voices because the closed phase can be short to non-existent. Paavo Alku has written an excellent paper that effectively describes an adaptive inverse filtering technique (adaptive LPC) that can extract glottal pulses from breathy and pressed voices. If you’re interested in robust techniques for modeling variations in the glottal source, Alku’s paper is well worth reading.

  • Paavo Alku, “Glottal wave analysis with Pitch Synchronous Iterative Adaptive Inverse Filtering,” Speech Communication, vol. 11, pp. 109–118, 1992.

Comparing 1st, 2nd and 3rd order pre-emphasis filters

Thursday, March 1st, 2007

You will find some samples, below, that demonstrate how the voice conversion algorithm sounds different depending upon the order of the adaptive pre-emphasis filter. The goal of the algorithm is to reduce the perceived vocal effort and to increase the perceived breathiness.

In the adaptive pre-emphasis algorithm, it’s necessary to choose an order of filter for the pre-emphasis. If the order is too low, the pre-emphasis does not have enough dynamic range. If the order is too high, the pre-emphasis will capture formant information.

I’m using LPC to estimate the pre-emphasis filter. As long as it is a first order filter, the pole of the filter is at zero hertz and the pre-emphasis filter looks like a spectral tilt. This is the typical configuration for the pre-emphasis filter. At orders higher than one, LPC can estimate a pre-emphasis filter with pole(s) at higher frequencies in the voice spectrum. This happens with high-effort voices and the resulting pre-emphasis looks like a spectral tilt plus a mid-range resonance. You can find a plot of this result in the DAFX paper.
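
To make the effect of filter order concrete, the following sketch estimates a low-order LPC filter from one frame and reports where the roots of A(z) sit in frequency. At order one the single root is real, so it can only land at 0 Hz (or at the Nyquist frequency if the lag-1 autocorrelation is negative). At higher orders a complex root pair can appear, which shows up as a resonance. The function name `emphasis_pole_frequencies` and the Hann windowing are assumptions for illustration, not details taken from the DAFX paper.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap], the coefficients of A(z)."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += (1e-9 * r[0] + 1e-12) * np.eye(order)  # small regularization for stability
    return np.concatenate(([1.0], -np.linalg.solve(R, r[1:])))

def emphasis_pole_frequencies(frame, order, fs):
    """Frequencies in Hz (sorted ascending) of the roots of the order-`order` A(z)."""
    a = lpc(frame * np.hanning(len(frame)), order)
    roots = np.roots(a)                              # poles of the model 1/A(z)
    return np.sort(np.abs(np.angle(roots)) * fs / (2.0 * np.pi))
```

Calling `emphasis_pole_frequencies(frame, 1, fs)` and `emphasis_pole_frequencies(frame, 3, fs)` on the same frame shows whether the higher-order estimate has picked up a resonance in addition to the spectral tilt.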

I can make a number of arguments about whether the pre-emphasis should have a resonance in it or not. I’m not going to explain it now except to say that perceived vocal effort is the result of both changes to the voice source and changes to the vocal tract filter.

The above explanation was very brief. Whether you understand it or not, you can listen to some of the resulting samples, below:

We want to make the high-effort voice sound like this target breathy voice: wav.

Here is the original high-effort voice: wav.

One common way to try to simulate breathiness is to add aspiration noise to the LPC residual. This is what it sounds like when we do that with the high-effort voice: wav.

The voice conversion algorithm uses adaptive pre-emphasis LPC to reduce the perceived vocal effort in the voice before adding noise to simulate breathiness. Here is the transformed high-effort voice:

1st order pre-emphasis filter: wav.
2nd order pre-emphasis filter: wav.
3rd order pre-emphasis filter: wav.

I have opinions about the sounds of these samples but I’m curious about your opinion. Which sample do you think gets closest to the target breathy voice? Which sample sounds the most natural to you? Which sample sounds the most unnatural?