
APLP for phrases

Monday, June 2nd, 2008

I have now implemented Adaptive Pre-emphasis Linear Prediction (APLP) for phrases. The goal of this implementation of APLP is to transform high-effort voices into breathy voices.

Original voice:
I started with this sound sample: wav. With this kind of voice, it is difficult to add breathiness in a way that sounds natural.

Breath effect with constant pre-emphasis linear prediction (LP):
Constant pre-emphasis LP was carried out, noise was added to the residual, and the voice was resynthesized: wav. This is the technique used in current voice processors to apply a breath effect.
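The constant pre-emphasis pipeline described above can be sketched roughly as follows. This is not the author's implementation: the toy harmonic signal, filter orders, noise level, and the numpy/scipy stack are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC: A(z) = 1 - sum_k a_k z^-k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz(r[:order], r[1:order + 1])

fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
# toy stand-in for a voiced frame: harmonics of 150 Hz plus a little noise
x = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 6))
x = x + 0.1 * rng.standard_normal(len(x))

mu = 0.97                                 # constant pre-emphasis coefficient
xp = lfilter([1, -mu], [1], x)            # pre-emphasize
a = lpc(xp, order=12)
A = np.concatenate(([1.0], -a))           # prediction-error (inverse) filter
resid = lfilter(A, [1], xp)               # LPC residual

# "breath effect": mix noise into the residual, then resynthesize
resid_breathy = resid + 0.3 * np.std(resid) * rng.standard_normal(len(resid))
yp = lfilter([1], A, resid_breathy)       # all-pole resynthesis
y = lfilter([1], [1, -mu], yp)            # de-emphasize
```

Because the pre-emphasis coefficient is fixed, this reshapes nothing about the voice's overall envelope; it only adds noise energy, which is why the result retains the high-effort character.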

Breath effect with APLP:
APLP can be used to transform the spectral envelope of the voice to match that of a typical breathy voice. This reduces the perceived vocal effort and improves the blending of noise into the voice: wav.

Breath effect with APLP and breath modulation:
The amount of breathiness in a voice varies with the amount of effort. As such, it makes sense to add less breathiness during times of excessive effort. The APLP algorithm was further improved by modulating the quantity of added noise according to the quantity of vocal effort: wav.
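A minimal sketch of the modulation idea: estimate a per-frame effort proxy and scale the added noise down where effort is high. The high-band energy share used as the proxy is my assumption; the post does not specify how vocal effort was measured.

```python
import numpy as np

def frame_effort(x, frame=256):
    """Crude per-frame vocal-effort proxy: share of energy above a
    quarter of the band (high-effort voices are brighter). This proxy
    is an assumption, not the post's actual measure."""
    n = len(x) // frame
    eff = np.empty(n)
    for i in range(n):
        spec = np.abs(np.fft.rfft(x[i * frame:(i + 1) * frame])) ** 2
        cut = len(spec) // 4
        eff[i] = spec[cut:].sum() / (spec.sum() + 1e-12)
    return eff

def breath_noise_gain(effort, g_max=0.5):
    """Less added breath noise where the effort proxy is high."""
    e = (effort - effort.min()) / (np.ptp(effort) + 1e-12)
    return g_max * (1.0 - e)

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)     # stand-in for an LPC residual
gain = breath_noise_gain(frame_effort(x))
```

Each frame's noise contribution would then be scaled by the corresponding `gain` value before resynthesis.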

Now listen to the original voice again.

With the breath effect, some voices work better than others. Here is how the breath effect sounds on another voice:

Original voice: wav.
Breath effect with constant pre-emphasis LP: wav.
Breath effect with APLP: wav.
Breath effect with APLP and breath modulation: wav.

Technical Note:
Previous iterations of the algorithm included glottal closure detection to improve the blending of noise into the voice. Glottal closure detection was eliminated from this iteration of the algorithm. This ensures that the sound samples are representative of what can be achieved on real-time voice signals in a musical context.

Transforming perceived vocal effort and breathiness using adaptive pre-emphasis linear prediction

Tuesday, January 8th, 2008

I carried out some listening experiments to evaluate the voice transformation algorithm. The results of these experiments have not yet been published, but the algorithm is described in this paper, which was presented at DAFX. The goal is to transform a high-effort voice into a target breathy voice.

Here is the original high effort voice.

I want to transform that high-effort voice into this target breathy voice.

If I attempt to transform the high-effort voice by adding artificial aspiration noise while using constant pre-emphasis linear prediction (LP), the transformed voice sounds like this. The voice exhibits more aspiration noise but still retains the perception of high vocal effort.

A breathy effect from a commercial voice processor was also used to add breathiness to a high-effort voice. This LP-based effect uses some additional filtering to shape the voice spectrum and to shape the added aspiration noise. However, the core of the algorithm operates as a constant pre-emphasis LP algorithm.

The presence or absence of aspiration noise is only one of the differences between high-effort and breathy voices. When a voice shifts between high effort and breathiness, the spectral envelope of the voice also changes: high-effort voices contain more high-frequency content than the corresponding breathy voices. Constant pre-emphasis LP does not modify the spectral envelope of the voice.

Adaptive pre-emphasis linear prediction (APLP) can be used to change the perceived vocal effort by modifying the spectral envelope of the voice to match the spectral envelope of the breathy voice. After the spectral envelope has been re-shaped, artificial aspiration noise is added. The results vary slightly depending on the order of the spectral emphasis filter that is used to shape the spectral envelope: first-order spectral emphasis filter or third-order spectral emphasis filter.
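One way to sketch the envelope-matching step: fit a low-order LPC "emphasis" filter to both the source and the target, whiten the source with its own filter, then color it with the target's. The AR(1) noise signals below are synthetic stand-ins for the two recordings, and the whole snippet is an illustration rather than the paper's algorithm.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def emphasis_filter(x, order):
    """Low-order LPC fit A(z); order 1 captures spectral tilt,
    order 3 can add a mid-range resonance."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

rng = np.random.default_rng(0)
# synthetic stand-ins: a "brighter" high-effort source and a
# "darker" breathy target, modeled as AR(1) noise with different tilts
src = lfilter([1], [1, -0.5], rng.standard_normal(8000))
tgt = lfilter([1], [1, -0.95], rng.standard_normal(8000))

A_src = emphasis_filter(src, order=1)
A_tgt = emphasis_filter(tgt, order=1)

flat = lfilter(A_src, [1], src)        # strip the source's own tilt
reshaped = lfilter([1], A_tgt, flat)   # impose the target's tilt
```

After this reshaping, the added aspiration noise sits inside an envelope that already resembles the breathy target, which is why it blends better than with constant pre-emphasis.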

The transformation is not perfect but APLP results in a transformed high-effort voice that more closely matches the target breathy voice.

Comparing 1st, 2nd and 3rd order pre-emphasis filters

Thursday, March 1st, 2007

You will find some samples below that demonstrate how the voice conversion algorithm sounds different depending on the order of the adaptive pre-emphasis filter. The goal of the algorithm is to reduce the perceived vocal effort and to increase the perceived breathiness.

In the adaptive pre-emphasis algorithm, it’s necessary to choose a filter order for the pre-emphasis. If the order is too low, the pre-emphasis does not have enough dynamic range. If the order is too high, the pre-emphasis will capture formant information.

I’m using LPC to estimate the pre-emphasis filter. As long as it is a first-order filter, the pole of the filter is at zero hertz and the pre-emphasis filter looks like a spectral tilt. This is the typical configuration for the pre-emphasis filter. At orders higher than one, LPC can estimate a pre-emphasis filter with poles at higher frequencies in the voice spectrum. This happens with high-effort voices, and the resulting pre-emphasis looks like a spectral tilt plus a mid-range resonance. You can find a plot of this result in the DAFX paper.
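The pole behaviour described above can be demonstrated numerically. The snippet below, a sketch rather than the post's code, fits order-1 and order-3 LPC to a synthetic signal with a spectral tilt plus a resonance near 2 kHz; the pole locations (0.9 and 0.95 at 2 kHz) are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def preemphasis_poles(x, order):
    """Fit an order-N all-pole pre-emphasis by LPC and return its poles."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.roots(np.concatenate(([1.0], -a)))

fs = 8000
rng = np.random.default_rng(2)
# stand-in for a high-effort voice envelope: spectral tilt plus a
# resonance near 2 kHz (both pole locations are illustrative)
tilt = [1, -0.9]
res = [1, -2 * 0.95 * np.cos(2 * np.pi * 2000 / fs), 0.95 ** 2]
x = lfilter([1], np.convolve(tilt, res), rng.standard_normal(16000))

p1 = preemphasis_poles(x, 1)   # one real pole: pure spectral tilt
p3 = preemphasis_poles(x, 3)   # can place a complex pair: tilt + resonance
res_hz = np.abs(np.angle(p3)).max() * fs / (2 * np.pi)
```

The order-1 fit is forced to a single real pole (pure tilt), while the order-3 fit is free to place a complex-conjugate pair at the mid-range resonance.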

There are arguments both for and against including a resonance in the pre-emphasis. I won’t go into them here, except to say that perceived vocal effort results from changes to both the voice source and the vocal tract filter.

That explanation was brief. Whether or not it all made sense, you can listen to some of the resulting samples below:

We want to make the high-effort voice sound like this target breathy voice: wav.

Here is the original high-effort voice: wav.

One common way to try to simulate breathiness is to add aspiration noise to the LPC residual. This is what it sounds like when we do that with the high-effort voice: wav.

The voice conversion algorithm uses adaptive pre-emphasis LPC to reduce the perceived vocal effort in the voice before adding noise to simulate breathiness. Here is the transformed high-effort voice:

1st-order pre-emphasis filter: wav.
2nd-order pre-emphasis filter: wav.
3rd-order pre-emphasis filter: wav.

I have opinions about the sounds of these samples but I’m curious about your opinion. Which sample do you think gets closest to the target breathy voice? Which sample sounds the most natural to you? Which sample sounds the most unnatural?

Further improvements to the voice conversion algorithm

Friday, October 27th, 2006

I wasn’t entirely happy with the DAFX samples. There were a number of artifacts in those voice samples that I have now reduced. (You will need good speakers to hear the differences properly, i.e. they certainly won’t sound right through laptop speakers.)

My goal is to convert a high-effort voice into a breathy voice.

Here is the breathy voice that I’m using as my target: wav.

Here is the high-effort voice that I’m trying to transform: wav.

And here is the transformed voice with the pre-emphasis modified to simulate reduced effort. Pulsed noise has also been added to simulate breathiness: wav.
Same thing with more aspiration noise added: wav.

The transformed high-effort voice sounds more relaxed and breathy even if it has not been fully transformed into the target voice.

The main problem with the DAFX voice samples is that there is too much gain and spectrum modulation in the LPC filter (the LPC filter bounces around). When there is just one LPC filter, this problem is not as large. However, my algorithm has two LPC filters in series. The filter modulation from the pre-emphasis filter (low-order LPC) exacerbates the filter modulation from the following vocal tract filter (high-order LPC), making the artifacts worse.

I reduced the artifacts by keeping the pre-emphasis filter constant for the short voice segments that I am synthesizing. The pre-emphasis still varies from sample to sample. The next step would be to use time-varying pre-emphasis but to smooth the filter coefficients in time.

I also reshaped the added noise to make it more similar to the breathy noise in the target voice.

PS: You can hear even more recent sound samples here.

DAFX06 samples

Thursday, March 30th, 2006

This post provides sound samples of a new technique to improve linear predictive coding (LPC). This technique can also be used to modify the perception of vocal effort.

What happens when we use LPC to estimate formant filters from voice samples with two different voice qualities while keeping all other variables constant?

Here we have three pairs of voice samples. In each pair, the same voice is singing the same note but one sample is breathy and the other sample exhibits higher vocal effort. These are the original samples: popeil, low, hi.

LPC was carried out on these samples. New voices were resynthesized using an artificial excitation that remains constant across the two samples in the pair. Since the artificial excitation remains the same, the perceived differences between the samples are due to the LPC formant filters. If you listen to the pairs, you will find that the breathy formant filter sounds like it has more breathiness and the high-effort formant filter still sounds like it has more effort: popeil, low, hi. LPC captures in the formant filter some of the differences between a high-effort voice and a breathy voice. Ideally, this change should not be in the formant filter.

I am working on a variable pre-emphasis algorithm as an extension of LPC to eliminate variability in the perception of vocal effort from the formant filter. Variable pre-emphasis LPC (VPLPC) results in formant filters that are more uniform across varying voice qualities. VPLPC was carried out on the original samples. New voices were resynthesized using an artificial excitation that remains constant across the two samples in the pair. Since the artificial excitation remains the same, the perceived differences between the samples are due to the VPLPC formant filters. If you listen to the pairs, you will find that the breathy formant filter sounds similar to the high-effort formant filter: popeil, low, hi. The formant filters derived by VPLPC sound more neutral with respect to voice quality than the formant filters derived by standard LPC.
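The two-stage analysis can be sketched as follows: fit a low-order variable pre-emphasis filter first, flatten the signal with it, then fit the higher-order formant filter to the flattened signal. The AR(1) test signal and the filter orders are illustrative assumptions, not the actual VPLPC implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_poly(x, order):
    """Autocorrelation-method LPC, returned as A(z) = 1 - sum a_k z^-k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

rng = np.random.default_rng(4)
# stand-in for one voiced frame: AR(1) noise with a strong spectral tilt
x = lfilter([1], [1, -0.9], rng.standard_normal(8000))

A_vp = lpc_poly(x, 1)                # variable pre-emphasis: tilt only
x_flat = lfilter(A_vp, [1], x)       # remove the tilt first
A_formant = lpc_poly(x_flat, 12)     # formant filter, now tilt-neutral
residual = lfilter(A_formant, [1], x_flat)
```

Because the effort-related tilt is absorbed by `A_vp` before the formant analysis runs, the formant filter ends up more neutral with respect to voice quality.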

The VPLPC algorithm uses a variable pre-emphasis (VP) filter to capture variation in the spectral envelope. The variation in the spectral envelope primarily relates to the perception of vocal effort. By manipulating the VP filter, it is possible to increase or decrease the perception of vocal effort. The following samples have been modified solely by changing the VP filter. (It will be easier to hear the differences if you have high-quality speakers or headphones.)

Reduce vocal effort:
original popeil_higheffort, popeil_lesseffort
original low_higheffort, low_lesseffort
original hi_higheffort, hi_lesseffort

Increase vocal effort:
original popeil_breathy, popeil_moreeffort
original low_breathy, low_moreeffort
original hi_breathy, hi_moreeffort

Manipulation of the VP filter does not fully transform the perception of vocal effort because our ears expect to hear simultaneous changes to the mix of harmonic and noise content: high-effort voices should carry less aspiration noise. This makes the VP filter transformation less effective when the original voice contains significant aspiration noise.

When reducing the perception of vocal effort, our ears expect to hear more aspiration noise. The following VP filter transformation also adds aspiration noise in an attempt to make the sample sound more natural: original popeil_higheffort, popeil_lesseffort_plusnoise.

In summary, VPLPC produces formant filters that are more robust to changes in voice quality, and the VP filter has some influence on the perception of vocal effort. For a fuller transformation, more work needs to go into finding an appropriate way to modify the mix of harmonics and noise in the residual.

This is the first attempt to use the VP filter to manipulate the perceived voice quality. More sophisticated techniques could provide more effective control.

Interpolating LPC coefficients

Friday, February 24th, 2006

Here is the latest iteration in the sound of my artificial excitation algorithm for continuous speech: mp3, wav.

Technical description of the latest improvement:

Problem: My algorithm estimates LPC filter coefficients for each block of voice data. The filter stays the same for each block. When the filter changes between blocks, this can result in a discontinuity if the filters are dissimilar. I perceived a “grainy” sound that I thought might be due to discontinuities between filters.

Solution: To smooth out the differences between filters, I implemented an algorithm to interpolate between LPC filter coefficients. Right now, it’s just linear interpolation. I might do something more sophisticated later.
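The interpolation step is simple enough to sketch directly. The coefficient vectors below are hypothetical block filters, not values from the actual system.

```python
import numpy as np

def interp_lpc(a_prev, a_next, n_steps):
    """Linear interpolation between two LPC coefficient vectors,
    one interpolated filter per synthesis sub-block."""
    w = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1 - w) * a_prev + w * a_next

# hypothetical filters for two adjacent analysis blocks
a_prev = np.array([1.0, -1.2, 0.5])
a_next = np.array([1.0, -0.8, 0.3])
filters = interp_lpc(a_prev, a_next, 5)
```

One known caveat with this approach: linearly interpolating raw LPC coefficients can produce unstable intermediate filters even when both endpoints are stable, which is why interpolation is often done in a different domain, such as line spectral frequencies or reflection coefficients.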

Artificial excitation for continuous speech

Monday, February 20th, 2006

My algorithms can now handle continuous speech, thanks to my finding a good pitch detector for voice. Praat is a comprehensive voice analyzer that includes a pitch detector and a pitch-contour editor. It’s handy because it shows a number of pitch candidates, and you can change which candidate is selected or mark a segment as unvoiced.

If you want to hear what the algorithm sounds like, then listen to this sample.

For the unvoiced sections, the original voice was passed through unmodified. For the voiced sections, I used LPC to represent the vocal tract filter. For the glottal source (sound from the vocal folds) I used an LF model with settings for a modal voice. (You can learn how to control the LF model here.) Some noise was added to the glottal source to represent aspiration noise.
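The voiced-section synthesis can be sketched in simplified form. The LF model itself has several interacting parameters, so the snippet below substitutes the simpler Rosenberg glottal pulse as a stand-in; the pitch, formant, and noise settings are all illustrative, not the values used in the sample.

```python
import numpy as np
from scipy.signal import lfilter

def rosenberg_pulse(n, open_frac=0.6, close_frac=0.3):
    """One cycle of glottal flow (Rosenberg model), a simpler
    stand-in for the LF model used in the post."""
    n_open = int(open_frac * n)
    n_close = int(close_frac * n)
    g = np.zeros(n)
    t_o = np.arange(n_open)
    g[:n_open] = 0.5 * (1 - np.cos(np.pi * t_o / n_open))             # opening
    t_c = np.arange(n_close)
    g[n_open:n_open + n_close] = np.cos(np.pi * t_c / (2 * n_close))  # closing
    return g

fs, f0 = 8000, 125
period = fs // f0                           # 64 samples per cycle
source = np.tile(rosenberg_pulse(period), 50)

rng = np.random.default_rng(3)
source = source + 0.05 * rng.standard_normal(len(source))  # aspiration noise

# toy one-formant vocal-tract filter near 500 Hz (illustrative, not LPC)
r, th = 0.97, 2 * np.pi * 500 / fs
voice = lfilter([1], [1, -2 * r * np.cos(th), r * r], source)
```

In the actual system the vocal-tract filter comes from the LPC analysis of each voiced section rather than a fixed resonator.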

My hard-drive died on my laptop. My laptop is now in the shop and I bought an inexpensive desktop that will carry me through until it gets back.

Using voice conversion as a paradigm for analyzing breath quality

Friday, August 26th, 2005

I published a paper for the PacRim Conference entitled:

Using voice conversion as a paradigm for analyzing breath quality

Here are some sound samples to go along with the paper. Read the paper if you want to know where the samples came from.

Original breathy voice
Original non-breathy voice
Synthesized: breathy excitation, non-breathy vocal tract filter
Synthesized: non-breathy excitation, breathy vocal tract filter

It’s a start

Friday, July 29th, 2005

I have implemented an LF-model-plus-noise approach to transforming a non-breathy voice into a breathy voice. (If you don’t know what an LF model is, then scroll halfway down this page.) I have some preliminary results. There are four samples in the file:

  • the original voice
  • an approximate reconstruction of the original voice using the LF model
  • making the voice breathier with the LF model
  • making the voice even more breathy

There are some glitches to fix up but it is a start in the right direction. After that I need to find a way to control the model in a more refined way.