Spectral voice conversion for text-to-speech synthesis

Alexander Kain, Michael W. Macon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

456 Citations (Scopus)

Abstract

A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.
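The abstract's core mapping is a joint-density GMM regression: a Gaussian mixture model is fit on stacked source/target spectral vectors z = [x; y], and conversion applies the conditional mean E[y | x], a posterior-weighted sum of per-component linear transforms. Below is a minimal Python sketch of that idea, assuming time-aligned source/target feature matrices (e.g. line spectral frequencies from the LPC analysis); the use of scikit-learn/SciPy and all names, dimensions, and defaults are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X, Y, n_components=8, seed=0):
    """Fit a GMM on joint vectors z = [x; y] (joint density estimation).

    X, Y: (n_frames, d) time-aligned source and target spectral features.
    """
    Z = np.hstack([X, Y])  # joint vectors, shape (n_frames, 2d)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(Z)

def convert(gmm, X, d):
    """Locally linear conversion implied by the joint density:

        F(x) = sum_k p(k|x) [mu_y_k + S_yx_k inv(S_xx_k) (x - mu_x_k)]
    """
    K = gmm.n_components
    # Partition each component's mean and covariance into x/y blocks.
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]

    # Posterior p(k|x) under each component's x-marginal.
    post = np.stack([gmm.weights_[k] *
                     multivariate_normal.pdf(X, mu_x[k], S_xx[k])
                     for k in range(K)], axis=1)
    post /= post.sum(axis=1, keepdims=True)

    Y_hat = np.zeros((X.shape[0], d))
    for k in range(K):
        A = S_yx[k] @ np.linalg.inv(S_xx[k])  # per-component regression matrix
        Y_hat += post[:, k:k+1] * (mu_y[k] + (X - mu_x[k]) @ A.T)
    return Y_hat

The VQ-based selection of diphone subsets used to build training sets of varying sizes can be sketched similarly; the snippet below substitutes a k-means codebook for whatever quantizer the paper used (an assumption), keeping the diphone nearest each codeword.

from sklearn.cluster import KMeans

def select_diphone_subset(features, subset_size, seed=0):
    """Keep one representative diphone per k-means codeword (a hypothetical
    stand-in for the paper's vector quantization method)."""
    km = KMeans(n_clusters=subset_size, n_init=10, random_state=seed)
    dists = km.fit_transform(features)      # distances to each centroid
    return np.unique(dists.argmin(axis=0))  # nearest item per codeword

The pitch adjustment of the LPC residuals described in the abstract (matching the target speaker's average pitch) is a separate time-domain step and is not sketched here.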

Original language: English (US)
Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher: IEEE
Pages: 285-288
Number of pages: 4
Volume: 1
State: Published - 1998
Event: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. Part 1 (of 6) - Seattle, WA, USA
Duration: May 12 1998 - May 15 1998

Other

Other: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. Part 1 (of 6)
City: Seattle, WA, USA
Period: 5/12/98 - 5/15/98

Fingerprint

  • Speech synthesis
  • Synthesizers
  • Linear transformations
  • Vector quantization
  • Acoustic waves
  • Acoustics
  • Evaluation

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics
  • Software

Cite this

Kain, A., & Macon, M. W. (1998). Spectral voice conversion for text-to-speech synthesis. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Vol. 1, pp. 285-288). IEEE.

Spectral voice conversion for text-to-speech synthesis. / Kain, Alexander; Macon, Michael W.

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Vol. 1 IEEE, 1998. p. 285-288.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Kain, A & Macon, MW 1998, Spectral voice conversion for text-to-speech synthesis. in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. vol. 1, IEEE, pp. 285-288, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. Part 1 (of 6), Seattle, WA, USA, 5/12/98.
Kain A, Macon MW. Spectral voice conversion for text-to-speech synthesis. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Vol. 1. IEEE. 1998. p. 285-288
Kain, Alexander ; Macon, Michael W. / Spectral voice conversion for text-to-speech synthesis. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Vol. 1 IEEE, 1998. pp. 285-288
@inproceedings{7ea32226e1f040fbbadb26d8a58e05a1,
title = "Spectral voice conversion for text-to-speech synthesis",
abstract = "A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.",
author = "Alexander Kain and Macon, {Michael W.}",
year = "1998",
language = "English (US)",
volume = "1",
pages = "285--288",
booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "IEEE",

}

TY - GEN

T1 - Spectral voice conversion for text-to-speech synthesis

AU - Kain, Alexander

AU - Macon, Michael W.

PY - 1998

Y1 - 1998

N2 - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.

AB - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.

UR - http://www.scopus.com/inward/record.url?scp=0031623661&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0031623661&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0031623661

VL - 1

SP - 285

EP - 288

BT - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

PB - IEEE

ER -