Using a manifold vocoder for spectral voice and style conversion

Tuan Dinh, Alexander Kain, Kris Tjaden

Research output: Contribution to journal › Conference article


Abstract

We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality than 40-dimensional MCEPs, with similar speaker accuracy. In habitual-to-clear style conversion experiments, we significantly improved speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with average keyword recall accuracy increasing from 24% to 46%.
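To make the core idea concrete, the sketch below illustrates how a VAE encoder maps a high-resolution spectrum to a compact latent feature via the standard reparameterization trick. It is a minimal NumPy illustration, not the authors' implementation: the 513-bin spectrum dimension and the random affine "encoder" weights are assumptions, while the 12-dimensional latent size follows the VAE-12 feature described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 513 FFT bins is hypothetical; the 12-dimensional
# latent space matches the VAE-12 feature from the abstract.
SPEC_DIM, LATENT_DIM = 513, 12

# Toy affine "encoder" weights standing in for a trained network.
W_mu = rng.standard_normal((SPEC_DIM, LATENT_DIM)) * 0.01
W_logvar = rng.standard_normal((SPEC_DIM, LATENT_DIM)) * 0.01

def encode(spectrum):
    """Map a spectrum to latent mean and log-variance."""
    return spectrum @ W_mu, spectrum @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick),
    which keeps sampling differentiable during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

spectrum = rng.standard_normal(SPEC_DIM)  # stand-in for one spectral frame
mu, logvar = encode(spectrum)
z = reparameterize(mu, logvar)
print(z.shape)  # (12,)
```

Because the latent space lies on a smooth low-dimensional manifold, such 12-dimensional features can be averaged or interpolated without the artifacts that direct spectral averaging produces, which is what makes them suitable for regression-based conversion.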

Original language: English (US)
Pages (from-to): 1388-1392
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2019-September
DOIs
State: Published - Jan 1 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: Sep 15 2019 - Sep 19 2019

Keywords

  • Intelligibility
  • Speech coding
  • Style conversion
  • Variational autoencoder
  • Voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
