A voice conversion mapping function based on a stacked joint-autoencoder

Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Contribution to journal › Article

8 Scopus citations


In this study, we propose a novel method for training a regression function and apply it to a voice conversion task. The regression function is constructed using a Stacked Joint-Autoencoder (SJAE). Previously, we used a more primitive version of this architecture for pre-training a Deep Neural Network (DNN). Using objective evaluation criteria, we show that the lower levels of the SJAE perform best with a low degree of jointness, and the higher levels with a higher degree of jointness. We demonstrate that our proposed approach generates features that do not suffer from the averaging effect inherent in backpropagation training. We also carried out subjective listening experiments to evaluate speech quality and speaker similarity. Our results show that the SJAE approach achieves both higher quality and higher similarity than an SJAE+DNN approach, in which the SJAE is used for pre-training a DNN and the fine-tuned DNN is then used for mapping. We also present the system description and results of our submission to the Voice Conversion Challenge 2016.
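To illustrate the core idea behind a joint-autoencoder, the following is a minimal sketch of one layer's loss function, assuming linear encoders/decoders with a tanh hidden layer and a squared-error "jointness" penalty that ties the source and target hidden codes together. All names, dimensions, and the exact form of the penalty are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 24-dim spectral features, 10-dim code, 5 frames.
D, H, N = 24, 10, 5

# Separate encoder/decoder weights for the source and target speakers
# (randomly initialized here; in practice these would be trained).
We_s, Wd_s = rng.standard_normal((H, D)) * 0.1, rng.standard_normal((D, H)) * 0.1
We_t, Wd_t = rng.standard_normal((H, D)) * 0.1, rng.standard_normal((D, H)) * 0.1

def encode(W, x):
    # Nonlinear hidden code for a (D, N) batch of feature frames.
    return np.tanh(W @ x)

def decode(W, h):
    # Linear reconstruction from the (H, N) hidden code.
    return W @ h

def joint_ae_loss(x_s, x_t, lam):
    """Reconstruction losses for both speakers plus a 'jointness' penalty
    tying the two hidden codes together; lam controls the degree of
    jointness (lam = 0 reduces to two independent autoencoders)."""
    h_s, h_t = encode(We_s, x_s), encode(We_t, x_t)
    rec_s = np.mean((decode(Wd_s, h_s) - x_s) ** 2)
    rec_t = np.mean((decode(Wd_t, h_t) - x_t) ** 2)
    joint = np.mean((h_s - h_t) ** 2)
    return rec_s + rec_t + lam * joint

# Time-aligned source/target frames (random stand-ins for aligned speech).
x_s, x_t = rng.standard_normal((D, N)), rng.standard_normal((D, N))
print(joint_ae_loss(x_s, x_t, lam=0.5))
```

Under this sketch, conversion would encode a source frame with the source encoder and decode it with the target decoder; stacking such layers, with a lower `lam` at the bottom and a higher `lam` at the top, mirrors the paper's finding that lower levels benefit from less jointness and higher levels from more.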

Original language: English (US)
Pages (from-to): 1647-1651
Number of pages: 5
Journal: Unknown Journal
State: Published - 2016


Keywords

  • Autoencoder
  • Deep neural network
  • Joint-autoencoder
  • Voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
