Semi-supervised training of a voice conversion mapping function using a joint-autoencoder

Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Chapter in Book/Report/Conference proceedingConference contribution

9 Scopus citations

Abstract

Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked- Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.

Original languageEnglish (US)
Title of host publicationProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PublisherInternational Speech and Communication Association
Pages284-288
Number of pages5
Volume2015-January
StatePublished - 2015
Event16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015Sep 10 2015

Other

Other16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
Country/TerritoryGermany
CityDresden
Period9/6/159/10/15

Keywords

  • Deep neural network
  • Pre-training
  • Semisupervised learning
  • Voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'Semi-supervised training of a voice conversion mapping function using a joint-autoencoder'. Together they form a unique fingerprint.

Cite this