Voice conversion using deep neural networks with speaker-independent pre-training

Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

42 Scopus citations

Abstract

In this study, we trained a deep autoencoder to build compact representations of short-term spectra of multiple speakers. Using this compact representation as mapping features, we then trained an artificial neural network to predict target voice features from source voice features. Finally, we constructed a deep neural network from the trained deep autoencoder and artificial neural network weights, which were then fine-tuned using back-propagation. We compared the proposed method to existing methods using Gaussian mixture models and frame-selection. We evaluated the methods objectively, and also conducted perceptual experiments to measure both the conversion accuracy and speech quality of selected systems. The results showed that, for 70 training sentences, frame-selection performed best, regarding both accuracy and quality. When using only two training sentences, the pre-trained deep neural network performed best, regarding both accuracy and quality.
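The construction step described in the abstract (building one deep network from the autoencoder and mapping-network weights) can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the activation choices, the single-layer encoder and decoder, and the helper names (`dense`, `forward`, `random_layer`) are all assumptions made for illustration, and the weights are random stand-ins rather than trained values. Training and fine-tuning are omitted. The sketch only shows that stacking encoder, mapping, and decoder layers gives a single network whose forward pass matches the staged pipeline.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def dense(x, layer):
    """Apply one fully connected layer: activation(W x + b)."""
    weights, biases, activation = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Run a feature vector through a list of layers in order."""
    for layer in layers:
        x = dense(x, layer)
    return x


def random_layer(n_in, n_out, activation, rng):
    """Stand-in for a trained layer (weights are random, not learned)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return (weights, biases, activation)


rng = random.Random(0)
N_SPEC, N_CODE, N_HID = 8, 3, 4  # hypothetical feature, code, and hidden sizes

# Autoencoder: spectral frame -> compact code -> spectral frame.
# In the abstract this part is trained on multiple speakers.
encoder = [random_layer(N_SPEC, N_CODE, sigmoid, rng)]
decoder = [random_layer(N_CODE, N_SPEC, identity, rng)]

# Mapping network: source-speaker code -> target-speaker code.
mapper = [random_layer(N_CODE, N_HID, sigmoid, rng),
          random_layer(N_HID, N_CODE, sigmoid, rng)]

# Stack the layers into one deep network. Fine-tuning with
# back-propagation would start from these weights (not shown).
dnn = encoder + mapper + decoder

source_frame = [rng.random() for _ in range(N_SPEC)]
staged = forward(forward(forward(source_frame, encoder), mapper), decoder)
stacked = forward(source_frame, dnn)
```

Because the stacked network applies the same layers in the same order, `stacked` and `staged` are identical before any fine-tuning. Fine-tuning would then adjust all layers jointly.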

Original language: English (US)
Title of host publication: 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 19-23
Number of pages: 5
ISBN (Electronic): 9781479971299
DOI: https://doi.org/10.1109/SLT.2014.7078543
State: Published - Apr 1 2014
Event: 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - South Lake Tahoe, United States
Duration: Dec 7 2014 to Dec 10 2014

Publication series

Name: 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings

Other

Other: 2014 IEEE Workshop on Spoken Language Technology, SLT 2014
Country: United States
City: South Lake Tahoe
Period: 12/7/14 to 12/10/14

Keywords

  • Autoencoder
  • Deep neural network
  • Pre-training
  • Voice conversion

ASJC Scopus subject areas

  • Computer Science Applications
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
  • Language and Linguistics


Cite this

Mohammadi, S. H., & Kain, A. (2014). Voice conversion using deep neural networks with speaker-independent pre-training. In 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings (pp. 19-23). [7078543] (2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT.2014.7078543