Abstract
Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked-Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.
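As a rough illustration of how the final network can be assembled from the source-encoding and target-decoding parts, here is a minimal NumPy sketch. It is not the authors' implementation. The single hidden layer, the tanh activation, the feature and hidden dimensions (40 and 16), the random stand-in for pre-trained weights, and the form of `joint_loss` (two reconstruction errors plus a penalty on the distance between the encodings) are all assumptions made for illustration; the abstract does not specify them. Training and fine-tuning are omitted.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(n_in, n_hidden):
    """Random weights for a one-hidden-layer autoencoder (stand-in for a pre-trained SAE)."""
    return {
        "W_enc": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, 0.1, (n_hidden, n_in)),
        "b_dec": np.zeros(n_in),
    }

def encode(ae, x):
    # Assumed tanh encoder; the abstract does not name the activation.
    return np.tanh(x @ ae["W_enc"] + ae["b_enc"])

def decode(ae, h):
    # Assumed linear decoder.
    return h @ ae["W_dec"] + ae["b_dec"]

def joint_loss(ae_src, ae_tgt, x_src, x_tgt):
    """Assumed joint objective: both reconstruction errors plus an
    encoding-mismatch penalty that pulls parallel frames to a common code."""
    h_src, h_tgt = encode(ae_src, x_src), encode(ae_tgt, x_tgt)
    recon = (np.mean((decode(ae_src, h_src) - x_src) ** 2)
             + np.mean((decode(ae_tgt, h_tgt) - x_tgt) ** 2))
    match = np.mean((h_src - h_tgt) ** 2)
    return recon + match

def convert(ae_src, ae_tgt, x_src):
    """Mapping network: source encoder followed by target decoder."""
    return decode(ae_tgt, encode(ae_src, x_src))

# Both branches start from the same (here: random, hypothetical) SAE weights.
sae = init_autoencoder(n_in=40, n_hidden=16)
ae_src, ae_tgt = copy.deepcopy(sae), copy.deepcopy(sae)

# Hypothetical parallel (time-aligned) source and target feature frames.
x_src = rng.normal(size=(5, 40))
x_tgt = rng.normal(size=(5, 40))

loss = joint_loss(ae_src, ae_tgt, x_src, x_tgt)
y = convert(ae_src, ae_tgt, x_src)  # converted frames, same shape as x_src
```

In this sketch, minimizing `joint_loss` over parallel frames would be the joint training step, after which `convert` would be fine-tuned end to end with back-propagation on source-target pairs.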
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Publisher | International Speech Communication Association |
Pages | 284-288 |
Number of pages | 5 |
Volume | 2015-January |
State | Published - 2015 |
Event | 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany; Duration: Sep 6 2015 → Sep 10 2015 |
Other
Other | 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 |
---|---|
Country/Territory | Germany |
City | Dresden |
Period | 9/6/15 → 9/10/15 |
Keywords
- Deep neural network
- Pre-training
- Semisupervised learning
- Voice conversion
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modeling and Simulation