Semi-supervised training of a voice conversion mapping function using a joint-autoencoder

Seyed Hamidreza Mohammadi; Alexander Kain

Institute on Development and Disability

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked-Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.
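
The construction described in the abstract (two autoencoders initialized from one pre-trained autoencoder, trained toward a common encoding, then recombined as source encoder plus target decoder) can be sketched roughly as follows. This is a minimal single-layer NumPy illustration, not the authors' implementation: the feature dimension (24), hidden size (16), tanh encoder, linear decoder, the exact form of `joint_loss`, and all function names are assumptions made for the example, and the random weights merely stand in for a pre-trained SAE.

```python
import numpy as np

def init_autoencoder(n_in, n_hidden, rng):
    """Random weights for one encoder/decoder pair (hypothetical layout)."""
    scale = 1.0 / np.sqrt(n_in)
    return {
        "W_enc": rng.normal(0.0, scale, (n_in, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, scale, (n_hidden, n_in)),
        "b_dec": np.zeros(n_in),
    }

def encode(p, x):
    # Nonlinear encoding of a batch of feature frames, shape (frames, n_in).
    return np.tanh(x @ p["W_enc"] + p["b_enc"])

def decode(p, h):
    # Linear reconstruction from the hidden encoding.
    return h @ p["W_dec"] + p["b_dec"]

def joint_loss(src, tgt, x, y):
    """Assumed joint objective: both reconstructions plus encoding mismatch."""
    hx, hy = encode(src, x), encode(tgt, y)
    rec_x = np.mean((decode(src, hx) - x) ** 2)
    rec_y = np.mean((decode(tgt, hy) - y) ** 2)
    match = np.mean((hx - hy) ** 2)  # pulls parallel frames to a common code
    return rec_x + rec_y + match

def build_mapping(src, tgt):
    """Final mapping: source-encoding part followed by target-decoding part."""
    def convert(x):
        return decode(tgt, encode(src, x))
    return convert

rng = np.random.default_rng(0)
sae = init_autoencoder(24, 16, rng)           # stand-in for a pre-trained SAE
src = {k: v.copy() for k, v in sae.items()}   # both halves start from the SAE
tgt = {k: v.copy() for k, v in sae.items()}

x = rng.normal(size=(5, 24))  # placeholder source frames
y = rng.normal(size=(5, 24))  # placeholder time-aligned target frames

convert = build_mapping(src, tgt)
print(convert(x).shape)  # (5, 24)
```

In a full system, `joint_loss` would be minimized on parallel data before `convert` is fine-tuned with back-propagation; the sketch stops at the point where the mapping is assembled.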

Original language	English (US)
Title of host publication	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher	International Speech Communication Association
Pages	284-288
Number of pages	5
Volume	2015-January
State	Published - 2015
Event	16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany; Duration: Sep 6 2015 → Sep 10 2015

Other

Other	16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
Country/Territory	Germany
City	Dresden
Period	9/6/15 → 9/10/15

Keywords

Deep neural network
Pre-training
Semisupervised learning
Voice conversion

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Cite this

Semi-supervised training of a voice conversion mapping function using a joint-autoencoder. / Mohammadi, Seyed Hamidreza; Kain, Alexander.
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2015-January. International Speech Communication Association, 2015. p. 284-288.

Mohammadi, SH & Kain, A 2015, Semi-supervised training of a voice conversion mapping function using a joint-autoencoder. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. vol. 2015-January, International Speech Communication Association, pp. 284-288, 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015, Dresden, Germany, 9/6/15.

@inproceedings{2960c7d16c9647d9a549811838233dcb,

title = "Semi-supervised training of a voice conversion mapping function using a joint-autoencoder",

abstract = "Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked-Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.",

keywords = "Deep neural network, Pre-training, Semisupervised learning, Voice conversion",

author = "Mohammadi, {Seyed Hamidreza} and Kain, Alexander",

year = "2015",

language = "English (US)",

volume = "2015-January",

pages = "284--288",

booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

note = "16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015; Conference date: 6 September 2015 through 10 September 2015",

}

TY - GEN

T1 - Semi-supervised training of a voice conversion mapping function using a joint-autoencoder

AU - Mohammadi, Seyed Hamidreza

AU - Kain, Alexander

PY - 2015

Y1 - 2015

AB - Recently, researchers have begun to investigate Deep Neural Network (DNN) architectures as mapping functions in voice conversion systems. In this study, we propose a novel Stacked-Joint-Autoencoder (SJAE) architecture, which aims to find a common encoding of parallel source and target features. The SJAE is initialized from a Stacked-Autoencoder (SAE) that has been trained on a large general-purpose speech database. We also propose to train the SJAE using unrelated speakers that are similar to the source and target speaker, instead of using only the source and target speakers. The final DNN is constructed from the source-encoding part and the target-decoding part of the SJAE, and then fine-tuned using back-propagation. The use of this semi-supervised training approach allows us to use multiple frames during mapping, since we have previously learned the general structure of the acoustic space and also the general structure of similar source-target speaker mappings. We train two speaker conversions and compare several system configurations objectively and subjectively while varying the number of available training sentences. The results show that each of the individual contributions of SAE, SJAE, and using unrelated speakers to initialize the mapping function increases conversion performance.

KW - Deep neural network

KW - Pre-training

KW - Semisupervised learning

KW - Voice conversion

UR - http://www.scopus.com/inward/record.url?scp=84959173289&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959173289&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84959173289

VL - 2015-January

SP - 284

EP - 288

BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

PB - International Speech Communication Association

T2 - 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015

Y2 - 6 September 2015 through 10 September 2015

ER -
