A voice conversion mapping function based on a stacked joint-autoencoder

Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Contribution to journal › Article

8 Scopus citations


In this study, we propose a novel method for training a regression function and apply it to a voice conversion task. The regression function is constructed using a Stacked Joint-Autoencoder (SJAE). Previously, we used a more primitive version of this architecture for pre-training a Deep Neural Network (DNN). Using objective evaluation criteria, we show that the lower levels of the SJAE perform best with a low degree of jointness, and the higher levels with a higher degree of jointness. We demonstrate that our proposed approach generates features that do not suffer from the averaging effect inherent in backpropagation training. We also carried out subjective listening experiments to evaluate speech quality and speaker similarity. Our results show that the SJAE approach achieves both higher quality and higher similarity than an SJAE+DNN approach, in which the SJAE is used for pre-training a DNN and the fine-tuned DNN is then used for mapping. We also present the system description and results of our submission to the Voice Conversion Challenge 2016.
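To illustrate the core idea behind a joint-autoencoder, the following is a minimal sketch of one layer's loss function, assuming linear encoders/decoders with a tanh hidden layer and a squared-error "jointness" penalty that ties the source and target hidden codes together. All names, dimensions, and the exact form of the penalty are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 24-dim spectral features, 10-dim code, 5 frames.
D, H, N = 24, 10, 5

# Separate encoder/decoder weights for the source and target speakers
# (randomly initialized here; in practice these would be trained).
We_s, Wd_s = rng.standard_normal((H, D)) * 0.1, rng.standard_normal((D, H)) * 0.1
We_t, Wd_t = rng.standard_normal((H, D)) * 0.1, rng.standard_normal((D, H)) * 0.1

def encode(W, x):
    # Nonlinear hidden code for a (D, N) batch of feature frames.
    return np.tanh(W @ x)

def decode(W, h):
    # Linear reconstruction from the (H, N) hidden code.
    return W @ h

def joint_ae_loss(x_s, x_t, lam):
    """Reconstruction losses for both speakers plus a 'jointness' penalty
    tying the two hidden codes together; lam controls the degree of
    jointness (lam = 0 reduces to two independent autoencoders)."""
    h_s, h_t = encode(We_s, x_s), encode(We_t, x_t)
    rec_s = np.mean((decode(Wd_s, h_s) - x_s) ** 2)
    rec_t = np.mean((decode(Wd_t, h_t) - x_t) ** 2)
    joint = np.mean((h_s - h_t) ** 2)
    return rec_s + rec_t + lam * joint

# Time-aligned source/target frames (random stand-ins for aligned speech).
x_s, x_t = rng.standard_normal((D, N)), rng.standard_normal((D, N))
print(joint_ae_loss(x_s, x_t, lam=0.5))
```

Under this sketch, conversion would encode a source frame with the source encoder and decode it with the target decoder; stacking such layers, with a lower `lam` at the bottom and a higher `lam` at the top, mirrors the paper's finding that lower levels benefit from less jointness and higher levels from more.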

Original language: English (US)
Pages (from-to): 1647-1651
Number of pages: 5
Journal: Unknown Journal
State: Published - 2016


Keywords

  • Autoencoder
  • Deep neural network
  • Joint-autoencoder
  • Voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
