TY - GEN
T1 - Voice conversion using deep neural networks with speaker-independent pre-training
AU - Mohammadi, Seyed Hamidreza
AU - Kain, Alexander
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/4/1
Y1 - 2014/4/1
N2 - In this study, we trained a deep autoencoder to build compact representations of short-term spectra of multiple speakers. Using this compact representation as mapping features, we then trained an artificial neural network to predict target voice features from source voice features. Finally, we constructed a deep neural network from the trained deep autoencoder and artificial neural network weights, which were then fine-tuned using back-propagation. We compared the proposed method to existing methods using Gaussian mixture models and frame-selection. We evaluated the methods objectively, and also conducted perceptual experiments to measure both the conversion accuracy and speech quality of selected systems. The results showed that, for 70 training sentences, frame-selection performed best, regarding both accuracy and quality. When using only two training sentences, the pre-trained deep neural network performed best, regarding both accuracy and quality.
KW - Autoencoder
KW - Deep neural network
KW - Pre-training
KW - Voice conversion
UR - http://www.scopus.com/inward/record.url?scp=84946685887&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84946685887&partnerID=8YFLogxK
U2 - 10.1109/SLT.2014.7078543
DO - 10.1109/SLT.2014.7078543
M3 - Conference contribution
AN - SCOPUS:84946685887
T3 - 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings
SP - 19
EP - 23
BT - 2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE Workshop on Spoken Language Technology, SLT 2014
Y2 - 7 December 2014 through 10 December 2014
ER -