TY - GEN
T1 - Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity
AU - Langarani, Mahsa Sadat Elyasi
AU - Van Santen, Jan
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2016/2/10
Y1 - 2016/2/10
N2 - In this study, we propose a new intonation adaptation method that transforms the perceived speaker identity of a text-to-speech system to that of a target speaker using a small amount of training data. In the proposed method, during training we fit parameterized accent and phrase curves to the F0 curves of parallel target-speaker recordings, and estimate the parameters of a mapping between the corresponding parameter spaces. At test time, we fit the accent and phrase curves to the source utterances, apply the mapping, and generate an F0 contour from the mapped accent and phrase curves. We compare the proposed method with a baseline adaptation method in which the source F0 contour is transformed linearly such that the per-utterance mean and variance of the target F0 contour are left unaltered. Perceptual tests showed that the proposed method outperformed the baseline in two subjective tests, which assessed similarity to the target speaker and speech quality, respectively.
KW - Adaptation
KW - Intonation modeling
KW - Prosody
KW - Text-to-Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=84964555662&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84964555662&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2015.7404783
DO - 10.1109/ASRU.2015.7404783
M3 - Conference contribution
AN - SCOPUS:84964555662
T3 - 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
SP - 116
EP - 123
BT - 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015
Y2 - 13 December 2015 through 17 December 2015
ER -