Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity

Mahsa Sadat Elyasi Langarani; Jan Van Santen

doi:10.1109/ASRU.2015.7404783

Institute on Development and Disability

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

We propose a new intonation adaptation method that transforms the perceived identity of a text-to-speech (TTS) system to that of a target speaker using a small amount of training data. During training, we fit parametrized accent and phrase curves to the F0 curves of parallel recordings of the target speaker and estimate the parameters of a mapping between the corresponding parameter spaces. At test time, we fit the accent and phrase curves to the source utterances, apply the mapping, and create an F0 contour from the mapped accent and phrase curves. We compare the proposed method with a baseline adaptation method in which the source F0 contour is transformed linearly such that the per-utterance mean and variance of the target F0 contour are preserved. In two subjective perceptual tests, one assessing similarity to the target speaker and one assessing speech quality, the proposed method outperformed the baseline.
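The baseline mentioned in the abstract can be illustrated with a short sketch. It assumes one common reading of that sentence: the voiced F0 values of a source utterance are shifted and scaled linearly so that their mean and standard deviation equal those measured on the corresponding target utterance. The function name, the use of 0.0 to mark unvoiced frames, and the choice to work in Hz rather than log-F0 are illustrative assumptions, not details taken from the paper.

```python
from statistics import fmean, pstdev


def mean_variance_transform(source_f0, target_mean, target_std):
    """Linearly map voiced F0 values onto a given mean and standard deviation.

    Assumption (not from the paper): unvoiced frames are marked with 0.0
    and are passed through unchanged.
    """
    voiced = [f for f in source_f0 if f > 0.0]
    if not voiced:
        # Nothing to transform in a fully unvoiced utterance.
        return list(source_f0)

    src_mean = fmean(voiced)
    src_std = pstdev(voiced)
    # A flat source contour has no spread to rescale, so scale is 0 and
    # every voiced frame is placed at the target mean.
    scale = target_std / src_std if src_std > 0.0 else 0.0

    return [
        target_mean + scale * (f - src_mean) if f > 0.0 else 0.0
        for f in source_f0
    ]


# Hypothetical per-utterance statistics for the target speaker.
source = [100.0, 0.0, 120.0, 140.0, 0.0]
adapted = mean_variance_transform(source, target_mean=220.0, target_std=30.0)
```

After the transform, the voiced frames of `adapted` have the requested mean and standard deviation, while the shape of the source contour is kept up to a shift and a scale. This is the limitation the proposed method addresses by mapping accent and phrase curve parameters instead of raw F0 values.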

Original language	English (US)
Title of host publication	2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	116-123
Number of pages	8
ISBN (Electronic)	9781479972913
DOIs	https://doi.org/10.1109/ASRU.2015.7404783
State	Published - Feb 10 2016
Event	IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Scottsdale, United States
Duration	Dec 13 2015 → Dec 17 2015

Publication series

Name	2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings

Other

Other	IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015
Country/Territory	United States
City	Scottsdale
Period	12/13/15 → 12/17/15

Keywords

Adaptation
Intonation modeling
Prosody
Text-to-Speech synthesis

ASJC Scopus subject areas

Artificial Intelligence
Computer Networks and Communications
Computer Vision and Pattern Recognition

Cite this

Langarani, M. S. E., & Van Santen, J. (2016). Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings (pp. 116-123). Article 7404783. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2015.7404783

@inproceedings{2da4e59e0e6744f49ab389788ec70980,
title = "Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity",
abstract = "We propose a new intonation adaptation method that transforms the perceived identity of a text-to-speech (TTS) system to that of a target speaker using a small amount of training data. During training, we fit parametrized accent and phrase curves to the F0 curves of parallel recordings of the target speaker and estimate the parameters of a mapping between the corresponding parameter spaces. At test time, we fit the accent and phrase curves to the source utterances, apply the mapping, and create an F0 contour from the mapped accent and phrase curves. We compare the proposed method with a baseline adaptation method in which the source F0 contour is transformed linearly such that the per-utterance mean and variance of the target F0 contour are preserved. In two subjective perceptual tests, one assessing similarity to the target speaker and one assessing speech quality, the proposed method outperformed the baseline.",
keywords = "Adaptation, Intonation modeling, Prosody, Text-to-Speech synthesis",
author = "Langarani, {Mahsa Sadat Elyasi} and {Van Santen}, Jan",
note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 ; Conference date: 13-12-2015 Through 17-12-2015",
year = "2016",
month = feb,
day = "10",
doi = "10.1109/ASRU.2015.7404783",
language = "English (US)",
series = "2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "116--123",
booktitle = "2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings",
}

TY - GEN
T1 - Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity
AU - Langarani, Mahsa Sadat Elyasi
AU - Van Santen, Jan
PY - 2016/2/10
Y1 - 2016/2/10
AB - We propose a new intonation adaptation method that transforms the perceived identity of a text-to-speech (TTS) system to that of a target speaker using a small amount of training data. During training, we fit parametrized accent and phrase curves to the F0 curves of parallel recordings of the target speaker and estimate the parameters of a mapping between the corresponding parameter spaces. At test time, we fit the accent and phrase curves to the source utterances, apply the mapping, and create an F0 contour from the mapped accent and phrase curves. We compare the proposed method with a baseline adaptation method in which the source F0 contour is transformed linearly such that the per-utterance mean and variance of the target F0 contour are preserved. In two subjective perceptual tests, one assessing similarity to the target speaker and one assessing speech quality, the proposed method outperformed the baseline.
KW - Adaptation
KW - Intonation modeling
KW - Prosody
KW - Text-to-Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=84964555662&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84964555662&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2015.7404783
DO - 10.1109/ASRU.2015.7404783
M3 - Conference contribution
AN - SCOPUS:84964555662
T3 - 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
SP - 116
EP - 123
BT - 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015
Y2 - 13 December 2015 through 17 December 2015
ER -
