TY - JOUR
T1 - Synthesis of prosody using multi-level unit sequences
AU - Van Santen, Jan
AU - Kain, Alexander
AU - Klabbers, Esther
AU - Mishra, Taniya
N1 - Funding Information: We thank Xiaochuan Niu, Johan Wouters, and Paul Hosom for insightful comments. We owe several of the key ideas to Mike Macon, who died in 2002, and to whom this paper is dedicated. This material is based on work supported by the National Science Foundation under Grants No. 0205731 (“ITR: Prosody Generation for Child Oriented Speech Synthesis”), jointly with Alan Black and Richard Sproat; 0313383 (“ITR: Objective Methods for Predicting and Optimizing Synthetic Speech Quality”); and 0082718 (“ITR: Modeling Degree of Articulation for Speech Synthesis”).
PY - 2005/7
Y1 - 2005/7
AB - Generating meaningful and natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes two new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residual) curves. During synthesis, the corpus is searched for multiple unit sequences: a unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts and optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.
UR - http://www.scopus.com/inward/record.url?scp=21844466234&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=21844466234&partnerID=8YFLogxK
U2 - 10.1016/j.specom.2005.01.008
DO - 10.1016/j.specom.2005.01.008
M3 - Article
AN - SCOPUS:21844466234
SN - 0167-6393
VL - 46
SP - 365
EP - 375
JO - Speech Communication
JF - Speech Communication
IS - 3-4
ER -