Synthesis of prosody using multi-level unit sequences

Jan Van Santen; Alexander Kain; Esther Klabbers; Taniya Mishra

doi:10.1016/j.specom.2005.01.008

Institute on Development and Disability

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Generating meaningful and natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes two new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residual) curves. During synthesis, the corpus is searched for multiple unit sequences: a unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher-level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts and optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.
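The additive recombination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `resample` and `transplant`, the use of linear interpolation for time alignment, and the toy curve values are all assumptions made for illustration. The sketch assumes the three component curves (phrase, accent, residual) have already been extracted in the log-pitch domain, possibly from different corpus sequences, and that the residual curve comes from the sequence matching the target phone string, so its length sets the output length.

```python
import math

def resample(curve, n):
    """Linearly interpolate a non-empty `curve` to exactly n samples (n >= 1).

    Linear time-warping is an assumption of this sketch, not a detail
    taken from the paper.
    """
    if n == 1 or len(curve) == 1:
        return [curve[0]] * n
    scale = (len(curve) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * scale
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(curve) - 1)
        frac = pos - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out

def transplant(phrase_curve, accent_curve, residual_curve):
    """Recombine log-domain component curves into a pitch contour in Hz.

    The phrase and accent curves may come from donor sequences of other
    lengths; both are warped to the length of the residual curve, then the
    three are summed in the log domain and exponentiated.
    """
    n = len(residual_curve)
    phrase = resample(phrase_curve, n)
    accent = resample(accent_curve, n)
    log_f0 = [p + a + r for p, a, r in zip(phrase, accent, residual_curve)]
    return [math.exp(v) for v in log_f0]

# Toy values (hypothetical): a phrase curve declining from 120 Hz to 100 Hz,
# a single accent bump, and a flat residual over five frames.
f0 = transplant(
    phrase_curve=[math.log(120.0), math.log(100.0)],
    accent_curve=[0.0, 0.2, 0.0],
    residual_curve=[0.0] * 5,
)
```

Because the components add in the log domain, each one acts multiplicatively on the pitch in Hz, which is why the accent bump scales the declining phrase curve rather than shifting it by a fixed number of Hz.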

Original language	English (US)
Pages (from-to)	365-375
Number of pages	11
Journal	Speech Communication
Volume	46
Issue number	3-4
DOIs	https://doi.org/10.1016/j.specom.2005.01.008
State	Published - Jul 2005

ASJC Scopus subject areas

Software
Modeling and Simulation
Communication
Language and Linguistics
Linguistics and Language
Computer Vision and Pattern Recognition
Computer Science Applications

Cite this

@article{0d20f64a395e46528d02463f89f3aa0c,

title = "Synthesis of prosody using multi-level unit sequences",

abstract = "Generating meaningful and natural sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residuals) curves. During synthesis, the corpus is searched for multiple unit sequences: A unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts, optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.",

author = "{Van Santen}, Jan and Alexander Kain and Esther Klabbers and Taniya Mishra",

note = "Funding Information: We thank Xiaochuan Niu, Johan Wouters, and Paul Hosom for insightful comments. We owe several of the key ideas to Mike Macon, who died in 2002, and to whom this paper is dedicated. This material is based on work supported by the National Science Foundation under Grants No. 0205731 (“ITR: Prosody Generation for Child Oriented Speech Synthesis”), jointly with Alan Black and Richard Sproat; 0313383 (“ITR: Objective Methods for Predicting and Optimizing Synthetic Speech Quality”); and 0082718 (“ITR: Modeling Degree of Articulation for Speech Synthesis”). ",

year = "2005",

month = jul,

doi = "10.1016/j.specom.2005.01.008",

language = "English (US)",

volume = "46",

pages = "365--375",

journal = "Speech Communication",

issn = "0167-6393",

publisher = "Elsevier",

number = "3-4",

}

TY - JOUR

T1 - Synthesis of prosody using multi-level unit sequences

AU - Van Santen, Jan

AU - Kain, Alexander

AU - Klabbers, Esther

AU - Mishra, Taniya

N1 - Funding Information: We thank Xiaochuan Niu, Johan Wouters, and Paul Hosom for insightful comments. We owe several of the key ideas to Mike Macon, who died in 2002, and to whom this paper is dedicated. This material is based on work supported by the National Science Foundation under Grants No. 0205731 (“ITR: Prosody Generation for Child Oriented Speech Synthesis”), jointly with Alan Black and Richard Sproat; 0313383 (“ITR: Objective Methods for Predicting and Optimizing Synthetic Speech Quality”); and 0082718 (“ITR: Modeling Degree of Articulation for Speech Synthesis”).

PY - 2005/7

Y1 - 2005/7

N2 - Generating meaningful and natural sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residuals) curves. During synthesis, the corpus is searched for multiple unit sequences: A unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts, optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.

AB - Generating meaningful and natural sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residuals) curves. During synthesis, the corpus is searched for multiple unit sequences: A unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts, optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.

UR - http://www.scopus.com/inward/record.url?scp=21844466234&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=21844466234&partnerID=8YFLogxK

U2 - 10.1016/j.specom.2005.01.008

DO - 10.1016/j.specom.2005.01.008

M3 - Article

AN - SCOPUS:21844466234

SN - 0167-6393

VL - 46

SP - 365

EP - 375

JO - Speech Communication

JF - Speech Communication

IS - 3-4

ER -
