Assignment of segmental duration in text-to-speech synthesis

Research output: Contribution to journal › Article

87 Citations (Scopus)

Abstract

In natural speech, durations of phonetic segments are strongly dependent on contextual factors. For synthetic speech to sound natural, the module for computing segmental duration (the duration system) must mimic these contextual effects as closely as possible. Construction of a duration system is obstructed by two facets of segmental duration: (1) interactions between contextual factors, and (2) sparsity of training data. This paper describes a new duration system in which a central role is played by duration models, in the form of equations consisting of sums and products such as in: duration(/i/, voiced, stressed) = A(/i/) + B(voiced) + C(stressed). These models, which we call sums-of-products models, can capture the types of interaction patterns often found in duration data, where one factor typically amplifies, but does not reverse, the effects of other factors. Yet these models are sufficiently tractable mathematically for robust parameter estimation in the presence of severe sparsity. The overall architecture of the system consists of a category structure, or tree, that divides the space into similarly behaved cases; for each of these categories a separate sums-of-products model is developed and its parameters are estimated. Perceptual evaluation results are reported for an implementation in the AT&T Bell Laboratories text-to-speech system.
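
As a rough illustration of the model form named in the abstract, the sketch below evaluates a sums-of-products equation as a sum of terms, where each term is a product of per-factor parameters. Everything concrete in it is an assumption made for illustration: the parameter values, the multiplicative stress scale `S`, the `WITH_PRODUCT` model, the category names, and the function names are hypothetical and are not taken from the paper. Only the additive form A(/i/) + B(voiced) + C(stressed) comes from the abstract.

```python
from math import prod

# Hypothetical parameter tables; values are illustrative only
# (milliseconds for A, B, C; unitless for S), not estimates from the paper.
PARAMS = {
    "A": {"/i/": 80.0, "/a/": 100.0},            # per-segment term
    "B": {"voiced": 25.0, "voiceless": 0.0},     # voicing term
    "C": {"stressed": 15.0, "unstressed": 0.0},  # additive stress term
    "S": {"stressed": 1.25, "unstressed": 1.0},  # assumed multiplicative stress scale
}

# A model is a list of terms. Each term is a list of (table, factor) pairs;
# the looked-up values within a term are multiplied, then the terms are summed.
ADDITIVE = [[("A", "segment")], [("B", "voicing")], [("C", "stress")]]
WITH_PRODUCT = [[("A", "segment"), ("S", "stress")], [("B", "voicing")]]

# Hypothetical category structure: each category has its own model.
MODEL_BY_CATEGORY = {"vowel": WITH_PRODUCT, "other": ADDITIVE}


def predict_duration(model, context, params=PARAMS):
    """Evaluate a sums-of-products model for one segment in its context."""
    return sum(
        prod(params[table][context[factor]] for table, factor in term)
        for term in model
    )


def predict_for_category(category, context):
    """Pick the model assigned to a category, then evaluate it."""
    return predict_duration(MODEL_BY_CATEGORY[category], context)


ctx = {"segment": "/i/", "voicing": "voiced", "stress": "stressed"}
print(predict_duration(ADDITIVE, ctx))      # additive form from the abstract
print(predict_for_category("vowel", ctx))   # assumed form with one product term
```

In the assumed product model, stress lengthens a longer segment by more milliseconds than a shorter one but never shortens it, which is one way to picture a factor that amplifies without reversing the effects of other factors.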

Original language: English (US)
Pages (from-to): 95-128
Number of pages: 34
Journal: Computer Speech and Language
Volume: 8
Issue number: 2
DOIs: 10.1006/csla.1994.1005
State: Published - Apr 1994
Externally published: Yes

Fingerprint

Text-to-speech
Speech Synthesis
Phonetics
Assignment
Product Model
Sparsity
Speech analysis
Parameter estimation
Duration Models
Acoustic waves
Robust Estimation
interaction pattern
Interaction
Facet
Divides
Reverse
Module

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Experimental and Cognitive Psychology
  • Linguistics and Language

Cite this

Assignment of segmental duration in text-to-speech synthesis. / Van Santen, Jan.

In: Computer Speech and Language, Vol. 8, No. 2, 04.1994, p. 95-128.

@article{7702b1205c714c85a031af1da02af9b7,
title = "Assignment of segmental duration in text-to-speech synthesis",
abstract = "In natural speech, durations of phonetic segments are strongly dependent on contextual factors. For synthetic speech to sound natural, the module for computing segmental duration (the duration system) must mimic these contextual effects as closely as possible. Construction of a duration system is obstructed by two facets of segmental duration: (1) interactions between contextual factors, and (2) sparsity of training data. This paper describes a new duration system in which a central role is played by duration models, in the form of equations consisting of sums and products such as in: duration(/i/, voiced, stressed) = A(/i/) + B(voiced) + C(stressed). These models, which we call sums-of-products models, can capture the types of interaction patterns often found in duration data, where one factor typically amplifies, but does not reverse, the effects of other factors. Yet these models are sufficiently tractable mathematically for robust parameter estimation in the presence of severe sparsity. The overall architecture of the system consists of a category structure, or tree, that divides the space into similarly behaved cases; for each of these categories a separate sums-of-products model is developed and its parameters are estimated. Perceptual evaluation results are reported for an implementation in the AT\&T Bell Laboratories text-to-speech system.",
author = "{Van Santen}, Jan",
year = "1994",
month = "4",
doi = "10.1006/csla.1994.1005",
language = "English (US)",
volume = "8",
pages = "95--128",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "2",

}

TY - JOUR

T1 - Assignment of segmental duration in text-to-speech synthesis

AU - Van Santen, Jan

PY - 1994/4

Y1 - 1994/4

N2 - In natural speech, durations of phonetic segments are strongly dependent on contextual factors. For synthetic speech to sound natural, the module for computing segmental duration (the duration system) must mimic these contextual effects as closely as possible. Construction of a duration system is obstructed by two facets of segmental duration: (1) interactions between contextual factors, and (2) sparsity of training data. This paper describes a new duration system in which a central role is played by duration models, in the form of equations consisting of sums and products such as in: duration(/i/, voiced, stressed) = A(/i/) + B(voiced) + C(stressed). These models, which we call sums-of-products models, can capture the types of interaction patterns often found in duration data, where one factor typically amplifies, but does not reverse, the effects of other factors. Yet these models are sufficiently tractable mathematically for robust parameter estimation in the presence of severe sparsity. The overall architecture of the system consists of a category structure, or tree, that divides the space into similarly behaved cases; for each of these categories a separate sums-of-products model is developed and its parameters are estimated. Perceptual evaluation results are reported for an implementation in the AT&T Bell Laboratories text-to-speech system.

AB - In natural speech, durations of phonetic segments are strongly dependent on contextual factors. For synthetic speech to sound natural, the module for computing segmental duration (the duration system) must mimic these contextual effects as closely as possible. Construction of a duration system is obstructed by two facets of segmental duration: (1) interactions between contextual factors, and (2) sparsity of training data. This paper describes a new duration system in which a central role is played by duration models, in the form of equations consisting of sums and products such as in: duration(/i/, voiced, stressed) = A(/i/) + B(voiced) + C(stressed). These models, which we call sums-of-products models, can capture the types of interaction patterns often found in duration data, where one factor typically amplifies, but does not reverse, the effects of other factors. Yet these models are sufficiently tractable mathematically for robust parameter estimation in the presence of severe sparsity. The overall architecture of the system consists of a category structure, or tree, that divides the space into similarly behaved cases; for each of these categories a separate sums-of-products model is developed and its parameters are estimated. Perceptual evaluation results are reported for an implementation in the AT&T Bell Laboratories text-to-speech system.

UR - http://www.scopus.com/inward/record.url?scp=0028405296&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0028405296&partnerID=8YFLogxK

U2 - 10.1006/csla.1994.1005

DO - 10.1006/csla.1994.1005

M3 - Article

AN - SCOPUS:0028405296

VL - 8

SP - 95

EP - 128

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

IS - 2

ER -