Data-driven foot-based intonation generator for text-to-speech synthesis

Mahsa Sadat Elyasi Langarani, Jan Van Santen, Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.
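The synthesis-time search described in the abstract can be illustrated with a small sketch. The abstract does not state the search algorithm, the distance measure, or how the two criteria are weighted, so everything here is an assumption made for illustration: a Viterbi-style dynamic program, Euclidean distance in feature space as the match cost, the absolute difference of accent curve heights as the continuity cost, and the hypothetical names select_accent_curves, inventory, and height_weight.

```python
import math


def select_accent_curves(targets, inventory, height_weight=1.0):
    """Pick one stored accent curve per target foot by dynamic programming.

    targets:   list of feature vectors (tuples of floats), one per foot to synthesize.
    inventory: list of (features, height) pairs for stored, fitted accent curves.
    Returns the chosen inventory indices, one per target foot.
    All cost definitions are illustrative assumptions, not the paper's.
    """
    if not targets or not inventory:
        return []

    def target_cost(target, j):
        # Assumed match cost: Euclidean distance in feature space.
        return math.dist(target, inventory[j][0])

    def join_cost(i, j):
        # Assumed continuity cost: difference between successive curve heights.
        return height_weight * abs(inventory[i][1] - inventory[j][1])

    n = len(inventory)
    best = [target_cost(targets[0], j) for j in range(n)]
    back = []  # back[t][j] = best predecessor of candidate j at foot t + 1
    for target in targets[1:]:
        prev, best, pointers = best, [], []
        for j in range(n):
            i = min(range(n), key=lambda k: prev[k] + join_cost(k, j))
            best.append(prev[i] + join_cost(i, j) + target_cost(target, j))
            pointers.append(i)
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(n), key=lambda k: best[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]


# Toy inventory: (features, accent curve height). Values are made up.
inventory = [
    ((0.0, 2.0), 10.0),
    ((0.0, 2.0), 50.0),
    ((10.0, 3.0), 12.0),
    ((10.0, 3.0), 60.0),
]
targets = [(0.0, 2.0), (10.0, 3.0)]
print(select_accent_curves(targets, inventory))  # -> [0, 2]
```

In this toy example both candidates for each foot match the target features equally well, so the height term decides: the pair with heights 10 and 12 wins over any pairing that involves a larger jump.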

Original language: English (US)
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech and Communication Association
Pages: 1596-1600
Number of pages: 5
Volume: 2015-January
State: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 to Sep 10 2015

Keywords

  • Intonation modeling
  • Prosody
  • Text-to-Speech Synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Langarani, M. S. E., Van Santen, J., Mohammadi, S. H., & Kain, A. (2015). Data-driven foot-based intonation generator for text-to-speech synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 1596-1600). International Speech and Communication Association.

@inproceedings{8ce2deea673a417ea80f6b58d1a877bb,
title = "Data-driven foot-based intonation generator for text-to-speech synthesis",
abstract = "We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.",
keywords = "Intonation modeling, Prosody, Text-to-Speech Synthesis",
author = "Langarani, {Mahsa Sadat Elyasi} and {Van Santen}, Jan and Mohammadi, {Seyed Hamidreza} and Kain, Alexander",
year = "2015",
language = "English (US)",
volume = "2015-January",
pages = "1596--1600",
booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech and Communication Association",

}

TY - GEN

T1 - Data-driven foot-based intonation generator for text-to-speech synthesis

AU - Langarani, Mahsa Sadat Elyasi

AU - Van Santen, Jan

AU - Mohammadi, Seyed Hamidreza

AU - Kain, Alexander

PY - 2015

Y1 - 2015

AB - We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.

KW - Intonation modeling

KW - Prosody

KW - Text-to-Speech Synthesis

UR - http://www.scopus.com/inward/record.url?scp=84959075368&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959075368&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84959075368

VL - 2015-January

SP - 1596

EP - 1600

BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

PB - International Speech and Communication Association

ER -