Data-driven foot-based intonation generator for text-to-speech synthesis

Mahsa Sadat Elyasi Langarani, Jan Van Santen, Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Chapter in Book/Report/Conference proceedingConference contribution

3 Scopus citations

Abstract

We propose a method for generating F0 contours for text-tospeech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMMbased Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively highquality F0 contours, especially when training data are sparse and when mark-up is required.

Original languageEnglish (US)
Title of host publicationProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PublisherInternational Speech and Communication Association
Pages1596-1600
Number of pages5
Volume2015-January
StatePublished - 2015
Event16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015Sep 10 2015

Other

Other16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
CountryGermany
CityDresden
Period9/6/159/10/15

Keywords

  • Intonation modeling
  • Prosody
  • Text-to-Speech Synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint Dive into the research topics of 'Data-driven foot-based intonation generator for text-to-speech synthesis'. Together they form a unique fingerprint.

  • Cite this

    Langarani, M. S. E., Van Santen, J., Mohammadi, S. H., & Kain, A. (2015). Data-driven foot-based intonation generator for text-to-speech synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 1596-1600). International Speech and Communication Association.