Foot-based intonation for text-to-speech synthesis using neural networks

Mahsa Sadat Elyasi Langarani, Jan Van Santen

Research output: Contribution to journal (Article)

Abstract

We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, each annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising phrase curves and accent curves, the latter associated with feet. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template-based variant of FONN (“DRIFT”) by imposing contours generated by each method onto natural speech and obtaining quality ratings. Test sets varied in their degree of coverage by the training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
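
As a rough illustration of the foot segmentation step described in the abstract, the sketch below groups a sequence of syllables into left-headed feet (each foot begins at a stressed syllable and runs up to the next one) and records the annotations the abstract lists. It is not the authors' implementation. The names `Syllable`, `Foot`, and `segment_into_feet` are hypothetical, as are the position labels ("initial", "medial", "final") and the handling of unstressed syllables that precede the first stress.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    start: float    # start time in seconds
    end: float      # end time in seconds
    stressed: bool  # True if the syllable can head a foot


@dataclass
class Foot:
    syllables: List[Syllable]
    position: str   # assumed labels: "initial", "medial", "final"

    @property
    def num_syllables(self) -> int:
        return len(self.syllables)


def segment_into_feet(syllables: List[Syllable]) -> List[Foot]:
    """Group syllables into left-headed feet.

    A new foot starts at every stressed syllable. Assumption: unstressed
    syllables before the first stress form a headless initial foot.
    """
    groups: List[List[Syllable]] = []
    for syl in syllables:
        if syl.stressed or not groups:
            groups.append([syl])     # open a new foot
        else:
            groups[-1].append(syl)   # extend the current foot

    feet: List[Foot] = []
    last = len(groups) - 1
    for i, group in enumerate(groups):
        # Assumption: a sentence with a single foot is labeled "initial".
        if i == 0:
            position = "initial"
        elif i == last:
            position = "final"
        else:
            position = "medial"
        feet.append(Foot(group, position))
    return feet
```

For example, a stress pattern of strong-weak-weak-strong-strong-weak yields three feet of three, one, and two syllables. In a pipeline like the one the abstract outlines, the per-foot annotations would then serve as inputs for estimating the accent-curve parameters.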

Original language: English (US)
Pages (from-to): 1009-1013
Number of pages: 5
Journal: Unknown Journal
Volume: 2016-January
State: Published - 2016

Keywords

  • Artificial neural networks
  • Intonation modeling
  • Prosody
  • Text-to-speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
