TY - JOUR
T1 - Foot-based intonation for text-to-speech synthesis using neural networks
AU - Langarani, Mahsa Sadat Elyasi
AU - van Santen, Jan
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. 0964468.
Publisher Copyright:
© 2016, International Speech Communication Association. All rights reserved.
PY - 2016
Y1 - 2016
AB - We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template-based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
KW - Artificial neural networks
KW - Intonation modeling
KW - Prosody
KW - Text-to-speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=84982994414&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84982994414&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84982994414
SN - 2333-2042
VL - 2016-January
SP - 1009
EP - 1013
JO - Proceedings of the International Conference on Speech Prosody
JF - Proceedings of the International Conference on Speech Prosody
T2 - 8th Speech Prosody 2016
Y2 - 31 May 2016 through 3 June 2016
ER -