Integrating articulatory information in deep learning-based text-to-speech synthesis

Beiming Cao; Myungjong Kim; Jan Van Santen; Ted Mau; Jun Wang

doi:10.21437/Interspeech.2017-1762

Institute on Development and Disability

Research output: Contribution to journal › Conference article › peer-review

12 Scopus citations

Abstract

Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated in deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgment by human listeners showed that the approaches that integrated articulatory information outperformed the baseline approach (without using articulatory information) in terms of naturalness and speaker voice identity (voice similarity).
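
The two integration schemes described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific below is an assumption made for illustration only: the feature dimensions, the single-hidden-layer networks with random (untrained) weights, and the weighted average used to combine the two acoustic streams are hypothetical, since the abstract does not specify the network sizes or the combination rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the abstract does not give the real ones.
N_LINGUISTIC = 40   # input linguistic features per frame
N_ACOUSTIC = 25     # acoustic features per frame
N_ARTIC = 6         # articulatory (tongue and lip) features per frame
N_HIDDEN = 32


def make_mlp(n_in, n_hidden, n_out):
    """Return randomly initialised weights for a one-hidden-layer network."""
    return {
        "w1": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(scale=0.1, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    hidden = np.tanh(x @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


def direct_integration(main_net, linguistic):
    """Scheme (1): one network emits acoustic and articulatory features jointly."""
    out = forward(main_net, linguistic)
    return out[:, :N_ACOUSTIC], out[:, N_ACOUSTIC:]


def direct_plus_forward_mapping(main_net, mapping_net, linguistic, weight=0.5):
    """Scheme (2): map predicted articulation to acoustics, then combine.

    The weighted average is an assumed combination rule, not the paper's.
    """
    acoustic, artic = direct_integration(main_net, linguistic)
    mapped = forward(mapping_net, artic)  # forward-mapped acoustic features
    return weight * acoustic + (1.0 - weight) * mapped


main_net = make_mlp(N_LINGUISTIC, N_HIDDEN, N_ACOUSTIC + N_ARTIC)
mapping_net = make_mlp(N_ARTIC, N_HIDDEN, N_ACOUSTIC)
frames = rng.normal(size=(10, N_LINGUISTIC))  # 10 frames of dummy input

acoustic, artic = direct_integration(main_net, frames)
final_acoustic = direct_plus_forward_mapping(main_net, mapping_net, frames)
print(acoustic.shape, artic.shape, final_acoustic.shape)
```

The sketch only demonstrates the data flow: in scheme (1) the articulatory features act as an extra output target, while in scheme (2) they additionally feed a second network whose acoustic prediction is merged with the first.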

Original language	English (US)
Pages (from-to)	254-258
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2017-August
DOIs	https://doi.org/10.21437/Interspeech.2017-1762
State	Published - 2017
Event	18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration	Aug 20 2017 → Aug 24 2017

Keywords

Text-to-speech synthesis
articulatory data
deep learning
deep neural network

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Cite this

@article{7d967335f3094fd2af8f01c44713f614,

title = "Integrating articulatory information in deep learning-based text-to-speech synthesis",

abstract = "Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated in deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgment by human listeners showed that the approaches that integrated articulatory information outperformed the baseline approach (without using articulatory information) in terms of naturalness and speaker voice identity (voice similarity).",

keywords = "Text-to-speech synthesis, articulatory data, deep learning, deep neural network",

author = "Beiming Cao and Myungjong Kim and {Van Santen}, Jan and Ted Mau and Jun Wang",

note = "Publisher Copyright: Copyright {\textcopyright} 2017 ISCA.; 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 ; Conference date: 20-08-2017 Through 24-08-2017",

year = "2017",

doi = "10.21437/Interspeech.2017-1762",

language = "English (US)",

volume = "2017-August",

pages = "254--258",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Integrating articulatory information in deep learning-based text-to-speech synthesis

AU - Cao, Beiming

AU - Kim, Myungjong

AU - Van Santen, Jan

AU - Mau, Ted

AU - Wang, Jun

PY - 2017

Y1 - 2017

N2 - Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated in deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgment by human listeners showed that the approaches that integrated articulatory information outperformed the baseline approach (without using articulatory information) in terms of naturalness and speaker voice identity (voice similarity).

KW - Text-to-speech synthesis

KW - articulatory data

KW - deep learning

KW - deep neural network

UR - http://www.scopus.com/inward/record.url?scp=85039167284&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039167284&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2017-1762

DO - 10.21437/Interspeech.2017-1762

M3 - Conference article

AN - SCOPUS:85039167284

SN - 2308-457X

VL - 2017-August

SP - 254

EP - 258

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017

Y2 - 20 August 2017 through 24 August 2017

ER -
