Recent innovations in speech-to-text transcription at SRI-ICSI-UW

Andreas Stolcke, Barry Chen, Horacio Franco, Venkata Ramana Rao Gadde, Martin Graciarena, Mei Yuh Hwang, Katrin Kirchhoff, Arindam Mandal, Nelson Morgan, Xin Lei, Tim Ng, Mari Ostendorf, Mustafa (Kemal) Sonmez, Anand Venkataraman, Dimitra Vergyri, Wen Wang, Jing Zheng, Qifeng Zhu

Research output: Contribution to journal › Article

62 Citations (Scopus)

Abstract

We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.

Original language: English (US)
Pages (from-to): 1729-1742
Number of pages: 14
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 14
Issue number: 5
DOI: https://doi.org/10.1109/TASL.2006.879807
State: Published - Sep 2006
Externally published: Yes

Keywords

  • Broadcast news (BN)
  • Conversational telephone speech (CTS)
  • Speech-to-text (STT)

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Stolcke, A., Chen, B., Franco, H., Gadde, V. R. R., Graciarena, M., Hwang, M. Y., ... Zhu, Q. (2006). Recent innovations in speech-to-text transcription at SRI-ICSI-UW. IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1729-1742. https://doi.org/10.1109/TASL.2006.879807

Recent innovations in speech-to-text transcription at SRI-ICSI-UW. / Stolcke, Andreas; Chen, Barry; Franco, Horacio; Gadde, Venkata Ramana Rao; Graciarena, Martin; Hwang, Mei Yuh; Kirchhoff, Katrin; Mandal, Arindam; Morgan, Nelson; Lei, Xin; Ng, Tim; Ostendorf, Mari; Sonmez, Mustafa (Kemal); Venkataraman, Anand; Vergyri, Dimitra; Wang, Wen; Zheng, Jing; Zhu, Qifeng.

In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 5, 09.2006, p. 1729-1742.

Stolcke, A, Chen, B, Franco, H, Gadde, VRR, Graciarena, M, Hwang, MY, Kirchhoff, K, Mandal, A, Morgan, N, Lei, X, Ng, T, Ostendorf, M, Sonmez, MK, Venkataraman, A, Vergyri, D, Wang, W, Zheng, J & Zhu, Q 2006, 'Recent innovations in speech-to-text transcription at SRI-ICSI-UW', IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1729-1742. https://doi.org/10.1109/TASL.2006.879807
Stolcke, Andreas ; Chen, Barry ; Franco, Horacio ; Gadde, Venkata Ramana Rao ; Graciarena, Martin ; Hwang, Mei Yuh ; Kirchhoff, Katrin ; Mandal, Arindam ; Morgan, Nelson ; Lei, Xin ; Ng, Tim ; Ostendorf, Mari ; Sonmez, Mustafa (Kemal) ; Venkataraman, Anand ; Vergyri, Dimitra ; Wang, Wen ; Zheng, Jing ; Zhu, Qifeng. / Recent innovations in speech-to-text transcription at SRI-ICSI-UW. In: IEEE Transactions on Audio, Speech and Language Processing. 2006 ; Vol. 14, No. 5. pp. 1729-1742.
@article{e9b73e5eb25b467e96930e6b4624532b,
title = "Recent innovations in speech-to-text transcription at SRI-ICSI-UW",
abstract = "We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.",
keywords = "Broadcast news (BN), Conversational telephone speech (CTS), Speech-to-text (STT)",
author = "Andreas Stolcke and Barry Chen and Horacio Franco and Gadde, {Venkata Ramana Rao} and Martin Graciarena and Hwang, {Mei Yuh} and Katrin Kirchhoff and Arindam Mandal and Nelson Morgan and Xin Lei and Tim Ng and Mari Ostendorf and Sonmez, {Mustafa (Kemal)} and Anand Venkataraman and Dimitra Vergyri and Wen Wang and Jing Zheng and Qifeng Zhu",
year = "2006",
month = "9",
doi = "10.1109/TASL.2006.879807",
language = "English (US)",
volume = "14",
pages = "1729--1742",
journal = "IEEE Transactions on Audio, Speech and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "5",

}

TY - JOUR

T1 - Recent innovations in speech-to-text transcription at SRI-ICSI-UW

AU - Stolcke, Andreas

AU - Chen, Barry

AU - Franco, Horacio

AU - Gadde, Venkata Ramana Rao

AU - Graciarena, Martin

AU - Hwang, Mei Yuh

AU - Kirchhoff, Katrin

AU - Mandal, Arindam

AU - Morgan, Nelson

AU - Lei, Xin

AU - Ng, Tim

AU - Ostendorf, Mari

AU - Sonmez, Mustafa (Kemal)

AU - Venkataraman, Anand

AU - Vergyri, Dimitra

AU - Wang, Wen

AU - Zheng, Jing

AU - Zhu, Qifeng

PY - 2006/9

Y1 - 2006/9

AB - We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.

KW - Broadcast news (BN)

KW - Conversational telephone speech (CTS)

KW - Speech-to-text (STT)

UR - http://www.scopus.com/inward/record.url?scp=34047270914&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34047270914&partnerID=8YFLogxK

U2 - 10.1109/TASL.2006.879807

DO - 10.1109/TASL.2006.879807

M3 - Article

AN - SCOPUS:34047270914

VL - 14

SP - 1729

EP - 1742

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

SN - 1558-7916

IS - 5

ER -