TY - JOUR
T1 - Recent innovations in speech-to-text transcription at SRI-ICSI-UW
AU - Stolcke, Andreas
AU - Chen, Barry
AU - Franco, Horacio
AU - Gadde, Venkata Ramana Rao
AU - Graciarena, Martin
AU - Hwang, Mei Yuh
AU - Kirchhoff, Katrin
AU - Mandal, Arindam
AU - Morgan, Nelson
AU - Lei, Xin
AU - Ng, Tim
AU - Ostendorf, Mari
AU - Sönmez, Kemal
AU - Venkataraman, Anand
AU - Vergyri, Dimitra
AU - Wang, Wen
AU - Zheng, Jing
AU - Zhu, Qifeng
N1 - Funding Information:
Manuscript received October 16, 2005; revised May 30, 2006. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract MDA972-02-C-0038 and Grant MDA972-02-1-0024 (approved for public release, distribution unlimited). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alex Acero.
PY - 2006/9
Y1 - 2006/9
N2 - We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
AB - We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
KW - Broadcast news (BN)
KW - Conversational telephone speech (CTS)
KW - Speech-to-text (STT)
UR - http://www.scopus.com/inward/record.url?scp=34047270914&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34047270914&partnerID=8YFLogxK
U2 - 10.1109/TASL.2006.879807
DO - 10.1109/TASL.2006.879807
M3 - Article
AN - SCOPUS:34047270914
SN - 1558-7916
VL - 14
SP - 1729
EP - 1742
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 5
ER -