Class-dependent score combination for speaker recognition

Luciana Ferrer; Kemal Sönmez; Sachin Kajarekar

Research output: Contribution to conference › Paper › peer-review

Abstract

Many recent performance improvements in speaker recognition using higher-level features, as demonstrated in the NIST Speaker Recognition Evaluation (SRE) task, rely on combinations of multiple systems modeling a large variety of features. The diversity of this feature set, which ranges from short-term acoustic spectrum features to habitual word usage and is drawn from a large set of speakers in a multitude of settings (acoustic environment, speaking style, quantities of enrollment/test data), makes model combination a challenging task. In this work, we present a class-dependent score combination technique that relies on clustering both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker's speech by automatic speech recognition (ASR). We show that significant performance gains are obtained by using the first few principal components of a model transform to cluster the speaker verification trials into classes of (target speaker, test utterance) pairs and then training a separate combiner for each class. We report results on the NIST SRE 2004 and FISHER datasets.
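The pipeline outlined in the abstract (project transform parameters onto principal components, assign each trial to a class, train one combiner per class) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. It assumes that each speaker-specific transform has already been flattened into a plain vector, that only the first principal component is used, that a class is simply the side of the mean on which the target and test projections fall, and that each combiner is a small logistic regression over the individual system scores. All function names and the trial layout are hypothetical.

```python
import math


def first_principal_component(rows, iters=50):
    """Estimate the mean and first principal direction by power iteration."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # w = X^T (X v), i.e. the scatter matrix applied to v
        proj = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def side(mean, pc, vec):
    """Return 1 if vec projects above the mean along pc, else 0."""
    return int(sum((x - m) * p for x, m, p in zip(vec, mean, pc)) > 0)


def trial_class(mean, pc, target_vec, test_vec):
    """Assumed class rule: one bit for the target model, one for the test utterance."""
    return (side(mean, pc, target_vec), side(mean, pc, test_vec))


def train_combiner(scores, labels, lr=0.1, epochs=200):
    """Logistic-regression combiner trained by stochastic gradient descent."""
    w = [0.0] * (len(scores[0]) + 1)  # w[0] is the bias term
    for _ in range(epochs):
        for x, y in zip(scores, labels):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            g = 1.0 / (1.0 + math.exp(-z)) - y
            w[0] -= lr * g
            for j, xj in enumerate(x):
                w[j + 1] -= lr * g * xj
    return w


def fuse(w, scores):
    """Combined score (log-odds) for one trial's system scores."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], scores))


def train_per_class(trials, mean, pc):
    """Group trials (target_vec, test_vec, scores, label) by class; train one combiner each."""
    grouped = {}
    for target_vec, test_vec, scores, label in trials:
        xs, ys = grouped.setdefault(trial_class(mean, pc, target_vec, test_vec), ([], []))
        xs.append(scores)
        ys.append(label)
    return {cls: train_combiner(xs, ys) for cls, (xs, ys) in grouped.items()}
```

At test time, a trial would be assigned a class with `trial_class` and scored with `fuse` using that class's weights. Using more principal components or a proper clustering step would change only the class-assignment function.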

Original language	English (US)
Pages	2173-2176
Number of pages	4
State	Published - 2005
Externally published	Yes
Event	9th European Conference on Speech Communication and Technology, Lisbon, Portugal (Sep 4 2005 → Sep 8 2005)

Other

Other	9th European Conference on Speech Communication and Technology
Country/Territory	Portugal
City	Lisbon
Period	9/4/05 → 9/8/05

ASJC Scopus subject areas

General Engineering

Cite this

@conference{deeef95575ca472ca77c2c580c77d4bc,
  title = "Class-dependent score combination for speaker recognition",
  abstract = "Many recent performance improvements in speaker recognition using higher-level features, as demonstrated in the NIST Speaker Recognition Evaluation (SRE) task, rely on combinations of multiple systems modeling a large variety of features. The diversity of the large set of features starting from short-term acoustic spectrum features all the way to habitual word usage from a large set of speakers in a multitude of settings (acoustic environment, speaking style, quantities of enrollment/test data) results in a challenging model combination task. In this work, we are presenting a class-dependent score combination technique that relies on clustering of both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker's speech by automatic speech recognition (ASR). We show that significant performance gains are obtained by using the first few principal components of a model transform for clustering the speaker verification trials into classes for (target speaker, test utterance) pairs, and then training a separate combiner for each class. We report results on the NIST SRE 2004 and FISHER datasets.",
  author = "Luciana Ferrer and Kemal S{\"o}nmez and Sachin Kajarekar",
  year = "2005",
  language = "English (US)",
  pages = "2173--2176",
  note = "9th European Conference on Speech Communication and Technology ; Conference date: 04-09-2005 Through 08-09-2005",
}

TY - CONF

T1 - Class-dependent score combination for speaker recognition

AU - Ferrer, Luciana

AU - Sönmez, Kemal

AU - Kajarekar, Sachin

PY - 2005

Y1 - 2005

N2 - Many recent performance improvements in speaker recognition using higher-level features, as demonstrated in the NIST Speaker Recognition Evaluation (SRE) task, rely on combinations of multiple systems modeling a large variety of features. The diversity of the large set of features starting from short-term acoustic spectrum features all the way to habitual word usage from a large set of speakers in a multitude of settings (acoustic environment, speaking style, quantities of enrollment/test data) results in a challenging model combination task. In this work, we are presenting a class-dependent score combination technique that relies on clustering of both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker's speech by automatic speech recognition (ASR). We show that significant performance gains are obtained by using the first few principal components of a model transform for clustering the speaker verification trials into classes for (target speaker, test utterance) pairs, and then training a separate combiner for each class. We report results on the NIST SRE 2004 and FISHER datasets.

AB - Many recent performance improvements in speaker recognition using higher-level features, as demonstrated in the NIST Speaker Recognition Evaluation (SRE) task, rely on combinations of multiple systems modeling a large variety of features. The diversity of the large set of features starting from short-term acoustic spectrum features all the way to habitual word usage from a large set of speakers in a multitude of settings (acoustic environment, speaking style, quantities of enrollment/test data) results in a challenging model combination task. In this work, we are presenting a class-dependent score combination technique that relies on clustering of both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker's speech by automatic speech recognition (ASR). We show that significant performance gains are obtained by using the first few principal components of a model transform for clustering the speaker verification trials into classes for (target speaker, test utterance) pairs, and then training a separate combiner for each class. We report results on the NIST SRE 2004 and FISHER datasets.

UR - http://www.scopus.com/inward/record.url?scp=33745193034&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745193034&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:33745193034

SP - 2173

EP - 2176

T2 - 9th European Conference on Speech Communication and Technology

Y2 - 4 September 2005 through 8 September 2005

ER -
