Robust detection of voiced segments in samples of everyday conversations using unsupervised HMMs

Meysam Asgari; Izhak Shafran; Alireza Bayestehtashk

doi:10.1109/SLT.2012.6424264

Institute on Development and Disability

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

We investigate methods for detecting voiced segments in everyday conversations from ambient recordings. Such recordings contain a high diversity of background noise, making it difficult or infeasible to collect representative labelled samples for estimating noise-specific HMMs. The popular utility get-f0 and its derivatives compute normalized cross-correlation for detecting voiced segments, which unfortunately is sensitive to different types of noise. Exploiting the fact that voiced speech is not just periodic but also rich in harmonics, we model voiced segments by adopting harmonic models, which have recently gained considerable attention. In previous work, the parameters of the model were estimated independently for each frame using a maximum likelihood criterion. However, since the distribution of harmonic coefficients depends on the articulators of speakers, we estimate the model parameters more robustly using a maximum a posteriori criterion. We use the likelihood of voicing, computed from the harmonic model, as the observation probability of an HMM and detect speech using this unsupervised HMM. One caveat of the harmonic model is that it fails to distinguish speech from other stationary harmonic noise. We rectify this weakness by taking advantage of the non-stationary property of speech. We evaluate our models empirically on the task of detecting speech in a large corpus of everyday speech and demonstrate that these models perform significantly better than the standard voice detection algorithm employed in popular tools.
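
The decoding step mentioned in the abstract, where a per-frame likelihood of voicing serves as the observation probability of an HMM, can be illustrated with a minimal sketch. This is not the authors' implementation: the two-state topology (unvoiced/voiced), the fixed self-transition probability `p_stay`, the uniform initial prior, and the function name `viterbi_voicing` are all assumptions made for illustration, and the voicing likelihoods are supplied directly as numbers rather than computed from a harmonic model. In the paper the HMM is unsupervised, whereas the parameters here are simply fixed by hand.

```python
import math


def viterbi_voicing(voiced_prob, p_stay=0.9, p_init_voiced=0.5):
    """Label each frame 0 (unvoiced) or 1 (voiced) with a two-state HMM.

    voiced_prob: per-frame likelihood of voicing, each value in (0, 1).
    p_stay: assumed probability of remaining in the same state (hypothetical).
    p_init_voiced: assumed prior probability that the first frame is voiced.
    """
    if not voiced_prob:
        return []

    eps = 1e-12
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    def emission(p):
        # Clamp to avoid log(0); index 0 = unvoiced, index 1 = voiced.
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    first = emission(voiced_prob[0])
    score = [
        math.log(1.0 - p_init_voiced) + first[0],
        math.log(p_init_voiced) + first[1],
    ]
    backpointers = []

    for p in voiced_prob[1:]:
        emit = emission(p)
        new_score, pointers = [], []
        for state in (0, 1):
            # Best predecessor for this state under the transition model.
            candidates = [
                score[prev] + (log_stay if prev == state else log_switch)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + emit[state])
        backpointers.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    likelihoods = [0.1, 0.1, 0.9, 0.4, 0.9, 0.9, 0.1, 0.1]
    print(viterbi_voicing(likelihoods))  # [0, 0, 1, 1, 1, 1, 0, 0]
```

In this toy input, a plain 0.5 threshold would mark the fourth frame (0.4) as unvoiced and split the voiced segment in two. The transition penalty keeps the segment contiguous, which is the kind of temporal smoothing an HMM adds on top of frame-level scores.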

Original language	English (US)
Title of host publication	2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Proceedings
Pages	438-442
Number of pages	5
DOIs	https://doi.org/10.1109/SLT.2012.6424264
State	Published - 2012
Event	2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Miami, FL, United States Duration: Dec 2 2012 → Dec 5 2012

Publication series

Name	2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Proceedings

Conference

Conference	2012 IEEE Workshop on Spoken Language Technology, SLT 2012
Country/Territory	United States
City	Miami, FL
Period	12/2/12 → 12/5/12

Keywords

life log
speech detection
voice detection

ASJC Scopus subject areas

Language and Linguistics
Linguistics and Language

Cite this

Asgari, M., Shafran, I., & Bayestehtashk, A. (2012). Robust detection of voiced segments in samples of everyday conversations using unsupervised HMMs. In 2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Proceedings (pp. 438-442). Article 6424264. https://doi.org/10.1109/SLT.2012.6424264

@inproceedings{6d9c9d502afb46e39c99d044f4820c5d,
  title = "Robust detection of voiced segments in samples of everyday conversations using unsupervised {HMMs}",
  keywords = "life log, speech detection, voice detection",
  author = "Meysam Asgari and Izhak Shafran and Alireza Bayestehtashk",
  year = "2012",
  doi = "10.1109/SLT.2012.6424264",
  language = "English (US)",
  isbn = "9781467351263",
  series = "2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Proceedings",
  pages = "438--442",
  booktitle = "2012 IEEE Workshop on Spoken Language Technology, SLT 2012 - Proceedings",
  note = "2012 IEEE Workshop on Spoken Language Technology, SLT 2012 ; Conference date: 02-12-2012 Through 05-12-2012",
}
