Siamese autoencoders for speech style extraction and switching applied to voice identification and conversion

Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

We propose an architecture called Siamese autoencoders for extracting and switching pre-determined styles of speech signals while retaining the content. We apply this architecture to a voice conversion task in which we define the content to be the linguistic message and the style to be the speaker's voice. We assume two or more data streams with the same content but unique styles. The architecture is composed of two or more separate but shared-weight autoencoders that are joined by loss functions at the hidden layers. A hidden vector is composed of style and content sub-vectors, and the loss functions constrain the encodings to decompose style and content. We can select an intended target speaker either by supplying the associated style vector or by extracting a new style vector from a new utterance, using a proposed style extraction algorithm. We focus on in-training speakers but perform some initial experiments for out-of-training speakers as well. We propose and study several types of loss functions. The experimental results show that the proposed many-to-many model is able to convert voices successfully; however, its performance does not surpass that of the state-of-the-art one-to-one model.
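The abstract compresses the whole method into a few sentences; the sketch below unpacks its structure in PyTorch. It is not the authors' implementation: the layer sizes, the tanh nonlinearities, the margin-based style-dissimilarity term, and the helper names (SiameseAutoencoder, siamese_loss, convert, extract_style) are all illustrative assumptions, and the paper itself proposes and compares several loss-function variants. What the sketch does reproduce from the abstract is the shared-weight autoencoder run on parallel data streams, the hidden vector split into style and content sub-vectors, hidden-layer losses that tie the two streams together, and style switching by swapping sub-vectors at decode time.

import torch
import torch.nn as nn

class SiameseAutoencoder(nn.Module):
    # One shared-weight autoencoder; the "Siamese" aspect comes from
    # running the SAME module on two parallel utterances and joining
    # their hidden codes with extra losses. Dimensions are illustrative.
    def __init__(self, feat_dim=40, style_dim=8, content_dim=56):
        super().__init__()
        self.style_dim = style_dim
        hidden_dim = style_dim + content_dim
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.Tanh(),
            nn.Linear(128, hidden_dim), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.Tanh(),
            nn.Linear(128, feat_dim),
        )

    def split(self, h):
        # hidden vector = [style | content] sub-vectors
        return h[:, :self.style_dim], h[:, self.style_dim:]

    def forward(self, x):
        h = self.encoder(x)
        style, content = self.split(h)
        return self.decoder(h), style, content

def siamese_loss(model, x_a, x_b, w_content=1.0, w_style=1.0):
    # x_a, x_b: frames with the same linguistic content spoken by two
    # different speakers, assumed time-aligned (e.g., by DTW).
    recon_a, style_a, content_a = model(x_a)
    recon_b, style_b, content_b = model(x_b)
    mse = nn.functional.mse_loss
    loss_recon = mse(recon_a, x_a) + mse(recon_b, x_b)
    # same content -> content sub-vectors should agree
    loss_content = mse(content_a, content_b)
    # different speakers -> style sub-vectors should differ; a hinge
    # with margin is ONE possible dissimilarity term (an assumption --
    # the paper studies several loss types)
    margin = 1.0
    loss_style = torch.relu(margin - mse(style_a, style_b))
    return loss_recon + w_content * loss_content + w_style * loss_style

def convert(model, x_src, style_tgt):
    # Style switching: encode source frames, swap in the target style
    # sub-vector, and decode.
    with torch.no_grad():
        h = model.encoder(x_src)
        _, content = model.split(h)
        style = style_tgt.expand(x_src.size(0), -1)
        return model.decoder(torch.cat([style, content], dim=1))

def extract_style(model, x_new):
    # A simple stand-in for the paper's style extraction algorithm:
    # average the style sub-vector over all frames of a new utterance.
    with torch.no_grad():
        style, _ = model.split(model.encoder(x_new))
        return style.mean(dim=0, keepdim=True)

At conversion time, a target speaker can then be selected either by supplying a stored style vector for an in-training speaker or by calling extract_style on a fresh utterance, mirroring the two selection modes the abstract describes.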

Original language: English (US)
Pages (from-to): 1293-1297
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOIs: 10.21437/Interspeech.2017-1434
State: Published - Jan 1 2017

Keywords

  • Siamese autoencoders
  • Style extraction
  • Style switching
  • Voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{0c7a8f73d00044e5a3923cedacc43571,
  title    = "Siamese autoencoders for speech style extraction and switching applied to voice identification and conversion",
  author   = "Mohammadi, {Seyed Hamidreza} and Alexander Kain",
  year     = "2017",
  month    = "1",
  day      = "1",
  doi      = "10.21437/Interspeech.2017-1434",
  language = "English (US)",
  volume   = "2017-August",
  pages    = "1293--1297",
  journal  = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  issn     = "2308-457X",
  keywords = "Siamese autoencoders, Style extraction, Style switching, Voice conversion",
}
