Bag-of-Audio-Words based on Autoencoder Codebook for Continuous Emotion Prediction

Alessandro Lameiras Koerich; Mohammed Senoussaoui; Patrick Cardinal

arxiv: 1907.04928 · v1 · pith:SMJQBSIRnew · submitted 2019-07-06 · eess.AS · cs.LG · cs.SD · stat.ML

Mohammed Senoussaoui, Patrick Cardinal, Alessandro Lameiras Koerich

Pith reviewed 2026-05-25 01:24 UTC · model grok-4.3

classification eess.AS · cs.LG · cs.SD · stat.ML

keywords Bag-of-Words, autoencoder, emotion prediction, audio features, codebook, continuous emotion, histogram, AVEC 2017

The pith

An autoencoder can serve as both the dictionary and the assignment metric for Bag-of-Audio-Words representations, raising the concordance correlation coefficient (CCC) in continuous audio emotion prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an autoencoder whose middle layer dimension equals the target dictionary size can generate Bag-of-Audio-Words histograms by treating its neuron outputs as the assignment values. This replaces the usual steps of random or clustered dictionary building followed by a separate distance metric. On the AVEC 2017 audio dataset the resulting features improve concordance correlation coefficients for continuous arousal and valence prediction over a conventional Bag-of-Words baseline. A sympathetic reader would care because audio feature quality directly limits how accurately machines can track emotional state from sound alone.

Core claim

The authors state that an autoencoder whose encoded layer dimension matches the dictionary size simultaneously creates the codebook entries and supplies the assignment metric through its neuron outputs; the resulting histograms raise the concordance correlation coefficient from 0.225 to 0.322 for arousal and from 0.244 to 0.368 for valence on the AVEC 2017 audio data relative to a standard Bag-of-Words implementation.

What carries the argument

Autoencoder codebook whose bottleneck layer size sets dictionary size and whose neuron activations provide the assignment weights for histogram construction.
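
To make the mechanism concrete, here is a minimal NumPy sketch of how bottleneck activations could be pooled into a histogram-like vector. It is not the authors' implementation: the sigmoid activation, the mean pooling over frames, the random weights standing in for a trained encoder, and all names and sizes (`encode`, `ae_histogram`, `D`, `K`, `T`) are assumptions made for illustration.

```python
import numpy as np

def encode(frames, W, b):
    """Bottleneck activations per frame. The sigmoid is an assumed choice."""
    return 1.0 / (1.0 + np.exp(-(frames @ W + b)))

def ae_histogram(frames, W, b):
    """Pool per-frame bottleneck activations into one K-dimensional vector.

    Mean pooling over frames is an assumption; the source only says that
    the neuron outputs act as the assignment values.
    """
    activations = encode(frames, W, b)   # shape (T, K)
    return activations.mean(axis=0)      # shape (K,)

rng = np.random.default_rng(0)
D, K, T = 23, 8, 100                     # hypothetical feature dim, dictionary size, frame count
W = rng.normal(scale=0.1, size=(D, K))   # stand-in for trained encoder weights
b = np.zeros(K)
frames = rng.normal(size=(T, D))         # stand-in for low-level audio descriptors

hist = ae_histogram(frames, W, b)
print(hist.shape)
```

The point of the sketch is only that the bottleneck width `K` fixes the length of the output vector, so no separate codebook or distance computation appears anywhere in the pipeline.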

If this is right

The produced histograms improve concordance correlation for both arousal and valence dimensions.
Dictionary creation and feature assignment become a single learned step rather than two separate procedures.
The same architecture can be substituted into any pipeline that previously used conventional Bag-of-Audio-Words.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same autoencoder construction could be tried on other audio regression tasks such as speaker trait prediction or music mood tracking.
Training the autoencoder jointly with the final emotion regressor might remove the need for a separate downstream model.
If the neuron outputs already act as soft assignments, explicit normalization or thresholding steps could be removed without loss of performance.

Load-bearing premise

Autoencoder training yields a dictionary and an assignment function that outperform clustering on their own merits, so that the reported gain does not come from extra post-processing or tuning that the baseline did not also receive.

What would settle it

Re-running the exact same audio features through k-means clustering with an identical number of clusters and then measuring CCC on the AVEC 2017 arousal and valence tasks; if the clustered version matches or exceeds the autoencoder version the central claim is refuted.
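
The comparison described above needs a conventional codebook pipeline to run against. The following is a rough NumPy sketch of one, assuming a plain Lloyd-style k-means, Euclidean hard assignment, and count histograms normalized to sum to one. The function names, the iteration count, and the synthetic data are hypothetical, and this is not the AVEC 2017 baseline code.

```python
import numpy as np

def assign(frames, centers):
    """Index of the nearest center (squared Euclidean distance) for each frame."""
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(train_frames, k, n_iter, rng):
    """Plain Lloyd iterations, initialized from k randomly chosen training frames."""
    centers = train_frames[rng.choice(len(train_frames), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(train_frames, centers)
        for j in range(k):
            members = train_frames[labels == j]
            if len(members) > 0:         # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def boaw_histogram(frames, centers):
    """Hard-assignment histogram, normalized to sum to one."""
    counts = np.bincount(assign(frames, centers), minlength=len(centers))
    return counts / counts.sum()

rng = np.random.default_rng(0)
D, K = 23, 8                             # hypothetical feature dim and dictionary size
train_frames = rng.normal(size=(500, D)) # stand-in for training descriptors
test_frames = rng.normal(size=(200, D))  # stand-in for one recording segment

centers = kmeans(train_frames, K, n_iter=10, rng=rng)
hist = boaw_histogram(test_frames, centers)
print(hist.shape)
```

Feeding histograms from this pipeline and from the autoencoder variant into the same regressor, with the same `K`, is the matched comparison the paragraph above asks for.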

Figures

Figures reproduced from arXiv: 1907.04928 by Alessandro Lameiras Koerich, Mohammed Senoussaoui, Patrick Cardinal.

**Figure 2.** An overview of the proposed approach.

V. EXPERIMENTAL RESULTS For the evaluation of the proposed BoAW-NN approach, we have carried out several experiments on the dataset of the affect sub-challenge of AVEC 2017 challenge. The dataset used in this challenge is a subset of German subjects taken from the Sentiment Analysis in the Wild (SEWA) dataset. SEWA dataset consists of audiovisual recordings of spontan…

Original abstract

In this paper we present a novel approach for extracting a Bag-of-Words (BoW) representation based on a Neural Network codebook. The conventional BoW model is based on a dictionary (codebook) built from elementary representations which are selected randomly or by using a clustering algorithm on a training dataset. A metric is then used to assign unseen elementary representations to the closest dictionary entries in order to produce a histogram. In the proposed approach, an autoencoder (AE) encompasses the role of both the dictionary creation and the assignment metric. The dimension of the encoded layer of the AE corresponds to the size of the dictionary and the output of its neurons represents the assignment metric. Experimental results for the continuous emotion prediction task on the AVEC 2017 audio dataset have shown an improvement of the Concordance Correlation Coefficient (CCC) from 0.225 to 0.322 for arousal dimension and from 0.244 to 0.368 for valence dimension relative to the conventional BoW version implemented in a baseline system.
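
Since every reported number is a Concordance Correlation Coefficient, a small reference implementation may help when reading the results. This sketch uses the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (biased) variance and covariance. The function name `ccc` and the toy inputs are illustrative, and the challenge's official scoring script may differ in details such as how recordings are concatenated.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population covariance, matching np.var's default (ddof=0).
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

print(ccc([1, 2, 3], [1, 2, 3]))   # identical sequences
print(ccc([1, 2, 3], [2, 3, 4]))   # perfectly correlated but shifted by one
```

Unlike Pearson correlation, the shifted example scores below 1 (it evaluates to 4/7), because CCC penalizes a difference in means as well as a lack of correlation.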

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AE-based BoW lifts CCC on AVEC 2017 audio by treating bottleneck neurons as assignment scores, but the paper needs to confirm those scores form valid histograms without extra tuning the baseline lacked.

The core move here is to let an autoencoder handle both the codebook and the assignment step for Bag-of-Audio-Words. The encoded layer has K neurons where K is the dictionary size, and the raw neuron outputs are used directly as the assignment values to build the histogram. On the AVEC 2017 audio track this produces CCC gains from 0.225 to 0.322 on arousal and 0.244 to 0.368 on valence against their own conventional BoW baseline. That is the concrete result worth noting first.

Referee Report

2 major / 1 minor

Summary. The paper proposes replacing the conventional codebook construction (random selection or k-means clustering) and assignment step in Bag-of-Audio-Words with a single autoencoder whose bottleneck layer has dimension equal to the desired dictionary size K; the raw neuron activations in this layer are used directly as assignment scores to produce histograms. These histograms are then fed to a downstream regressor for continuous arousal and valence prediction. On the AVEC 2017 audio track the method is reported to raise CCC from 0.225 to 0.322 (arousal) and from 0.244 to 0.368 (valence) relative to a conventional BoW baseline.

Significance. If the reported CCC gains can be reproduced under matched experimental conditions, the work would demonstrate that a jointly learned dictionary-plus-assignment function can improve audio BoW representations for dimensional emotion recognition without requiring an explicit clustering stage.

major comments (2)

[Abstract / §3] Abstract and §3 (method description): the claim that the AE neuron outputs can be used directly as assignment scores to form valid BoW histograms is load-bearing, yet the manuscript provides no explicit statement that the activations are non-negative, sum to a constant, or undergo any normalization/thresholding; if raw sigmoid or linear outputs are used without such steps the resulting “histograms” are not comparable to the conventional BoW baseline.
[Abstract / Experimental results] Abstract and experimental section: no information is given on the train/validation/test splits used for AVEC 2017, whether the AE architecture, loss, optimizer, or learning-rate schedule were tuned on the same validation data employed by the baseline, or whether any statistical significance test accompanies the CCC deltas; these omissions prevent attribution of the 0.097 / 0.124 absolute CCC gains solely to the proposed joint dictionary+assignment construction.
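
The first major comment turns on whether the pooled activations are non-negative and sum to a constant. The sketch below shows one way such a property could be enforced, by applying a softmax to each frame's bottleneck outputs before averaging. This is purely an illustration of the referee's point: nothing in the reviewed text says the authors do this, and the names `softmax_rows` and `normalized_histogram` and the random activations are hypothetical.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def normalized_histogram(activations):
    """Average of per-frame soft assignments; non-negative and sums to one."""
    return softmax_rows(activations).mean(axis=0)

rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 8))      # stand-in for raw (possibly negative) bottleneck outputs
hist = normalized_histogram(raw)
print(hist.sum())
```

Because each row of the softmax output sums to one, their mean does too, which puts the vector on the same scale as a normalized hard-assignment histogram. Averaging the raw outputs instead gives no such guarantee.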

minor comments (1)

[§3] Notation for the encoded-layer dimension and the assignment function should be introduced with an equation rather than prose only.
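
One way the requested notation might look is sketched below. The symbols ($\mathbf{x}_t$ for frame $t$, $f_\theta$ for the encoder, $K$ for the dictionary size, $T$ for the number of frames, $\mathbf{c}_j$ for codebook entries) and the mean pooling over frames are assumptions chosen for illustration, not notation taken from the manuscript.

```latex
\begin{align}
\mathbf{a}_t &= f_\theta(\mathbf{x}_t) \in \mathbb{R}^{K}, \\
\mathbf{h}^{\text{AE}} &= \frac{1}{T}\sum_{t=1}^{T} \mathbf{a}_t, \\
h^{\text{BoW}}_k &= \frac{1}{T}\sum_{t=1}^{T}
  \mathbb{1}\!\left[\, k = \arg\min_{j}\, \lVert \mathbf{x}_t - \mathbf{c}_j \rVert_2 \,\right].
\end{align}
```

Written this way, the contrast is visible at a glance: the conventional histogram counts hard nearest-codeword assignments, while the autoencoder variant accumulates the encoder outputs themselves.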

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate where revisions will be made to the manuscript.

Referee: [Abstract / §3] Abstract and §3 (method description): the claim that the AE neuron outputs can be used directly as assignment scores to form valid BoW histograms is load-bearing, yet the manuscript provides no explicit statement that the activations are non-negative, sum to a constant, or undergo any normalization/thresholding; if raw sigmoid or linear outputs are used without such steps the resulting “histograms” are not comparable to the conventional BoW baseline.

Authors: We agree that the current manuscript lacks an explicit description of the activation function in the bottleneck layer and the precise procedure for converting neuron outputs into histograms. The method section will be revised to state that a linear activation is used in the bottleneck and that the raw outputs serve as soft-assignment weights (without forced non-negativity or normalization to sum to one). This formulation is presented as a learned alternative to hard assignment rather than a strict histogram; the revised text will clarify the distinction from conventional BoW while retaining the reported experimental comparison.
revision: yes

Referee: [Abstract / Experimental results] Abstract and experimental section: no information is given on the train/validation/test splits used for AVEC 2017, whether the AE architecture, loss, optimizer, or learning-rate schedule were tuned on the same validation data employed by the baseline, or whether any statistical significance test accompanies the CCC deltas; these omissions prevent attribution of the 0.097 / 0.124 absolute CCC gains solely to the proposed joint dictionary+assignment construction.

Authors: The manuscript will be expanded with an experimental-setup subsection that explicitly states the use of the standard AVEC 2017 training/development/test partitions, that the autoencoder was trained on the training partition with hyperparameters selected on the development partition (matching the baseline protocol), and that no statistical significance test was performed on the single-run CCC values. These additions will allow readers to evaluate the source of the observed gains; the reported deltas remain as measured under the described conditions.
revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparison

The paper proposes an autoencoder-based BoW representation and reports CCC improvements on AVEC 2017 via direct experimental comparison to a conventional BoW baseline. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on measured performance differences rather than any self-referential construction that would force the reported gains.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method relies on standard unsupervised neural-network training assumptions (gradient descent on reconstruction loss) and the unstated premise that the learned hidden-layer activations form a useful soft histogram. No new physical entities or ad-hoc constants are introduced.

free parameters (1)

autoencoder architecture and training hyperparameters
Layer widths, activation functions, learning rate, and number of epochs are chosen to produce the reported performance; these choices are not enumerated in the abstract.

axioms (1)

domain assumption: Gradient-based optimization of reconstruction loss yields a hidden representation usable as both codebook and assignment weights.
Invoked implicitly when the abstract states that the encoded layer dimension corresponds to dictionary size and neuron outputs represent the assignment metric.

pith-pipeline@v0.9.0 · 5726 in / 1365 out tokens · 28796 ms · 2026-05-25T01:24:10.954049+00:00 · methodology

