arxiv: 1907.00112 · v1 · pith:KKCS3BISnew · submitted 2019-06-28 · 💻 cs.CL · cs.LG · cs.SD · eess.AS

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice

Pith reviewed 2026-05-25 13:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SD · eess.AS
keywords expression detection · acoustic embeddings · paralinguistic features · voice queries · emotion recognition · equal error rate · digital assistants · vocal attributes

The pith

Acoustic and paralinguistic embeddings detect vocal expression in short queries with a 60 percent relative reduction in equal error rate over a bag-of-words baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether vocal expression in brief user queries to digital assistants can be identified from sound properties instead of relying only on transcribed words. It compares systems built on acoustic embeddings and emotion embeddings against a baseline that uses bag-of-words features from speech recognition. The embedding approach achieves a 60 percent relative reduction in equal error rate, showing that expression appears more in vocal attributes than in lexical content. Adding the emotion embedding produces a further 30 percent relative error drop over acoustic embeddings alone. This points to a practical way for voice systems to register how something is said, not merely what is said.
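
To make the limitation of the lexical baseline concrete, here is a minimal sketch of a bag-of-words feature extractor. The vocabulary, the transcripts, and the function name `bag_of_words` are hypothetical and are not taken from the paper; the point is only that two queries with the same transcript map to the same vector regardless of how they were spoken.

```python
from collections import Counter


def bag_of_words(transcript, vocabulary):
    """Map a transcript to a vector of word counts over a fixed vocabulary."""
    counts = Counter(transcript.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and transcripts, for illustration only.
vocabulary = ["call", "mom", "play", "music"]
calm_query = "call mom"      # imagine this spoken in a neutral tone
urgent_query = "Call mom"    # same words, imagine them spoken with urgency

calm_vec = bag_of_words(calm_query, vocabulary)
urgent_vec = bag_of_words(urgent_query, vocabulary)

# Both vectors are identical: the tone of voice never reaches the features.
print(calm_vec, urgent_vec)
```

Any classifier built on these vectors alone cannot separate the two queries, which is the gap the acoustic and emotion embeddings are meant to fill.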

Core claim

The work demonstrates that acoustic cues and paralinguistic embeddings enable reliable detection of vocal expression in short isolated utterances. The method yields a 60 percent relative equal error rate decrease compared with a bag-of-words system, supporting that expression is carried substantially by vocal attributes rather than lexical content. Incorporating emotion embeddings reduces error by an additional 30 percent relative to acoustic embeddings, confirming the contribution of emotion information to expressive voice.
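
Because both headline figures are relative reductions, a short worked example may help. The sketch below uses a hypothetical baseline EER of 0.40, since the absolute EER values are not given in the text above; the function names are likewise illustrative.

```python
def relative_reduction(baseline_eer, new_eer):
    """Fraction by which new_eer improves on baseline_eer (0.6 means 60 percent)."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return (baseline_eer - new_eer) / baseline_eer


def apply_relative_reduction(baseline_eer, reduction):
    """EER that results from a given relative reduction of baseline_eer."""
    return baseline_eer * (1.0 - reduction)


# Hypothetical baseline, chosen only to make the arithmetic visible.
bow_eer = 0.40
embedding_eer = apply_relative_reduction(bow_eer, 0.60)

print(f"embedding EER: {embedding_eer:.2f}")
print(f"relative reduction: {relative_reduction(bow_eer, embedding_eer):.0%}")
```

Under these hypothetical numbers, a 60 percent relative reduction moves the EER from 0.40 to 0.16, a drop of 24 percentage points rather than 60.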

What carries the argument

Acoustic and paralinguistic embeddings that encode vocal attributes and emotion beyond transcribed words.

If this is right

  • Digital assistants gain the ability to interpret user intent from how a query is voiced in addition to its transcribed words.
  • Expression in voice queries depends more on acoustic and paralinguistic properties than on lexical content alone.
  • Emotion embeddings provide measurable extra value over acoustic embeddings for expression detection.
  • The performance gain holds for the short, isolated utterances typical of voice-assistant interactions.
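
One way to picture how an emotion embedding could add value on top of an acoustic embedding is simple concatenation followed by a classifier. The sketch below is an assumption for illustration: the text above does not say how the paper fuses the two embeddings, and the vectors, weights, and function names are hand-written toy values rather than trained parameters.

```python
import math


def fuse_embeddings(acoustic_embedding, emotion_embedding):
    """Concatenate two utterance-level embeddings into one feature vector."""
    return list(acoustic_embedding) + list(emotion_embedding)


def expression_score(features, weights, bias=0.0):
    """Logistic score in (0, 1); higher means more likely expressive."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


# Toy values; real embeddings would come from trained models.
acoustic = [0.2, -0.1, 0.7]
emotion = [0.9, 0.4]  # e.g. arousal-like and valence-like dimensions
weights = [0.5, 0.0, 1.0, 1.5, -0.5]

fused = fuse_embeddings(acoustic, emotion)
score = expression_score(fused, weights)
print(len(fused), round(score, 3))
```

Concatenation is chosen here only because it is the simplest fusion to read; other schemes (score averaging, a learned fusion layer) would fit the same description.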

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Assistants could adjust their responses or tone according to detected expression to improve interaction quality.
  • The same embedding approach might extend to related tasks such as identifying emphasis or uncertainty in voice.
  • Larger and more varied training sets would be needed to confirm that the reported gains hold outside the original data.

Load-bearing premise

The expression labels used to train and test the system are accurate and the embeddings generalize to new short utterances without overfitting to dataset acoustics.

What would settle it

Running the same embedding system on a fresh collection of short utterances with independently verified expression labels and finding equal or higher error rates than the bag-of-words baseline.

Figures

Figures reproduced from arXiv: 1907.00112 by Anuj Mehta, Bridget Cheng, David Scott Farrar, Devang Naik, Erik Marchi, Ermine Teves, Sue Booker, Ute Dorothea Peitz, Vikramjit Mitra.

Figure 1
Figure 1: Distribution of grader agreement on an utterance being expressive: Yes [two or more selected "Yes"]; Mild-Yes [only one selected "Yes" and two or more are "Not Sure"]; Mild-No [two or more selected "Not Sure" and the rest selected "No"]; and No [two or more selected "No"].
Figure 2
Figure 2: Distribution of perceived Arousal for expressive versus not-expressive cases.
Figure 3
Figure 3: Distribution of perceived Valence for expressive versus not-expressive cases.
Figure 4
Figure 4: Mid-sagittal view of the vocal tract constriction variables (TVs) [18].
Figure 5
Figure 5: Correlation of the TVs with valence scores.
Figure 7
Figure 7: ROC curve from the random, BoW, AE and EE systems.
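
The agreement buckets described in the Figure 1 caption can be sketched as a small rule function. This is an illustrative reading of the caption, not the authors' code: the caption's rules can overlap (for example two "Yes" and two "No" votes), so the sketch assumes the rules are checked in caption order and the first match wins. The function name and the fallback label are hypothetical.

```python
from collections import Counter


def agreement_bucket(votes):
    """Assign grader votes to a Figure 1 bucket; first matching rule wins."""
    counts = Counter(votes)
    yes, no, unsure = counts["Yes"], counts["No"], counts["Not Sure"]
    if yes >= 2:
        return "Yes"
    if yes == 1 and unsure >= 2:
        return "Mild-Yes"
    if yes == 0 and unsure >= 2:
        return "Mild-No"
    if no >= 2:
        return "No"
    return "Unassigned"  # hypothetical fallback for patterns no rule covers


examples = [
    ["Yes", "Yes", "No", "Not Sure"],
    ["Yes", "Not Sure", "Not Sure", "No"],
    ["Not Sure", "Not Sure", "No", "No"],
    ["No", "No", "No", "Yes"],
]
buckets = [agreement_bucket(v) for v in examples]
print(buckets)
```

A different precedence order would move the overlapping vote patterns into other buckets, so the ordering is a choice that would need to be checked against the paper.
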
Original abstract

Millions of people reach out to digital assistants such as Siri every day, asking for information, making phone calls, seeking assistance, and much more. The expectation is that such assistants should understand the intent of the user's query. Detecting the intent of a query from a short, isolated utterance is a difficult task. Intent cannot always be obtained from speech-recognized transcriptions. A transcription driven approach can interpret what has been said but fails to acknowledge how it has been said, and as a consequence, may ignore the expression present in the voice. Our work investigates whether a system can reliably detect vocal expression in queries using acoustic and paralinguistic embedding. Results show that the proposed method offers a relative equal error rate (EER) decrease of 60% compared to a bag-of-word based system, corroborating that expression is significantly represented by vocal attributes, rather than being purely lexical. Addition of emotion embedding helped to reduce the EER by 30% relative to the acoustic embedding, demonstrating the relevance of emotion in expressive voice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes detecting vocal expression in short utterances to digital assistants via acoustic cues and paralinguistic embeddings rather than text alone. It reports a 60% relative EER reduction versus a bag-of-words baseline and a further 30% relative reduction when emotion embeddings are added, arguing that expression is substantially encoded in vocal attributes.

Significance. If the EER gains can be shown to arise from speaker- and content-independent vocal cues with reliable labels, the work would support incorporating paralinguistic embeddings into intent detection pipelines for voice interfaces.

major comments (3)
  1. [Abstract] Abstract: the 60% and 30% relative EER reductions are stated without any accompanying dataset description (size, label provenance, inter-annotator agreement), model architecture, training regime for the embeddings, or evaluation protocol (speaker-disjoint or content-controlled splits). These omissions make it impossible to determine whether the reported gains reflect genuine expression detection or confounding factors such as label noise correlation or acoustic artifacts.
  2. [Abstract] Abstract: no cross-validation procedure, statistical significance tests, or variance estimates are supplied for the EER figures, so the stability of the claimed improvements cannot be assessed.
  3. [Abstract] Abstract: the potential dependence between the paralinguistic embeddings and the evaluation data is not discussed; if the embeddings were trained on distributions overlapping the test utterances, the EER metric may not be independent of the training distribution.
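
For readers unfamiliar with the speaker-disjoint protocol the first comment asks about, here is a minimal sketch. It assumes each utterance carries a speaker identifier, which is a hypothetical field used only for illustration; the function names and the hashing scheme are likewise assumptions, not the paper's procedure.

```python
import hashlib


def assign_split(speaker_id, test_fraction=0.2):
    """Deterministically map a speaker to 'train' or 'test' by hashing the ID."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def speaker_disjoint_split(utterances, test_fraction=0.2):
    """Split utterances so that no speaker appears in both partitions."""
    train, test = [], []
    for utt in utterances:
        side = assign_split(utt["speaker"], test_fraction)
        (test if side == "test" else train).append(utt)
    return train, test


# Hypothetical utterances: three per speaker, ten speakers.
utterances = [
    {"speaker": f"spk{i:02d}", "utt_id": f"spk{i:02d}_{j}"}
    for i in range(10)
    for j in range(3)
]
train, test = speaker_disjoint_split(utterances)
train_speakers = {u["speaker"] for u in train}
test_speakers = {u["speaker"] for u in test}
print(len(train), len(test), train_speakers & test_speakers)
```

Hashing the speaker ID keeps the assignment stable across runs, so every utterance from a given speaker always lands on the same side of the split.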

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments focused on the abstract. We agree that the abstract should be expanded to include key experimental details so that the reported EER reductions can be properly contextualized. We will revise the abstract accordingly while keeping it concise. Below we respond point by point.

  1. Referee: [Abstract] Abstract: the 60% and 30% relative EER reductions are stated without any accompanying dataset description (size, label provenance, inter-annotator agreement), model architecture, training regime for the embeddings, or evaluation protocol (speaker-disjoint or content-controlled splits). These omissions make it impossible to determine whether the reported gains reflect genuine expression detection or confounding factors such as label noise correlation or acoustic artifacts.

    Authors: The full manuscript contains these details (dataset of short utterances with human-provided expression labels, acoustic feature extraction plus pre-trained paralinguistic embeddings, and speaker-disjoint evaluation splits). We acknowledge the abstract is too terse and will revise it to include a brief statement of dataset size, label source, speaker-disjoint protocol, and embedding origin so readers can assess independence from lexical content. revision: yes

  2. Referee: [Abstract] Abstract: no cross-validation procedure, statistical significance tests, or variance estimates are supplied for the EER figures, so the stability of the claimed improvements cannot be assessed.

    Authors: The results section of the manuscript reports EER on held-out data but does not include variance across folds or significance tests in the abstract. We will revise the abstract to note that results are averaged over speaker-disjoint folds and will add variance estimates or significance testing to the results section if not already present. revision: partial

  3. Referee: [Abstract] Abstract: the potential dependence between the paralinguistic embeddings and the evaluation data is not discussed; if the embeddings were trained on distributions overlapping the test utterances, the EER metric may not be independent of the training distribution.

    Authors: The embeddings are taken from models pre-trained on separate corpora; the evaluation utterances were not seen during embedding training. We will add an explicit statement in the revised abstract and methods section confirming the training/evaluation separation to address this concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical EER results benchmarked against external baseline

The paper reports experimental equal-error-rate reductions (60% relative to bag-of-words, 30% from adding emotion embedding) on held-out utterances. No equations, self-definitional mappings, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text. The central claim is a direct performance comparison to an independent lexical baseline and therefore does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields limited visibility; the central claim rests on standard ML embedding models whose hyperparameters are fitted to data and on the domain assumption that expression is acoustically separable from lexical content.

free parameters (2)
  • embedding model hyperparameters
    Neural embedding dimensions, learning rates, and layer sizes are fitted during training but unspecified.
  • EER decision threshold
    Equal error rate requires selection of an operating point on the ROC curve.
axioms (1)
  • domain assumption: Vocal expression is reliably labeled in the evaluation data and is captured by acoustic and emotion embeddings independent of lexical content
    This premise is invoked to interpret the EER reductions as evidence that expression is represented by vocal attributes.
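
The ledger above notes that EER depends on choosing an operating point. A minimal sketch of that selection follows, assuming higher scores mean "more expressive" and labels are 1 (expressive) or 0 (not expressive). The function name and the toy scores are illustrative and are not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) where false-accept and false-reject rates are closest."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute an EER")

    best = None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for threshold in sorted(set(scores)) + [max(scores) + 1.0]:
        # False accept: a negative scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False reject: a positive scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Toy scores with one positive and one negative on the wrong side.
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
print(round(eer, 3), threshold)
```

On finite data the two error rates may never be exactly equal, so the sketch reports their average at the threshold where they are closest; interpolating along the ROC curve is a common alternative.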

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. Introduction. One of the key challenges faced by voice operated assistants, such as Siri, is the interpretation of the intent of the user’s query. For example, an intelligent assistant may need to distinguish between a query for information on sports, a request to make a phone call, a command to play music, or many other supported actions. Existing systems...

  2. Data. We have collected approximately 100 hours of US English speech material and their associated automatically generated speech transcriptions. The data had neither any speaker level information nor any contextual information: every query was independent of every other. Hence the task we explored was: speaker independent and context-free. The collected s...

  3. A query’s vocal expression with respect to the type of intent, i.e., asking for a resource, an accidental trigger, or a prank or other humor attempt. The graders voted if the query expressed the intent clearly, by selecting one of the three possible options: "Yes", "No" and "Not sure".

  4. Perceived primitive emotion (Arousal and Valence) on a three-level Likert scale. After grading, the data was filtered to remove cases where all four graders were Not Sure in their decision, which resulted in 70 hours of data that was used in our experiments. The final grade for a query was an average of the individual grades by the graders (where the grades ...

  5. Data Analysis and metric. Data grading provided some interesting insights, where the graders agreed more on labeling a query as not-expressive than expressive. Figure 1 shows the distribution of their decisions. The expressive and not-expressive cases are those where two or more graders have agreed strongly toward that decision. When graders labeled primit...

  6. Acoustic Features. We investigated several acoustic features to parameterize speech. The baseline feature is the 20 dimensional mel-frequency cepstral coefficients (MFCCs). We explored gammatone cepstral coefficients (GCCs) and modulation features (modulation cepstral coefficients (NMCC) [13]), both of which consisted of 20 cepstral features. In addition, ...

  7. Acoustic Model. We used the graded data to train single-layer long short-term memory (LSTM) neural network based acoustic models, with 128 neurons in the recurrent and the embedding layers. [Figure 6: Embedding fusion for expression detection.] The models were tuned using a held-out dev set. The models were trained using cross-entropy loss, with a mini-batch ...

  8. Results. We investigated text-based models for the given task, where bag-of-words (BoW) features were used to train a multi-layered neural network (NN). We also used a random model that generated random outputs and the resulting scores are shown in Table

  9. Both the number of hidden layers and number of neurons in each layer were optimized given a held-out validation set. The BoW feature transforms were learned from the speech transcriptions of the 60-hour pre-training data, and the neural net model was trained using BoW features obtained from the 30-hour balanced data. Additionally, an MFCC feature base...

  10. Conclusions. In this work, we investigated how acoustic and emotion cues can be used to detect vocal expression in speech. We observed that (a) primitive emotion can help in determining vocal expression, (b) articulatory information can help in improving the valence detection, and (c) robust acoustic features can help in generating better embedding. We h...

  11. Acknowledgements. The authors would like to thank Russ Webb, Sachin Kajarekar and Alex Acero for their valuable comments and suggestions to improve the contents of this paper.

  12. Y.-N. Chen, D. Hakkani-Tür, and X. He, “Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6045–6049.

  13. Y.-N. Chen, W. Y. Wang, and A. I. Rudnicky, “Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 120–125.

  14. Y.-N. Chen, W. Y. Wang, and A. Rudnicky, “Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 619–629.

  15. Y. Wang, Y. Shen, and H. Jin, “A bi-model based RNN semantic frame parsing model for intent detection and slot filling,” arXiv preprint arXiv:1812.10235, 2018.

  16. B. Shen and D. Inkpen, “Speech intent recognition for robots,” Proceedings of Third International Conference on Mathematics and Computers in Sciences and in Industry, pp. 185–190, 2016.

  17. C. Costello, R. Lin, V. Mruthyunjaya, B. Bolla, and C. Jankowski, “Multi-layer ensembling techniques for multilingual intent classification,” arXiv preprint arXiv:1806.07914, 2018.

  18. R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets for natural language call-routing,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 5680–5683.

  19. S. Rallabandi, B. Karki, C. Viegas, E. Nyberg, and B. A.W., “Investigating utterance level representations for detecting intent from acoustics,” in Proceedings of Interspeech. ISCA, 2018, pp. 516–520.

  20. M. E. Larsen, N. Cummins, T. W. Boonstra, B. O’Dea, J. Tighe, J. Nicholas, F. Shand, J. Epps, and H. Christensen, “The use of technology in suicide prevention,” in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2015, pp. 7316–7319.

  21. V. Mitra, E. Shriberg, M. McLaren, A. Kathol, C. Richey, D. Vergyri, and M. Graciarena, “The SRI AVEC-2014 evaluation system,” in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 93–101.

  22. D. Vergyri, B. Knoth, E. Shriberg, V. Mitra, M. McLaren, L. Ferrer, P. Garcia, and C. Marmar, “Speech-based assessment of PTSD in a military population using diverse feature classes,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

  23. R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,” IEEE Transactions on Affective Computing, 2017.

  24. V. Mitra, G. Sivaraman, H. Nam, C. Espy-Wilson, E. Saltzman, and M. Tiede, “Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition,” Speech Communication, vol. 89, pp. 103–112, 2017.

  25. S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” in INTERSPEECH, 2017, pp. 1103–1107.

  26. C. Busso and T. Rahman, “Unveiling the acoustic properties that describe the valence dimension,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.

  27. Z. Yang and J. Hirschberg, “Predicting arousal and valence from waveforms and spectrograms using deep neural networks,” Proc. Interspeech 2018, pp. 3092–3096, 2018.

  28. V. Mitra, H. Nam, C. Y. Espy-Wilson, E. Saltzman, and L. Goldstein, “Retrieving tract variables from acoustics: a comparison of different machine learning strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 1027–1045, 2010.

  29. C. Espy-Wilson, G. Sivaraman, M. Tiede, V. Mitra, E. Saltzman, L. Goldstein, and H. Nam, “Modeling of articulatory gestures to control effects of production variability on speech technologies,” Rethinking Reduction, pp. 243–276, 2018.