Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice

Anuj Mehta; Bridget Cheng; David Scott Farrar; Devang Naik; Erik Marchi; Ermine Teves; Sue Booker; Ute Dorothea Peitz; Vikramjit Mitra

arxiv: 1907.00112 · v1 · pith:KKCS3BISnew · submitted 2019-06-28 · 💻 cs.CL · cs.LG· cs.SD· eess.AS

Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice

Vikramjit Mitra , Sue Booker , Erik Marchi , David Scott Farrar , Ute Dorothea Peitz , Bridget Cheng , Ermine Teves , Anuj Mehta

show 1 more author

Devang Naik

This is my paper

Pith reviewed 2026-05-25 13:16 UTC · model grok-4.3

classification 💻 cs.CL cs.LGcs.SDeess.AS

keywords expression detectionacoustic embeddingsparalinguistic featuresvoice queriesemotion recognitionequal error ratedigital assistantsvocal attributes

0 comments

The pith

Acoustic and paralinguistic embeddings detect vocal expression in short queries with 60 percent lower error than word-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether vocal expression in brief user queries to digital assistants can be identified from sound properties instead of relying only on transcribed words. It compares systems built on acoustic embeddings and emotion embeddings against a baseline that uses bag-of-words features from speech recognition. The embedding approach achieves a 60 percent relative reduction in equal error rate, showing that expression appears more in vocal attributes than in lexical content. Adding the emotion embedding produces a further 30 percent relative error drop over acoustic embeddings alone. This points to a practical way for voice systems to register how something is said, not merely what is said.

Core claim

The work demonstrates that acoustic cues and paralinguistic embeddings enable reliable detection of vocal expression in short isolated utterances. The method yields a 60 percent relative equal error rate decrease compared with a bag-of-word system, supporting that expression is carried substantially by vocal attributes rather than lexical content. Incorporating emotion embeddings reduces error by an additional 30 percent relative to acoustic embeddings, confirming the contribution of emotion information to expressive voice.

What carries the argument

Acoustic and paralinguistic embeddings that encode vocal attributes and emotion beyond transcribed words.

If this is right

Digital assistants gain the ability to interpret user intent from how a query is voiced in addition to its transcribed words.
Expression in voice queries depends more on acoustic and paralinguistic properties than on lexical content alone.
Emotion embeddings provide measurable extra value over acoustic embeddings for expression detection.
The performance gain holds for the short, isolated utterances typical of voice-assistant interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Assistants could adjust their responses or tone according to detected expression to improve interaction quality.
The same embedding approach might extend to related tasks such as identifying emphasis or uncertainty in voice.
Larger and more varied training sets would be needed to confirm that the reported gains hold outside the original data.

Load-bearing premise

The expression labels used to train and test the system are accurate and the embeddings generalize to new short utterances without overfitting to dataset acoustics.

What would settle it

Running the same embedding system on a fresh collection of short utterances with independently verified expression labels and finding equal or higher error rates than the bag-of-words baseline.

Figures

Figures reproduced from arXiv: 1907.00112 by Anuj Mehta, Bridget Cheng, David Scott Farrar, Devang Naik, Erik Marchi, Ermine Teves, Sue Booker, Ute Dorothea Peitz, Vikramjit Mitra.

**Figure 1.** Figure 1: Distribution of grader agreement on an utterance being expressive: Yes [two or more selected ”Yes”]; Mild-Yes [only one selected ”Yes” and two or more are ”Not Sure”], Mild-No [two or more selected ”Not Sure” and the rest selected ”No”] and No [two or more selected ”No”] [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Distribution of perceived Arousal for expressive versus not-expressive cases [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of perceived Valence for expressive versus not-expressive cases. 4. Acoustic Features We investigated several acoustic features to parameterize speech. The baseline feature is the 20 dimensional melfrequency cepstral coefficients (MFCCs). We explored gammatone cepstral coefficients (GCCs) and modulation features (modulation cepstral coefficients (NMCC) [13]), both of which consisted of 20 c… view at source ↗

**Figure 4.** Figure 4: Mid-sagittal view of the vocal tract constriction variables (TVs) [18] [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: Correlation of the TVs with valence scores. sional cepstral features (MFCC, GFCC and NMCC) to generate 23 dimensional features (MFCC+F0-V, GFCC+F0-V and NMCC+F0-V). We investigated articulatory features in the form of vocal tract constriction variables (TVs) as detailed in [13]. Detecting valence from speech has been relatively difficult compared to arousal and dominance [14]. Visual and lexical features … view at source ↗

**Figure 7.** Figure 7: ROC curve from the random, BoW, AE and EE systems. 7. Conclusions In this work, we investigated how acoustic and emotion cues can be used to detect vocal expression in speech. We observed that (a) primitive emotion can help in determining vocal expression, (b) articulatory information can help in improving the valence detection, and (c) robust acoustic features can help in generating better embedding. We… view at source ↗

read the original abstract

Millions of people reach out to digital assistants such as Siri every day, asking for information, making phone calls, seeking assistance, and much more. The expectation is that such assistants should understand the intent of the users query. Detecting the intent of a query from a short, isolated utterance is a difficult task. Intent cannot always be obtained from speech-recognized transcriptions. A transcription driven approach can interpret what has been said but fails to acknowledge how it has been said, and as a consequence, may ignore the expression present in the voice. Our work investigates whether a system can reliably detect vocal expression in queries using acoustic and paralinguistic embedding. Results show that the proposed method offers a relative equal error rate (EER) decrease of 60% compared to a bag-of-word based system, corroborating that expression is significantly represented by vocal attributes, rather than being purely lexical. Addition of emotion embedding helped to reduce the EER by 30% relative to the acoustic embedding, demonstrating the relevance of emotion in expressive voice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract claims 60% and 30% relative EER drops from acoustic and emotion embeddings but gives zero dataset, label, or evaluation details, so the numbers cannot be assessed.

read the letter

The abstract reports that acoustic plus paralinguistic embeddings cut EER by 60% relative to bag-of-words for expression detection in short voice queries, with emotion embeddings adding another 30% relative gain. That is the only quantitative result on offer. What is new is the direct application of existing embedding techniques to the narrow task of spotting expression in assistant queries rather than broad emotion classification. The paper does a clear job stating that lexical content alone misses vocal cues and that expression matters for intent. The soft spots are large and central. No dataset size, label source, inter-annotator agreement, train/test split, speaker or content controls, embedding training regime, or significance test appears in the text. Without those, the reported gains could reflect label noise, speaker leakage, or dataset artifacts instead of genuine expression signals. The stress-test point on label accuracy and independent evaluation holds because nothing in the provided text addresses it. This work is aimed at applied speech engineers building voice interfaces who might want a quick signal that paralinguistic features help. A reader could take the high-level idea away, but the lack of methods makes it unusable for replication or extension. I would not bring it to a reading group, would not cite it, and would not send it to peer review until the experimental setup is actually described.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes detecting vocal expression in short utterances to digital assistants via acoustic cues and paralinguistic embeddings rather than text alone. It reports a 60% relative EER reduction versus a bag-of-words baseline and a further 30% relative reduction when emotion embeddings are added, arguing that expression is substantially encoded in vocal attributes.

Significance. If the EER gains can be shown to arise from speaker- and content-independent vocal cues with reliable labels, the work would support incorporating paralinguistic embeddings into intent detection pipelines for voice interfaces.

major comments (3)

[Abstract] Abstract: the 60% and 30% relative EER reductions are stated without any accompanying dataset description (size, label provenance, inter-annotator agreement), model architecture, training regime for the embeddings, or evaluation protocol (speaker-disjoint or content-controlled splits). These omissions make it impossible to determine whether the reported gains reflect genuine expression detection or confounding factors such as label noise correlation or acoustic artifacts.
[Abstract] Abstract: no cross-validation procedure, statistical significance tests, or variance estimates are supplied for the EER figures, so the stability of the claimed improvements cannot be assessed.
[Abstract] Abstract: the potential dependence between the paralinguistic embeddings and the evaluation data is not discussed; if the embeddings were trained on distributions overlapping the test utterances, the EER metric may not be independent of the training distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments focused on the abstract. We agree that the abstract should be expanded to include key experimental details so that the reported EER reductions can be properly contextualized. We will revise the abstract accordingly while keeping it concise. Below we respond point by point.

read point-by-point responses

Referee: [Abstract] Abstract: the 60% and 30% relative EER reductions are stated without any accompanying dataset description (size, label provenance, inter-annotator agreement), model architecture, training regime for the embeddings, or evaluation protocol (speaker-disjoint or content-controlled splits). These omissions make it impossible to determine whether the reported gains reflect genuine expression detection or confounding factors such as label noise correlation or acoustic artifacts.

Authors: The full manuscript contains these details (dataset of short utterances with human-provided expression labels, acoustic feature extraction plus pre-trained paralinguistic embeddings, and speaker-disjoint evaluation splits). We acknowledge the abstract is too terse and will revise it to include a brief statement of dataset size, label source, speaker-disjoint protocol, and embedding origin so readers can assess independence from lexical content. revision: yes
Referee: [Abstract] Abstract: no cross-validation procedure, statistical significance tests, or variance estimates are supplied for the EER figures, so the stability of the claimed improvements cannot be assessed.

Authors: The results section of the manuscript reports EER on held-out data but does not include variance across folds or significance tests in the abstract. We will revise the abstract to note that results are averaged over speaker-disjoint folds and will add variance estimates or significance testing to the results section if not already present. revision: partial
Referee: [Abstract] Abstract: the potential dependence between the paralinguistic embeddings and the evaluation data is not discussed; if the embeddings were trained on distributions overlapping the test utterances, the EER metric may not be independent of the training distribution.

Authors: The embeddings are taken from models pre-trained on separate corpora; the evaluation utterances were not seen during embedding training. We will add an explicit statement in the revised abstract and methods section confirming the training/evaluation separation to address this concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical EER results benchmarked against external baseline

full rationale

The paper reports experimental equal-error-rate reductions (60% relative to bag-of-words, 30% from adding emotion embedding) on held-out utterances. No equations, self-definitional mappings, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text. The central claim is a direct performance comparison to an independent lexical baseline and therefore does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility; the central claim rests on standard ML embedding models whose hyperparameters are fitted to data and on the domain assumption that expression is acoustically separable from lexical content.

free parameters (2)

embedding model hyperparameters
Neural embedding dimensions, learning rates, and layer sizes are fitted during training but unspecified.
EER decision threshold
Equal error rate requires selection of an operating point on the ROC curve.

axioms (1)

domain assumption Vocal expression is reliably labeled in the evaluation data and is captured by acoustic and emotion embeddings independent of lexical content
This premise is invoked to interpret the EER reductions as evidence that expression is represented by vocal attributes.

pith-pipeline@v0.9.0 · 5752 in / 1253 out tokens · 45519 ms · 2026-05-25T13:16:48.697180+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1]

Introduction One of the key challenges faced by voice operated assistants, such as Siri, is the interpretation of the intent of the user’s query. For example, an intelligent assistant may need to distinguish between a query for information on sports, a request to make a phone call, a command to play music, or many other supported actions. Existing systems...

work page
[2]

The data had neither any speaker level information nor any contextual information: every query was independent of every other

Data We have collected approximately 100 hours of US English speech material and their associated automatically generated speech transcriptions. The data had neither any speaker level information nor any contextual information: every query was independent of every other. Hence the task we explored was: speaker independent and context-free. The collected s...

work page
[3]

The graders voted if the query expressed the intent clearly, by selecting one of the three possible options: ”Y es”, ”No” and ”Not sure”

A query’svocal expressionwith respect to the type of intent, i.e., asking for a resource, an accidental trigger, or a prank or other humor attempt. The graders voted if the query expressed the intent clearly, by selecting one of the three possible options: ”Y es”, ”No” and ”Not sure”

work page
[4]

After grading, the data was ﬁltered to remove cases where all four graders were Not Sure in their decision, which resulted in 70 hours of data that was used in our experiments

Perceivedprimitive emotion(Arousal and V alence) on a three-level Likert scale. After grading, the data was ﬁltered to remove cases where all four graders were Not Sure in their decision, which resulted in 70 hours of data that was used in our experiments. The ﬁnal grade for a query was an average of the individual grades by the graders (where the grades ...

work page
[5]

Figure 1 shows the distribution of their decisions

Data Analysis and metric Data grading provided some interesting insights, where the graders agreed more on labeling a query as not-expressive than expressive. Figure 1 shows the distribution of their decisions. The expressive and not-expressive cases are those where two or more graders have agreed strongly toward that decision. When graders labeled primit...

work page
[6]

The baseline feature is the 20 dimensional mel- frequency cepstral coefﬁcients (MFCCs)

Acoustic Features We investigated several acoustic features to parameterize speech. The baseline feature is the 20 dimensional mel- frequency cepstral coefﬁcients (MFCCs). We explored gam- matone cepstral coefﬁcients (GCCs) and modulation features (modulation cepstral coefﬁcients (NMCC) [13]), both of which consisted of 20 cepstral features. In addition, ...

work page
[7]

128 neurons in the recurrent and the embedding layers

Acoustic Model We used the graded data to train single-layer long-short term memory (LSTM) neural network based acoustic models, with Figure 6:Embedding fusion for expression detection. 128 neurons in the recurrent and the embedding layers. The models were tuned using a held-out dev set. The models were trained using cross-entropy loss, with a mini-batch ...

work page
[8]

We also used a random model that gener- ated random outputs and the resulting scores are shown in Table

Results We investigated text-based models for the given task, where bag-of-words (BoW) features were used to train a multi-layered neural network (NN). We also used a random model that gener- ated random outputs and the resulting scores are shown in Table

work page
[9]

Both the number of hidden layers and number of neurons in each layer were optimized given a held-out validation set. The BoW feature transforms were learned from the speech transcrip- tions of the 60-hour pre-training data, and the neural net model was trained using BoW features obtained from the 30-hour bal- anced data. Additionally, an MFCC feature base...

work page
[10]

Conclusions In this work, we investigated how acoustic and emotion cues can be used to detect vocal expression in speech. We observed that (a) primitive emotion can help in determining vocal expression, (b) articulatory information can help in improving the valence detection, and (c) robust acoustic features can help in gener- ating better embedding. We h...

work page
[11]

Acknowledgements The authors would like to thank Russ Webb, Sachin Kajarekar and Alex Acero for their valuable comments and suggestions to improve the contents of this paper

work page
[12]

Zero-shot learning of intent embeddings for expansion by convolutional deep struc- tured semantic models,

Y .-N. Chen, D. Hakkani-T¨ur, and X. He, “Zero-shot learning of intent embeddings for expansion by convolutional deep struc- tured semantic models,” in2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6045–6049

work page 2016
[13]

Unsupervised in- duction and ﬁlling of semantic slots for spoken dialogue systems using frame-semantic parsing,

Y .-N. Chen, W. Y . Wang, and A. I. Rudnicky, “Unsupervised in- duction and ﬁlling of semantic slots for spoken dialogue systems using frame-semantic parsing,” in2013 IEEE Workshop on Auto- matic Speech Recognition and Understanding. IEEE, 2013, pp. 120–125

work page 2013
[14]

Jointly modeling inter-slot relations by random walk on knowledge graphs for un- supervised spoken language understanding,

Y .-N. Chen, W. Y . Wang, and A. Rudnicky, “Jointly modeling inter-slot relations by random walk on knowledge graphs for un- supervised spoken language understanding,” inProceedings of the 2015 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technolo- gies, 2015, pp. 619–629

work page 2015
[15]

A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling

Y . Wang, Y . Shen, and H. Jin, “A bi-model based rnn semantic frame parsing model for intent detection and slot ﬁlling,”arXiv preprint arXiv:1812.10235, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Speech intent recognition for robots,

B. Shen and D. Inkpen, “Speech intent recognition for robots,” Proceedings of Third International Conference on Mathematics and Computers in Sciences and in Industry, pp. 185–190, 2016

work page 2016
[17]

Multi-Layer Ensembling Techniques for Multilingual Intent Classification

C. Costello, R. Lin, V . Mruthyunjaya, B. Bolla, and C. Jankowski, “Multi-layer ensembling techniques for multilingual intent classi- ﬁcation,”arXiv preprint arXiv:1806.07914, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

Deep belief nets for natural language call-routing,

R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets for natural language call-routing,” in2011 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2011, pp. 5680–5683

work page 2011
[19]

Investigating utterance level representations for detecting intent from acoustics,

S. Rallabandi, B. Karki, C. Viegas, E. Nyberg, and B. A.W., “Investigating utterance level representations for detecting intent from acoustics,” inProceedings of Interspeech. ISCA, 2018, pp. 516–520

work page 2018
[20]

The use of technology in suicide prevention,

M. E. Larsen, N. Cummins, T. W. Boonstra, B. O’Dea, J. Tighe, J. Nicholas, F. Shand, J. Epps, and H. Christensen, “The use of technology in suicide prevention,” in2015 37th annual interna- tional conference of the IEEE engineering in Medicine and biol- ogy society (EMBC). IEEE, 2015, pp. 7316–7319

work page 2015
[21]

The sri avec-2014 evaluation system,

V . Mitra, E. Shriberg, M. McLaren, A. Kathol, C. Richey, D. V er- gyri, and M. Graciarena, “The sri avec-2014 evaluation system,” inProceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 93–101

work page 2014
[22]

Speech-based assessment of ptsd in a military population using diverse feature classes,

D. V ergyri, B. Knoth, E. Shriberg, V . Mitra, M. McLaren, L. Fer- rer, P . Garcia, and C. Marmar, “Speech-based assessment of ptsd in a military population using diverse feature classes,” inSixteenth annual conference of the international speech communication as- sociation, 2015

work page 2015
[23]

Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,

R. Lotﬁan and C. Busso, “Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, 2017

work page 2017
[24]

Hybrid convolutional neural networks for articu- latory and acoustic information based speech recognition,

V . Mitra, G. Sivaraman, H. Nam, C. Espy-Wilson, E. Saltzman, and M. Tiede, “Hybrid convolutional neural networks for articu- latory and acoustic information based speech recognition,”Speech Communication, vol. 89, pp. 103–112, 2017

work page 2017
[25]

Jointly predicting arousal, valence and dominance with multi-task learning

S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning.” inINTER- SPEECH, 2017, pp. 1103–1107

work page 2017
[26]

Unveiling the acoustic properties that describe the valence dimension,

C. Busso and T. Rahman, “Unveiling the acoustic properties that describe the valence dimension,” inThirteenth Annual Conference of the International Speech Communication Association, 2012

work page 2012
[27]

Predicting arousal and valence from waveforms and spectrograms using deep neural networks,

Z. Y ang and J. Hirschberg, “Predicting arousal and valence from waveforms and spectrograms using deep neural networks,”Proc. Interspeech 2018, pp. 3092–3096, 2018

work page 2018
[28]

Retrieving tract variables from acoustics: a comparison of different machine learning strategies,

V . Mitra, H. Nam, C. Y . Espy-Wilson, E. Saltzman, and L. Gold- stein, “Retrieving tract variables from acoustics: a comparison of different machine learning strategies,”IEEE journal of selected topics in signal processing, vol. 4, no. 6, pp. 1027–1045, 2010

work page 2010
[29]

Modeling of articulatory gestures to control effects of production variability on speech technologies,

C. Espy-Wilson, G. Sivaraman, M. Tiede, V . Mitra, E. Saltzmann, L. Goldstein, and H. Nam, “Modeling of articulatory gestures to control effects of production variability on speech technologies,” Rethinking Reduction, pp. 243–276, 2018

work page 2018

[1] [1]

Introduction One of the key challenges faced by voice operated assistants, such as Siri, is the interpretation of the intent of the user’s query. For example, an intelligent assistant may need to distinguish between a query for information on sports, a request to make a phone call, a command to play music, or many other supported actions. Existing systems...

work page

[2] [2]

The data had neither any speaker level information nor any contextual information: every query was independent of every other

Data We have collected approximately 100 hours of US English speech material and their associated automatically generated speech transcriptions. The data had neither any speaker level information nor any contextual information: every query was independent of every other. Hence the task we explored was: speaker independent and context-free. The collected s...

work page

[3] [3]

The graders voted if the query expressed the intent clearly, by selecting one of the three possible options: ”Y es”, ”No” and ”Not sure”

A query’svocal expressionwith respect to the type of intent, i.e., asking for a resource, an accidental trigger, or a prank or other humor attempt. The graders voted if the query expressed the intent clearly, by selecting one of the three possible options: ”Y es”, ”No” and ”Not sure”

work page

[4] [4]

After grading, the data was ﬁltered to remove cases where all four graders were Not Sure in their decision, which resulted in 70 hours of data that was used in our experiments

Perceivedprimitive emotion(Arousal and V alence) on a three-level Likert scale. After grading, the data was ﬁltered to remove cases where all four graders were Not Sure in their decision, which resulted in 70 hours of data that was used in our experiments. The ﬁnal grade for a query was an average of the individual grades by the graders (where the grades ...

work page

[5] [5]

Figure 1 shows the distribution of their decisions

Data Analysis and metric Data grading provided some interesting insights, where the graders agreed more on labeling a query as not-expressive than expressive. Figure 1 shows the distribution of their decisions. The expressive and not-expressive cases are those where two or more graders have agreed strongly toward that decision. When graders labeled primit...

work page

[6] [6]

The baseline feature is the 20 dimensional mel- frequency cepstral coefﬁcients (MFCCs)

Acoustic Features We investigated several acoustic features to parameterize speech. The baseline feature is the 20 dimensional mel- frequency cepstral coefﬁcients (MFCCs). We explored gam- matone cepstral coefﬁcients (GCCs) and modulation features (modulation cepstral coefﬁcients (NMCC) [13]), both of which consisted of 20 cepstral features. In addition, ...

work page

[7] [7]

128 neurons in the recurrent and the embedding layers

Acoustic Model We used the graded data to train single-layer long-short term memory (LSTM) neural network based acoustic models, with Figure 6:Embedding fusion for expression detection. 128 neurons in the recurrent and the embedding layers. The models were tuned using a held-out dev set. The models were trained using cross-entropy loss, with a mini-batch ...

work page

[8] [8]

We also used a random model that gener- ated random outputs and the resulting scores are shown in Table

Results We investigated text-based models for the given task, where bag-of-words (BoW) features were used to train a multi-layered neural network (NN). We also used a random model that gener- ated random outputs and the resulting scores are shown in Table

work page

[9] [9]

Both the number of hidden layers and number of neurons in each layer were optimized given a held-out validation set. The BoW feature transforms were learned from the speech transcrip- tions of the 60-hour pre-training data, and the neural net model was trained using BoW features obtained from the 30-hour bal- anced data. Additionally, an MFCC feature base...

work page

[10] [10]

Conclusions In this work, we investigated how acoustic and emotion cues can be used to detect vocal expression in speech. We observed that (a) primitive emotion can help in determining vocal expression, (b) articulatory information can help in improving the valence detection, and (c) robust acoustic features can help in gener- ating better embedding. We h...

work page

[11] [11]

Acknowledgements The authors would like to thank Russ Webb, Sachin Kajarekar and Alex Acero for their valuable comments and suggestions to improve the contents of this paper

work page

[12] [12]

Zero-shot learning of intent embeddings for expansion by convolutional deep struc- tured semantic models,

Y .-N. Chen, D. Hakkani-T¨ur, and X. He, “Zero-shot learning of intent embeddings for expansion by convolutional deep struc- tured semantic models,” in2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6045–6049

work page 2016

[13] [13]

Unsupervised in- duction and ﬁlling of semantic slots for spoken dialogue systems using frame-semantic parsing,

Y .-N. Chen, W. Y . Wang, and A. I. Rudnicky, “Unsupervised in- duction and ﬁlling of semantic slots for spoken dialogue systems using frame-semantic parsing,” in2013 IEEE Workshop on Auto- matic Speech Recognition and Understanding. IEEE, 2013, pp. 120–125

work page 2013

[14] [14]

Jointly modeling inter-slot relations by random walk on knowledge graphs for un- supervised spoken language understanding,

Y .-N. Chen, W. Y . Wang, and A. Rudnicky, “Jointly modeling inter-slot relations by random walk on knowledge graphs for un- supervised spoken language understanding,” inProceedings of the 2015 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technolo- gies, 2015, pp. 619–629

work page 2015

[15] [15]

A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling

Y . Wang, Y . Shen, and H. Jin, “A bi-model based rnn semantic frame parsing model for intent detection and slot ﬁlling,”arXiv preprint arXiv:1812.10235, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Speech intent recognition for robots,

B. Shen and D. Inkpen, “Speech intent recognition for robots,” Proceedings of Third International Conference on Mathematics and Computers in Sciences and in Industry, pp. 185–190, 2016

work page 2016

[17] [17]

Multi-Layer Ensembling Techniques for Multilingual Intent Classification

C. Costello, R. Lin, V . Mruthyunjaya, B. Bolla, and C. Jankowski, “Multi-layer ensembling techniques for multilingual intent classi- ﬁcation,”arXiv preprint arXiv:1806.07914, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [18]

Deep belief nets for natural language call-routing,

R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets for natural language call-routing,” in2011 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 2011, pp. 5680–5683

work page 2011

[19] [19]

Investigating utterance level representations for detecting intent from acoustics,

S. Rallabandi, B. Karki, C. Viegas, E. Nyberg, and B. A.W., “Investigating utterance level representations for detecting intent from acoustics,” inProceedings of Interspeech. ISCA, 2018, pp. 516–520

work page 2018

[20] [20]

The use of technology in suicide prevention,

M. E. Larsen, N. Cummins, T. W. Boonstra, B. O’Dea, J. Tighe, J. Nicholas, F. Shand, J. Epps, and H. Christensen, “The use of technology in suicide prevention,” in2015 37th annual interna- tional conference of the IEEE engineering in Medicine and biol- ogy society (EMBC). IEEE, 2015, pp. 7316–7319

work page 2015

[21] [21]

The sri avec-2014 evaluation system,

V . Mitra, E. Shriberg, M. McLaren, A. Kathol, C. Richey, D. V er- gyri, and M. Graciarena, “The sri avec-2014 evaluation system,” inProceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 93–101

work page 2014

[22] [22]

Speech-based assessment of ptsd in a military population using diverse feature classes,

D. V ergyri, B. Knoth, E. Shriberg, V . Mitra, M. McLaren, L. Fer- rer, P . Garcia, and C. Marmar, “Speech-based assessment of ptsd in a military population using diverse feature classes,” inSixteenth annual conference of the international speech communication as- sociation, 2015

work page 2015

[23] [23]

Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,

R. Lotﬁan and C. Busso, “Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, 2017

work page 2017

[24] [24]

Hybrid convolutional neural networks for articu- latory and acoustic information based speech recognition,

V . Mitra, G. Sivaraman, H. Nam, C. Espy-Wilson, E. Saltzman, and M. Tiede, “Hybrid convolutional neural networks for articu- latory and acoustic information based speech recognition,”Speech Communication, vol. 89, pp. 103–112, 2017

work page 2017

[25] [25]

Jointly predicting arousal, valence and dominance with multi-task learning

S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning.” inINTER- SPEECH, 2017, pp. 1103–1107

work page 2017

[26] [26]

Unveiling the acoustic properties that describe the valence dimension,

C. Busso and T. Rahman, “Unveiling the acoustic properties that describe the valence dimension,” inThirteenth Annual Conference of the International Speech Communication Association, 2012

work page 2012

[27] [27]

Predicting arousal and valence from waveforms and spectrograms using deep neural networks,

Z. Y ang and J. Hirschberg, “Predicting arousal and valence from waveforms and spectrograms using deep neural networks,”Proc. Interspeech 2018, pp. 3092–3096, 2018

work page 2018

[28] [28]

Retrieving tract variables from acoustics: a comparison of different machine learning strategies,

V . Mitra, H. Nam, C. Y . Espy-Wilson, E. Saltzman, and L. Gold- stein, “Retrieving tract variables from acoustics: a comparison of different machine learning strategies,”IEEE journal of selected topics in signal processing, vol. 4, no. 6, pp. 1027–1045, 2010

work page 2010

[29] [29]

Modeling of articulatory gestures to control effects of production variability on speech technologies,

C. Espy-Wilson, G. Sivaraman, M. Tiede, V . Mitra, E. Saltzmann, L. Goldstein, and H. Nam, “Modeling of articulatory gestures to control effects of production variability on speech technologies,” Rethinking Reduction, pp. 243–276, 2018

work page 2018