Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

Anderson Augusma

arxiv: 2606.07585 · v1 · pith:Y553LVH6new · submitted 2026-05-27 · 💻 cs.CV · cs.AI

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

Anderson Augusma This is my paper

Pith reviewed 2026-06-29 12:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords group emotion recognitionmultimodal fusionprivacy-preservingaudio-videonon-individual approachin-the-wildcross-attentionvariational encoder

0 comments

The pith

Group emotion recognition achieves competitive accuracy using only collective audio-video signals without any individual face, gaze or voice data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that group-level emotion can be inferred in real-world conditions from collective audio and video alone, avoiding the privacy risks of tracking individual faces or voices. It introduces two architectures: one that fuses modalities through cross-attention and pools frames temporally, and another that learns a shared latent space for both emotion labels and structural predictions. These are trained with synthetic augmentation and tested via ablation. A sympathetic reader would care because the approach removes the need for personal biometric inputs while still reaching performance levels that matter for applications like crowd monitoring or social robotics. The work shows that the collective signal itself carries sufficient affective information once properly aggregated and augmented.

Core claim

The thesis demonstrates that competitive performance on group emotion recognition in the wild is possible without using individual features as input. Two frameworks are presented: a cross-attention multimodal model with Frames Attention Pooling for audio-video fusion, and a Variational Encoder Multi-Decoder that learns a shared latent space supporting both classification and structural representation prediction. Experiments with synthetic data augmentation and ablation studies confirm that collective signals suffice for robust real-world results.

What carries the argument

Collective audio-video signals processed through cross-attention fusion and variational multi-decoder latent spaces, which aggregate group-level information without ever extracting individual cues.

If this is right

Cross-attention fusion of audio and video improves temporal aggregation for group-level labels.
Frames Attention Pooling enables effective handling of variable-length video without individual tracking.
The variational encoder-decoder structure separates emotion classification from structural cue prediction while sharing a common representation.
Synthetic augmentation compensates for limited real-world group emotion data and increases robustness.
Performance remains competitive across both group-only and mixed individual-group evaluation settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collective-signal approach could extend to other group-level predictions such as activity coordination or conflict detection without requiring personal identification.
Public deployments in shared spaces could adopt the method to reduce legal exposure under privacy regulations that restrict biometric capture.
If collective signals prove sufficient, future datasets might be collected and shared without face-blurring or voice anonymization steps.
The latent-space separation in the second framework suggests a route to disentangle affective state from visible body configuration for more interpretable group models.

Load-bearing premise

Collective audio-video signals, once synthetically augmented, contain enough information to match or approach the accuracy of models that rely on individual face, gaze or voice cues in real-world group settings.

What would settle it

A controlled test set of labeled group scenes where a model restricted to collective signals scores at least 15 percentage points lower in accuracy than an otherwise identical model given explicit individual face and voice tracks.

Figures

Figures reproduced from arXiv: 2606.07585 by Anderson Augusma.

**Figure 1.1.** Figure 1.1: Smart classroom design (CAC) of the Teaching Lab project [1 [PITH_FULL_IMAGE:figures/full_fig_p014_1_1.png] view at source ↗

**Figure 1.2.** Figure 1.2: Example of individual and group emotions in-the-wild from EMOTIC [PITH_FULL_IMAGE:figures/full_fig_p017_1_2.png] view at source ↗

**Figure 2.1.** Figure 2.1: FACES dataset overview: From left to right, the emotion goes like: Anger, [PITH_FULL_IMAGE:figures/full_fig_p026_2_1.png] view at source ↗

**Figure 2.2.** Figure 2.2: RAF DB overview: six-class basic emotions and twelve-class compound [PITH_FULL_IMAGE:figures/full_fig_p026_2_2.png] view at source ↗

**Figure 2.3.** Figure 2.3: Examples of the 26 feeling categories of EMOTIC dataset. In each category [PITH_FULL_IMAGE:figures/full_fig_p030_2_3.png] view at source ↗

**Figure 2.4.** Figure 2.4: Group Affect Database 2.0 overview: from the top to the bottom, the [PITH_FULL_IMAGE:figures/full_fig_p031_2_4.png] view at source ↗

**Figure 2.5.** Figure 2.5: Group and Scene Database overview: The first two rows contain negative [PITH_FULL_IMAGE:figures/full_fig_p033_2_5.png] view at source ↗

**Figure 3.1.** Figure 3.1: Examples from the VGAF dataset. From left to right: Positive, Neutral, [PITH_FULL_IMAGE:figures/full_fig_p076_3_1.png] view at source ↗

**Figure 3.2.** Figure 3.2: At left, the proposed model is a combination of two monomodal branches, a [PITH_FULL_IMAGE:figures/full_fig_p076_3_2.png] view at source ↗

**Figure 3.3.** Figure 3.3: Vision Transformers Architecture, source [5 [PITH_FULL_IMAGE:figures/full_fig_p077_3_3.png] view at source ↗

**Figure 3.4.** Figure 3.4: Non-individual feature policy. The global image is the input; identity [PITH_FULL_IMAGE:figures/full_fig_p079_3_4.png] view at source ↗

**Figure 3.5.** Figure 3.5: Audio framing aligned with 5 and 75 video frames. [PITH_FULL_IMAGE:figures/full_fig_p080_3_5.png] view at source ↗

**Figure 3.6.** Figure 3.6: Synthetic image process (source [176]). Motivation. Because privacy-preserving processing omits facial and pose crops as inputs, the model may lack localized affective cues. To counterbalance this loss, we design a controlled synthetic data generation process. The intention is to regularize learning and improve generalization without reintroducing individual features as inputs. 3.4.1 Synthetic Video Gene… view at source ↗

**Figure 3.7.** Figure 3.7: Negative, Neutral, Positive grad-cam visualization. (source [1 [PITH_FULL_IMAGE:figures/full_fig_p082_3_7.png] view at source ↗

**Figure 3.8.** Figure 3.8: Ten synthetic images across environments. [PITH_FULL_IMAGE:figures/full_fig_p083_3_8.png] view at source ↗

**Figure 3.9.** Figure 3.9: Example synthetic video (Neutral) composed of seven frames with animated [PITH_FULL_IMAGE:figures/full_fig_p083_3_9.png] view at source ↗

**Figure 3.10.** Figure 3.10: Top: confusion matrices for the multimodal cross-attention model trained [PITH_FULL_IMAGE:figures/full_fig_p092_3_10.png] view at source ↗

**Figure 4.1.** Figure 4.1: Overview of datasets used for experiments in this chapter. The first two [PITH_FULL_IMAGE:figures/full_fig_p097_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: VitPose annotation Structural Representation for body. There are 18 limb [PITH_FULL_IMAGE:figures/full_fig_p099_4_2.png] view at source ↗

**Figure 4.3.** Figure 4.3: Custom Face annotation landmark with FaceAlignment model: They are [PITH_FULL_IMAGE:figures/full_fig_p100_4_3.png] view at source ↗

**Figure 4.4.** Figure 4.4: The proposed VE-MD architecture using a multitask latent space. The left [PITH_FULL_IMAGE:figures/full_fig_p105_4_4.png] view at source ↗

**Figure 4.5.** Figure 4.5: The CUSTom RESidual block. It performs downsampling of the input [PITH_FULL_IMAGE:figures/full_fig_p106_4_5.png] view at source ↗

**Figure 4.6.** Figure 4.6: At left, the network takes the latent space as input. We then apply an aux [PITH_FULL_IMAGE:figures/full_fig_p110_4_6.png] view at source ↗

**Figure 4.7.** Figure 4.7: Emotion Decoder. It receives the latent space as input and optionally [PITH_FULL_IMAGE:figures/full_fig_p115_4_7.png] view at source ↗

**Figure 4.8.** Figure 4.8: Predicted Structural Representation with DETR on MER-MULTI with [PITH_FULL_IMAGE:figures/full_fig_p124_4_8.png] view at source ↗

**Figure 4.9.** Figure 4.9: At the top is the latent space input, followed by a custom UNet upsample [PITH_FULL_IMAGE:figures/full_fig_p128_4_9.png] view at source ↗

**Figure 4.10.** Figure 4.10: Predicted structural representations with Heatmap estimation on the GAF [PITH_FULL_IMAGE:figures/full_fig_p138_4_10.png] view at source ↗

**Figure 4.11.** Figure 4.11: All comparisons are made with ViT, which is considered the baseline for the [PITH_FULL_IMAGE:figures/full_fig_p140_4_11.png] view at source ↗

**Figure 4.12.** Figure 4.12: All comparisons are made with ViT, which is considered the baseline for the [PITH_FULL_IMAGE:figures/full_fig_p141_4_12.png] view at source ↗

**Figure 4.13.** Figure 4.13: Multimodal combination with audio. Features from the two VE-MD latent [PITH_FULL_IMAGE:figures/full_fig_p142_4_13.png] view at source ↗

**Figure 4.14.** Figure 4.14: Multimodal classification head. Outputs from VE_MD (video branch) and the audio encoder pass through self-attention and learned Frames Attention Pooling (FAP), then are concatenated and projected by an MLP for emotion classification. See Section 3.2 for the frames attention pooling details. Single audio encoder. With one audio encoder, we test: • Late fusion: concatenate audio and video embeddings after… view at source ↗

**Figure 4.15.** Figure 4.15: Summary of VE_MD versus SOTA across datasets. Cyan/gray bars [PITH_FULL_IMAGE:figures/full_fig_p153_4_15.png] view at source ↗

read the original abstract

This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The thesis states that collective audio-video signals can deliver competitive group emotion recognition without individual cues, but the abstract supplies no numbers or comparisons to support that.

read the letter

The abstract lays out two architectures for privacy-preserving group emotion recognition: a cross-attention model with frames attention pooling and synthetic augmentation, plus a variational encoder multi-decoder that predicts both emotion labels and structural representations via DETR or heatmap decoders. The privacy framing is the clearest part of the contribution.

What stands out is the consistent focus on avoiding individual face, gaze, or voice inputs. The work applies standard multimodal fusion tools to the non-individual setting and tests two decoding routes for structural cues. That is a reasonable engineering step for anyone already working on affective computing under privacy constraints.

The soft spot is the complete absence of results. The text mentions ablation studies and robustness in real-world conditions but reports no accuracy, F1 scores, dataset names, splits, or direct comparisons against published individual-cue baselines. Without those numbers the central claim—that collective signals alone are sufficient—remains untested. If the full manuscript contains the tables and error bars, they need to be the first thing a reader sees; right now the empirical gap is large.

This is for people already inside multimodal emotion recognition who care about surveillance risks. A reader looking for new theory, first-principles derivations, or proven performance gains will not find them here. The citation pattern looks standard for the subfield and there are no obvious circular derivations.

I would not bring this to a reading group in its current form. I would not cite it. It does not yet deserve peer review; the authors should add the missing quantitative comparisons before an editor sends it out.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes two privacy-preserving frameworks for group emotion recognition (GER) in-the-wild that rely exclusively on collective audio-video signals rather than individual cues such as faces, gaze, or voices. The first combines cross-attention multimodal fusion with Frames Attention Pooling (FAP) and synthetic data augmentation; the second is a Variational Encoder Multi-Decoder (VE-MD) that learns a shared latent space for emotion classification and structural representation prediction using DETR-based or heatmap-based decoders. Ablation studies are cited to support robustness, and the central claim is that competitive performance can be achieved without individual-level input features.

Significance. If the quantitative results hold, the work would provide a concrete demonstration that collective multimodal signals suffice for competitive GER accuracy, advancing privacy-safe affective computing and offering an alternative to individual-cue pipelines in applications such as crowd monitoring.

major comments (3)

[Abstract] Abstract: the claim that 'competitive performance can be achieved without using individual features as input data' is the central thesis but is unsupported by any reported accuracy, F1, or other metrics, dataset names with splits, error bars, or head-to-head comparisons against published individual-feature GER baselines on the same in-the-wild test sets.
[Abstract] Abstract and § on VE-MD: the two decoding strategies (DETR-based and heatmap-based) are introduced to analyze the role of structural representations, yet no quantitative ablation or comparison results are supplied to show whether these representations improve or are necessary for the group-level claim.
[Abstract] Abstract: ablation studies and synthetic augmentation are invoked to demonstrate robustness, but the text supplies neither the quantitative outcomes of those ablations nor the specific collective-signal baselines against which they were measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for highlighting the need to strengthen the abstract with explicit quantitative support. The full manuscript contains the requested metrics, ablations, and comparisons in the experimental sections; we will revise the abstract to make these self-contained while preserving the privacy-preserving focus.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'competitive performance can be achieved without using individual features as input data' is the central thesis but is unsupported by any reported accuracy, F1, or other metrics, dataset names with splits, error bars, or head-to-head comparisons against published individual-feature GER baselines on the same in-the-wild test sets.

Authors: The experimental results section reports accuracy, F1, and other metrics on both synthetic and real in-the-wild datasets (with splits and error bars), including direct comparisons to individual-feature baselines. To address the abstract's self-containment, we will incorporate the key numerical results and dataset references into the revised abstract. revision: yes
Referee: [Abstract] Abstract and § on VE-MD: the two decoding strategies (DETR-based and heatmap-based) are introduced to analyze the role of structural representations, yet no quantitative ablation or comparison results are supplied to show whether these representations improve or are necessary for the group-level claim.

Authors: The VE-MD section provides quantitative ablations comparing DETR-based and heatmap-based decoders, including their effect on group-level emotion classification accuracy. We will add a concise summary of these comparative results to the abstract. revision: yes
Referee: [Abstract] Abstract: ablation studies and synthetic augmentation are invoked to demonstrate robustness, but the text supplies neither the quantitative outcomes of those ablations nor the specific collective-signal baselines against which they were measured.

Authors: The results section details the ablation outcomes (with numerical deltas) and the collective-signal baselines used. We will update the abstract to reference these specific quantitative findings and baselines. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture design with no derivation chain

full rationale

The manuscript describes two multimodal architectures (cross-attention with FAP; VE-MD) trained on collective audio-video signals plus synthetic augmentation, evaluated via ablation studies. No equations, fitted parameters renamed as predictions, self-definitional relations, or load-bearing self-citations appear in the provided text. The central claim of competitive performance without individual features is an empirical sufficiency statement, not a mathematical derivation that reduces to its inputs by construction. External benchmarks and numeric head-to-head results are referenced only at the level of experimental validation, not as self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the unstated assumption that group-level signals suffice for emotion inference.

pith-pipeline@v0.9.1-grok · 5730 in / 1077 out tokens · 20400 ms · 2026-06-29T12:58:01.113334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

265 extracted references · 33 canonical work pages · 15 internal anchors

[1]

Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks

Ali Abedi and Shehroz S Khan. “Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks” . In: arXiv e- prints (2024), arXiv–2403

2024
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. “Flamingo: a visual language model for few-shot learning” . In: Advances in neural information processing systems 35 (2022), pp. 23716–23736

2022
[3]

Speech emotion recognition in conversations using artificial intelligence: a sys- tematic review and meta-analysis

Ghada Alhussein, Ioannis Ziogas, Shiza Saleem, and Leontios J Hadjileontiadis. “Speech emotion recognition in conversations using artificial intelligence: a sys- tematic review and meta-analysis” . In: Artificial Intelligence Review 58.7 (2025), p. 198

2025
[4]

ExCEDA: Unlocking Attention Paradigms in Extended Duration E-Classrooms by Leveraging Attention-Mechanism Models

A vinash Anand, A vni Mittal, Laavanaya Dhawan, Juhi Krishnamurthy, Mahisha Ramesh, Naman Lal, Astha Verma, Pijush Bhuyan, Raijv Ratn Shah, Roger Zimmermann, et al. “ExCEDA: Unlocking Attention Paradigms in Extended Duration E-Classrooms by Leveraging Attention-Mechanism Models” . In: 2024 IEEE 7th International Conference on Multimedia Information Proces...

2024
[5]

Facial emotion recognition in Parkinson’s disease: a review and new hypotheses

Soizic Argaud, Marc Vérin, Paul Sauleau, and Didier Grandjean. “Facial emotion recognition in Parkinson’s disease: a review and new hypotheses” . In: Movement disorders 33.4 (2018), pp. 554–567

2018
[6]

Real-time Convolu- tional Neural Networks for emotion and gender classification

Octavio Arriaga, Matias Valdenegro-Toro, and Paul Plöger. “Real-time Convolu- tional Neural Networks for emotion and gender classification” . In: 27th European Symposium on Artificial Neural Networks, ESANN 2019, Bruges, Belgium, April 24-26, 2019 . 2019, pp. 221–226

2019
[7]

BODY LANGUAGE IN- TERPRETATION: PSYCHOPHYSIOLOGICAL AND COGNITIVE ASPECTS

Rustamjon Asatullaev and Diyora Muxamedjonova. “BODY LANGUAGE IN- TERPRETATION: PSYCHOPHYSIOLOGICAL AND COGNITIVE ASPECTS” . In: Journal of Applied Science and Social Science 1.1 (2025), pp. 456–458

2025
[8]

Multimodal Perception and Statistical Modeling of Ped- agogical Classroom Events Using a Privacy-safe Non-individual Approach

Anderson Augusma. “Multimodal Perception and Statistical Modeling of Ped- agogical Classroom Events Using a Privacy-safe Non-individual Approach” . In: 2022 10th International Conference on Affective Computing and Intelligent In- teraction Workshops and Demos (ACIIW) . IEEE. 2022, pp. 1–5

2022
[9]

Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

Anderson Augusma, Dominique Vaufreydaz, and Frédérique Letué. “Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features” . In: Proceedings of the 25th International Conference on Multimodal Interaction . 2023, pp. 750–754

2023
[10]

Enhancing Sentiment Analysis With Emo- tion And Sarcasm Detection: A Transformer-Based Approach

Mr Suryavamshi Sandeep Babu, SV Suryanarayana, M Sruthi, P Bhagya Lak- shmi, T Sravanthi, and M Spandana. “Enhancing Sentiment Analysis With Emo- tion And Sarcasm Detection: A Transformer-Based Approach” . In: Metallurgical and Materials Engineering (2025), pp. 794–803

2025
[11]

wav2vec 2.0: A framework for self-supervised learning of speech representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. “wav2vec 2.0: A framework for self-supervised learning of speech representations” . In: Ad- vances in neural information processing systems 33 (2020), pp. 12449–12460. BIBLIOGRAPHY 175

2020
[12]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. “Qwen technical report” . In: arXiv preprint arXiv:2309.16609 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Facial expression- based emotion recognition across diverse age groups: a multi-scale vision trans- former with contrastive learning approach

G Balachandran, S Ranjith, TR Chenthil, and GC Jagan. “Facial expression- based emotion recognition across diverse age groups: a multi-scale vision trans- former with contrastive learning approach” . In: Journal of Combinatorial Opti- mization 49.1 (2025), pp. 1–39

2025
[14]

Natural Language Processing for Sentiment Analysis in Social Media Marketing

Murat Başal. “Natural Language Processing for Sentiment Analysis in Social Media Marketing” . In: Economics 12.1 (2025), pp. 39–51

2025
[15]

Sentiment prediction based on dempster-shafer theory of evidence

Mohammad Ehsan Basiri, Ahmad Reza Naghsh-Nilchi, and Nasser Ghasem- Aghaee. “Sentiment prediction based on dempster-shafer theory of evidence” . In: Mathematical Problems in Engineering 2014.1 (2014), p. 361201

2014
[16]

Semantic-emotion neural network for emotion recognition from text

Erdenebileg Batbaatar, Meijing Li, and Keun Ho Ryu. “Semantic-emotion neural network for emotion recognition from text” . In:IEEE access 7 (2019), pp. 111866– 111878

2019
[17]

INTERPRETING BODY LANGUAGE: A SCIENTIFIC PERSPECTIVE

Asatullayev Rustamjon Baxtiyarovich and Boboqulovc Behruz Bahodirovich. “INTERPRETING BODY LANGUAGE: A SCIENTIFIC PERSPECTIVE” . In: YANGI O ‘ZBEKISTON, YANGI TADQIQOTLAR JURNALI 2.5 (2025), pp. 143– 146

2025
[18]

Group-Level Affect Recognition in Video Using Deviation of Frame Features

Natalya S Belova. “Group-Level Affect Recognition in Video Using Deviation of Frame Features” . In: Analysis of Images, Social Networks and Texts: 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers. Vol. 13217. Springer Nature. 2022, p. 199

2021
[19]

Is space-time attention all you need for video understanding?

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. “Is space-time attention all you need for video understanding?” In: ICML. Vol. 2. 3. 2021, p. 4. 176 BIBLIOGRAPHY

2021
[20]

Learning privacy-enhancing face representations through feature disentanglement

Blaž Bortolato, Marija Ivanovska, Peter Rot, Janez Križaj, Philipp Terhörst, Naser Damer, Peter Peer, and Vitomir Štruc. “Learning privacy-enhancing face representations through feature disentanglement” . In: 2020 15th IEEE Interna- tional Conference on Automatic Face and Gesture Recognition (FG 2020) . IEEE. 2020, pp. 495–502

2020
[21]

Legal and Regulatory Perspec- tives on Synthetic Data as an Anonymization Strategy

Alexander Boudewijn and Andrea F Ferraris. “Legal and Regulatory Perspec- tives on Synthetic Data as an Anonymization Strategy” . In: J. Pers. Data Prot. L. (2024), p. 17

2024
[22]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners” . In: Advances in neural information processing systems 33 (2020), pp. 1877–1901

2020
[23]

SAMSEMO: New dataset for multilingual and multimodal emotion recognition

Paweł Bujnowski, Bartłomiej Kuźma, Bartłomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, and Piotr Andruszkiewicz. “SAMSEMO: New dataset for multilingual and multimodal emotion recognition” . In: Inter- speech. 2024

2024
[24]

How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)

Adrian Bulat and Georgios Tzimiropoulos. “How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)” . In: Proceedings of the IEEE international conference on computer vision . 2017, pp. 1021–1030

2017
[25]

A database of German emotional speech

Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter F Sendlmeier, Ben- jamin Weiss, et al. “A database of German emotional speech. ” In: Interspeech. Vol. 5. 2005, pp. 1517–1520

2005
[26]

IEMOCAP: Interactive Emotional Dyadic Motion Capture Database

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database” . In: Jour- BIBLIOGRAPHY 177 nal of Language Resources and Evaluation 42.4 (2008), pp. 335–359. doi: 10. 1007/s10579-008-9076-6

2008
[27]

MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception

Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost. “MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception” . In: IEEE Transactions on Affective Computing 8.1 (2016), pp. 67–80

2016
[28]

Human observers and automated assessment of dynamic emotional facial ex- pressions: KDEF-dyn database validation

Manuel G Calvo, Andrés Fernández-Martín, Guillermo Recio, and Daniel Lundqvist. “Human observers and automated assessment of dynamic emotional facial ex- pressions: KDEF-dyn database validation” . In: Frontiers in psychology 9 (2018), p. 2052

2018
[29]

The EU’s AI act: A framework for collaborative gover- nance

Celso Cancela-Outeda. “The EU’s AI act: A framework for collaborative gover- nance” . In: Internet of Things 27 (2024), p. 101291

2024
[30]

Crema-d: Crowd-sourced emotional multimodal actors dataset

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. “Crema-d: Crowd-sourced emotional multimodal actors dataset” . In: IEEE transactions on affective computing 5.4 (2014), pp. 377–390

2014
[31]

Deep learning-based depression recognition through facial expression: A systematic review

Xiaoming Cao, Lingling Zhai, Pengpeng Zhai, Fangfei Li, Tao He, and Lang He. “Deep learning-based depression recognition through facial expression: A systematic review” . In: Neurocomputing (2025), p. 129605

2025
[32]

Open- pose: Realtime multi-person 2d pose estimation using part aﬀinity fields

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “Open- pose: Realtime multi-person 2d pose estimation using part aﬀinity fields” . In: IEEE transactions on pattern analysis and machine intelligence 43.1 (2019), pp. 172–186

2019
[33]

Realtime multi-person 2d pose estimation using part aﬀinity fields

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “Realtime multi-person 2d pose estimation using part aﬀinity fields” . In: Proceedings of the IEEE con- ference on computer vision and pattern recognition . 2017, pp. 7291–7299. 178 BIBLIOGRAPHY

2017
[34]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-end object detection with transformers” . In: European conference on computer vision . Springer. 2020, pp. 213–229

2020
[35]

CDGT: Constructing diverse graph transformers for emotion recognition from facial videos

Dongliang Chen, Guihua Wen, Huihui Li, Pei Yang, Chuyun Chen, and Bao Wang. “CDGT: Constructing diverse graph transformers for emotion recognition from facial videos” . In: Neural Networks 179 (2024), p. 106573

2024
[36]

En- hancing robustness against adversarial attacks in multimodal emotion recogni- tion with spiking transformers

Guoming Chen, Zhuoxian Qian, Dong Zhang, Shuang Qiu, and Ruqi Zhou. “En- hancing robustness against adversarial attacks in multimodal emotion recogni- tion with spiking transformers” . In: IEEE Access (2025)

2025
[37]

Finecliper: Multi-modal fine-grained clip for dynamic facial expression recogni- tion with adapters

Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao. “Finecliper: Multi-modal fine-grained clip for dynamic facial expression recogni- tion with adapters” . In: Proceedings of the 32nd ACM International Conference on Multimedia. 2024, pp. 2301–2310

2024
[38]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. “Shikra: Unleashing multimodal llm’s referential dialogue magic” . In:arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Wavlm: Large- scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. “Wavlm: Large- scale self-supervised pre-training for full stack speech processing” . In: IEEE Jour- nal of Selected Topics in Signal Processing 16.6 (2022), pp. 1505–1518

2022
[40]

System description for voice privacy challenge 2022

Xiaojiao Chen, Guangxing Li, Hao Huang, Wangjin Zhou, Sheng Li, Yang Cao, and Yi Zhao. “System description for voice privacy challenge 2022” . In: Proc. 2nd Symposium on Security and Privacy in Speech Communication . 2022

2022
[41]

Emotion-llama: Multimodal emo- tion recognition and reasoning with instruction tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. “Emotion-llama: Multimodal emo- tion recognition and reasoning with instruction tuning” . In: Advances in Neural Information Processing Systems 37 (2024), pp. 110805–110853. BIBLIOGRAPHY 179

2024
[42]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023” . In: URL https://lmsys. org/blog/2023-03-30-vicuna 3.5 (2023)

2023
[43]

Stargan: Unified generative adversarial networks for multi-domain image-to-image translation

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation” . In:Proceedings of the IEEE conference on computer vision and pattern recognition . 2018, pp. 8789–8797

2018
[44]

MMA- DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expres- sion Recognition in-the-wild

Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. “MMA- DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expres- sion Recognition in-the-wild” . In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 2024, pp. 4673–4682

2024
[45]

Canonical cor- relation analysis for data fusion and group inferences

Nicolle M Correa, Tulay Adali, Yi-Ou Li, and Vince D Calhoun. “Canonical cor- relation analysis for data fusion and group inferences” . In: IEEE signal processing magazine 27.4 (2010), pp. 39–50

2010
[46]

Facial Emotion Recog- nition and Classification Using the Convolutional Neural Network-10 (CNN- 10)

Emmanuel Gbenga Dada, David Opeoluwa Oyewola, Stephen Bassi Joseph, Onyeka Emebo, and Olugbenga Oluseun Oluwagbemi. “Facial Emotion Recog- nition and Classification Using the Convolutional Neural Network-10 (CNN- 10)” . In: Applied Computational Intelligence and Soft Computing 2023.1 (2023), p. 2457898

2023
[47]

What else does your biometric data reveal? A survey on soft biometrics

Antitza Dantcheva, Petros Elia, and Arun Ross. “What else does your biometric data reveal? A survey on soft biometrics” . In: IEEE Transactions on Information Forensics and Security 11.3 (2015), pp. 441–467

2015
[48]

Detection and analysis of emotion from speech signals

Assel Davletcharova, Sherin Sugathan, Bibia Abraham, and Alex Pappachen James. “Detection and analysis of emotion from speech signals” . In: Procedia Computer Science 58 (2015), pp. 91–96. 180 BIBLIOGRAPHY

2015
[49]

A generalization of Bayesian inference

Arthur P Dempster. “A generalization of Bayesian inference” . In: Journal of the Royal Statistical Society: Series B (Methodological) 30.2 (1968), pp. 205–232

1968
[50]

From individual to group-level emotion recognition: Emotiw 5.0

Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. “From individual to group-level emotion recognition: Emotiw 5.0” . In: Proceedings of the 19th ACM international conference on multimodal interaction . 2017, pp. 524–528

2017
[51]

Finding happiest moments in a social context

Abhinav Dhall, Jyoti Joshi, Ibrahim Radwan, and Roland Goecke. “Finding happiest moments in a social context” . In: Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part II 11 . Springer. 2013, pp. 613–626

2012
[52]

The more the merrier: Analysing the affect of a group of people in images

Abhinav Dhall, Jyoti Joshi, Karan Sikka, Roland Goecke, and Nicu Sebe. “The more the merrier: Analysing the affect of a group of people in images” . In: 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG). Vol. 1. IEEE. 2015, pp. 1–8

2015
[53]

Emotiw 2018: Audio-video, student engagement and group-level affect prediction

Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. “Emotiw 2018: Audio-video, student engagement and group-level affect prediction” . In: Proceed- ings of the 20th ACM International Conference on Multimodal Interaction . 2018, pp. 653–656

2018
[54]

Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges

Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. “Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges” . In: Proceedings of the 2020 International Conference on Mul- timodal Interaction. 2020, pp. 784–789

2020
[55]

EmotiW 2023: Emotion Recognition in the Wild Challenge

Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, and Kazushi Ikeda. “EmotiW 2023: Emotion Recognition in the Wild Challenge” . In: Proceedings of the 25th International Conference on Multi- modal Interaction (ICMI 2023) . 2023. BIBLIOGRAPHY 181

2023
[56]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” . In:CoRR abs/2010.11929 (2020). arXiv: 2010.11929. url: https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2010
[57]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” . In: ICLR (2021)

2021
[58]

Speech emotion recognition based on spiking neural network and convolutional neural network

Chengyan Du, Fu Liu, Bing Kang, and Tao Hou. “Speech emotion recognition based on spiking neural network and convolutional neural network” . In: Engi- neering Applications of Artificial Intelligence 147 (2025), p. 110314

2025
[59]

Training generative neural networks via maximum mean discrepancy optimization

Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. “Training generative neural networks via maximum mean discrepancy optimization” . In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelli- gence. 2015, pp. 258–267

2015
[60]

F ACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation

Natalie C Ebner, Michaela Riediger, and Ulman Lindenberger. “F ACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation” . In: Behavior research methods 42 (2010), pp. 351– 362

2010
[61]

F ACES-a database of facial expressions in young, middle-aged, and older women and men: Develop- ment and validation

Natalie C. Ebner, Michaela Riediger, and Ulman Lindenberger. “F ACES-a database of facial expressions in young, middle-aged, and older women and men: Develop- ment and validation” . In: Behavior Research Methods 42 (1 Feb. 2010), pp. 351–

2010
[62]

doi: 10.3758/BRM.42.1.351

issn: 1554351X. doi: 10.3758/BRM.42.1.351

work page doi:10.3758/brm.42.1.351
[63]

Are there basic emotions?

Paul Ekman. “Are there basic emotions?” In: (1992). 182 BIBLIOGRAPHY

1992
[64]

Emonext: an adapted convnext for facial emotion recognition

Yassine El Boudouri and Amine Bohi. “Emonext: an adapted convnext for facial emotion recognition” . In:2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP) . IEEE. 2023, pp. 1–6

2023
[65]

On the multivariate Laplace distribution

Torbjørn Eltoft, Taesu Kim, and Te-Won Lee. “On the multivariate Laplace distribution” . In: IEEE Signal Processing Letters 13.5 (2006), pp. 300–303

2006
[66]

Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention

Lev Evtodienko. “Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention” . In: CoRR abs/2111.05890 (2021). arXiv: 2111.05890. url: https://arxiv.org/abs/2111.05890

work page arXiv 2021
[67]

Emotion recognition from unimodal to multimodal analysis: A review

Kaouther Ezzameli and Hela Mahersia. “Emotion recognition from unimodal to multimodal analysis: A review” . In: Information Fusion 99 (2023), p. 101847

2023
[68]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. “Eva: Exploring the limits of masked visual representation learning at scale” . In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition . 2023, pp. 19358–19369

2023
[69]

Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition

Pasquale Foggia, Antonio Greco, Alessia Saggese, and Mario Vento. “Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition” . In: Engineering Applications of Artificial Intelligence 118 (2023), p. 105651

2023
[70]

Emoclip: A vision-language method for zero-shot video facial expression recognition

Niki Maria Foteinopoulou and Ioannis Patras. “Emoclip: A vision-language method for zero-shot video facial expression recognition” . In: 2024 IEEE 18th Interna- tional Conference on Automatic Face and Gesture Recognition (FG). IEEE. 2024, pp. 1–10

2024
[71]

Emotion experience

Nico Frijda. “Emotion experience” . In: Cognition and Emotion 19.4 (2005), pp. 473–

2005
[72]

doi: 10.1080/02699930441000346

work page doi:10.1080/02699930441000346
[73]

Percep- tion of expressed emotion among persons with mental illness

Sailaxmi Gandhi, Narayanasamy Padmavathi, Rajil Raveendran, Prabhu Jad- hav, Maya Sahu, Jothimani Gurusamy, and Krishna Prasad Muliyala. “Percep- tion of expressed emotion among persons with mental illness” . In: Journal of Psychosocial Rehabilitation and Mental Health 7 (2020), pp. 121–130. BIBLIOGRAPHY 183

2020
[74]

Multimodal and temporal perception of audio-visual cues for emotion recognition

Esam Ghaleb, Mirela Popa, and Stylianos Asteriadis. “Multimodal and temporal perception of audio-visual cues for emotion recognition” . In: 2019 8th Interna- tional Conference on Affective Computing and Intelligent Interaction (ACII) . IEEE. 2019, pp. 552–558

2019
[75]

Automatic group affect anal- ysis in images via visual attribute and feature networks

Shreya Ghosh, Abhinav Dhall, and Nicu Sebe. “Automatic group affect anal- ysis in images via visual attribute and feature networks” . In: 2018 25th IEEE International Conference on Image Processing (ICIP) . IEEE. 2018, pp. 1967– 1971

2018
[76]

Predicting group cohesiveness in images

Shreya Ghosh, Abhinav Dhall, Nicu Sebe, and Tom Gedeon. “Predicting group cohesiveness in images” . In: 2019 International Joint Conference on Neural Net- works (IJCNN) . IEEE. 2019, pp. 1–8

2019
[77]

Dynamical variational autoencoders: A comprehensive review

Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. “Dynamical variational autoencoders: A comprehensive review” . In: arXiv preprint arXiv:2008.12595 (2020)

work page arXiv 2008
[78]

A Hybrid Fusion Model for Group-Level Emotion Recogni- tion in Complex Scenarios

Wenjuan Gong, Yifan Wang, Yikai Wu, Shuaipeng Gao, Athanasios V Vasilakos, and Peiying Zhang. “A Hybrid Fusion Model for Group-Level Emotion Recogni- tion in Complex Scenarios” . In: Information Sciences (2025), p. 121968

2025
[79]

Generative adver- sarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adver- sarial networks” . In: Communications of the ACM 63.11 (2020), pp. 139–144

2020
[80]

Challenges in representation learning: A report on three machine learning contests

Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. “Challenges in representation learning: A report on three machine learning contests” . In: Neural information processing: 20th international confer- ence, ICONIP 2013, daegu, korea, november 3-7, ...

2013

Showing first 80 references.

[1] [1]

Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks

Ali Abedi and Shehroz S Khan. “Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks” . In: arXiv e- prints (2024), arXiv–2403

2024

[2] [2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. “Flamingo: a visual language model for few-shot learning” . In: Advances in neural information processing systems 35 (2022), pp. 23716–23736

2022

[3] [3]

Speech emotion recognition in conversations using artificial intelligence: a sys- tematic review and meta-analysis

Ghada Alhussein, Ioannis Ziogas, Shiza Saleem, and Leontios J Hadjileontiadis. “Speech emotion recognition in conversations using artificial intelligence: a sys- tematic review and meta-analysis” . In: Artificial Intelligence Review 58.7 (2025), p. 198

2025

[4] [4]

ExCEDA: Unlocking Attention Paradigms in Extended Duration E-Classrooms by Leveraging Attention-Mechanism Models

A vinash Anand, A vni Mittal, Laavanaya Dhawan, Juhi Krishnamurthy, Mahisha Ramesh, Naman Lal, Astha Verma, Pijush Bhuyan, Raijv Ratn Shah, Roger Zimmermann, et al. “ExCEDA: Unlocking Attention Paradigms in Extended Duration E-Classrooms by Leveraging Attention-Mechanism Models” . In: 2024 IEEE 7th International Conference on Multimedia Information Proces...

2024

[5] [5]

Facial emotion recognition in Parkinson’s disease: a review and new hypotheses

Soizic Argaud, Marc Vérin, Paul Sauleau, and Didier Grandjean. “Facial emotion recognition in Parkinson’s disease: a review and new hypotheses” . In: Movement disorders 33.4 (2018), pp. 554–567

2018

[6] [6]

Real-time Convolu- tional Neural Networks for emotion and gender classification

Octavio Arriaga, Matias Valdenegro-Toro, and Paul Plöger. “Real-time Convolu- tional Neural Networks for emotion and gender classification” . In: 27th European Symposium on Artificial Neural Networks, ESANN 2019, Bruges, Belgium, April 24-26, 2019 . 2019, pp. 221–226

2019

[7] [7]

BODY LANGUAGE IN- TERPRETATION: PSYCHOPHYSIOLOGICAL AND COGNITIVE ASPECTS

Rustamjon Asatullaev and Diyora Muxamedjonova. “BODY LANGUAGE IN- TERPRETATION: PSYCHOPHYSIOLOGICAL AND COGNITIVE ASPECTS” . In: Journal of Applied Science and Social Science 1.1 (2025), pp. 456–458

2025

[8] [8]

Multimodal Perception and Statistical Modeling of Ped- agogical Classroom Events Using a Privacy-safe Non-individual Approach

Anderson Augusma. “Multimodal Perception and Statistical Modeling of Ped- agogical Classroom Events Using a Privacy-safe Non-individual Approach” . In: 2022 10th International Conference on Affective Computing and Intelligent In- teraction Workshops and Demos (ACIIW) . IEEE. 2022, pp. 1–5

2022

[9] [9]

Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

Anderson Augusma, Dominique Vaufreydaz, and Frédérique Letué. “Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features” . In: Proceedings of the 25th International Conference on Multimodal Interaction . 2023, pp. 750–754

2023

[10] [10]

Enhancing Sentiment Analysis With Emo- tion And Sarcasm Detection: A Transformer-Based Approach

Mr Suryavamshi Sandeep Babu, SV Suryanarayana, M Sruthi, P Bhagya Lak- shmi, T Sravanthi, and M Spandana. “Enhancing Sentiment Analysis With Emo- tion And Sarcasm Detection: A Transformer-Based Approach” . In: Metallurgical and Materials Engineering (2025), pp. 794–803

2025

[11] [11]

wav2vec 2.0: A framework for self-supervised learning of speech representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. “wav2vec 2.0: A framework for self-supervised learning of speech representations” . In: Ad- vances in neural information processing systems 33 (2020), pp. 12449–12460. BIBLIOGRAPHY 175

2020

[12] [12]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. “Qwen technical report” . In: arXiv preprint arXiv:2309.16609 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Facial expression- based emotion recognition across diverse age groups: a multi-scale vision trans- former with contrastive learning approach

G Balachandran, S Ranjith, TR Chenthil, and GC Jagan. “Facial expression- based emotion recognition across diverse age groups: a multi-scale vision trans- former with contrastive learning approach” . In: Journal of Combinatorial Opti- mization 49.1 (2025), pp. 1–39

2025

[14] [14]

Natural Language Processing for Sentiment Analysis in Social Media Marketing

Murat Başal. “Natural Language Processing for Sentiment Analysis in Social Media Marketing” . In: Economics 12.1 (2025), pp. 39–51

2025

[15] [15]

Sentiment prediction based on dempster-shafer theory of evidence

Mohammad Ehsan Basiri, Ahmad Reza Naghsh-Nilchi, and Nasser Ghasem- Aghaee. “Sentiment prediction based on dempster-shafer theory of evidence” . In: Mathematical Problems in Engineering 2014.1 (2014), p. 361201

2014

[16] [16]

Semantic-emotion neural network for emotion recognition from text

Erdenebileg Batbaatar, Meijing Li, and Keun Ho Ryu. “Semantic-emotion neural network for emotion recognition from text” . In:IEEE access 7 (2019), pp. 111866– 111878

2019

[17] [17]

INTERPRETING BODY LANGUAGE: A SCIENTIFIC PERSPECTIVE

Asatullayev Rustamjon Baxtiyarovich and Boboqulovc Behruz Bahodirovich. “INTERPRETING BODY LANGUAGE: A SCIENTIFIC PERSPECTIVE” . In: YANGI O ‘ZBEKISTON, YANGI TADQIQOTLAR JURNALI 2.5 (2025), pp. 143– 146

2025

[18] [18]

Group-Level Affect Recognition in Video Using Deviation of Frame Features

Natalya S Belova. “Group-Level Affect Recognition in Video Using Deviation of Frame Features” . In: Analysis of Images, Social Networks and Texts: 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers. Vol. 13217. Springer Nature. 2022, p. 199

2021

[19] [19]

Is space-time attention all you need for video understanding?

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. “Is space-time attention all you need for video understanding?” In: ICML. Vol. 2. 3. 2021, p. 4. 176 BIBLIOGRAPHY

2021

[20] [20]

Learning privacy-enhancing face representations through feature disentanglement

Blaž Bortolato, Marija Ivanovska, Peter Rot, Janez Križaj, Philipp Terhörst, Naser Damer, Peter Peer, and Vitomir Štruc. “Learning privacy-enhancing face representations through feature disentanglement” . In: 2020 15th IEEE Interna- tional Conference on Automatic Face and Gesture Recognition (FG 2020) . IEEE. 2020, pp. 495–502

2020

[21] [21]

Legal and Regulatory Perspec- tives on Synthetic Data as an Anonymization Strategy

Alexander Boudewijn and Andrea F Ferraris. “Legal and Regulatory Perspec- tives on Synthetic Data as an Anonymization Strategy” . In: J. Pers. Data Prot. L. (2024), p. 17

2024

[22] [22]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners” . In: Advances in neural information processing systems 33 (2020), pp. 1877–1901

2020

[23] [23]

SAMSEMO: New dataset for multilingual and multimodal emotion recognition

Paweł Bujnowski, Bartłomiej Kuźma, Bartłomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, and Piotr Andruszkiewicz. “SAMSEMO: New dataset for multilingual and multimodal emotion recognition” . In: Inter- speech. 2024

2024

[24] [24]

How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)

Adrian Bulat and Georgios Tzimiropoulos. “How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)” . In: Proceedings of the IEEE international conference on computer vision . 2017, pp. 1021–1030

2017

[25] [25]

A database of German emotional speech

Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter F Sendlmeier, Ben- jamin Weiss, et al. “A database of German emotional speech. ” In: Interspeech. Vol. 5. 2005, pp. 1517–1520

2005

[26] [26]

IEMOCAP: Interactive Emotional Dyadic Motion Capture Database

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database” . In: Jour- BIBLIOGRAPHY 177 nal of Language Resources and Evaluation 42.4 (2008), pp. 335–359. doi: 10. 1007/s10579-008-9076-6

2008

[27] [27]

MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception

Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost. “MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception” . In: IEEE Transactions on Affective Computing 8.1 (2016), pp. 67–80

2016

[28] [28]

Human observers and automated assessment of dynamic emotional facial ex- pressions: KDEF-dyn database validation

Manuel G Calvo, Andrés Fernández-Martín, Guillermo Recio, and Daniel Lundqvist. “Human observers and automated assessment of dynamic emotional facial ex- pressions: KDEF-dyn database validation” . In: Frontiers in psychology 9 (2018), p. 2052

2018

[29] [29]

The EU’s AI act: A framework for collaborative gover- nance

Celso Cancela-Outeda. “The EU’s AI act: A framework for collaborative gover- nance” . In: Internet of Things 27 (2024), p. 101291

2024

[30] [30]

Crema-d: Crowd-sourced emotional multimodal actors dataset

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. “Crema-d: Crowd-sourced emotional multimodal actors dataset” . In: IEEE transactions on affective computing 5.4 (2014), pp. 377–390

2014

[31] [31]

Deep learning-based depression recognition through facial expression: A systematic review

Xiaoming Cao, Lingling Zhai, Pengpeng Zhai, Fangfei Li, Tao He, and Lang He. “Deep learning-based depression recognition through facial expression: A systematic review” . In: Neurocomputing (2025), p. 129605

2025

[32] [32]

Open- pose: Realtime multi-person 2d pose estimation using part aﬀinity fields

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “Open- pose: Realtime multi-person 2d pose estimation using part aﬀinity fields” . In: IEEE transactions on pattern analysis and machine intelligence 43.1 (2019), pp. 172–186

2019

[33] [33]

Realtime multi-person 2d pose estimation using part aﬀinity fields

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “Realtime multi-person 2d pose estimation using part aﬀinity fields” . In: Proceedings of the IEEE con- ference on computer vision and pattern recognition . 2017, pp. 7291–7299. 178 BIBLIOGRAPHY

2017

[34] [34]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-end object detection with transformers” . In: European conference on computer vision . Springer. 2020, pp. 213–229

2020

[35] [35]

CDGT: Constructing diverse graph transformers for emotion recognition from facial videos

Dongliang Chen, Guihua Wen, Huihui Li, Pei Yang, Chuyun Chen, and Bao Wang. “CDGT: Constructing diverse graph transformers for emotion recognition from facial videos” . In: Neural Networks 179 (2024), p. 106573

2024

[36] [36]

En- hancing robustness against adversarial attacks in multimodal emotion recogni- tion with spiking transformers

Guoming Chen, Zhuoxian Qian, Dong Zhang, Shuang Qiu, and Ruqi Zhou. “En- hancing robustness against adversarial attacks in multimodal emotion recogni- tion with spiking transformers” . In: IEEE Access (2025)

2025

[37] [37]

Finecliper: Multi-modal fine-grained clip for dynamic facial expression recogni- tion with adapters

Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao. “Finecliper: Multi-modal fine-grained clip for dynamic facial expression recogni- tion with adapters” . In: Proceedings of the 32nd ACM International Conference on Multimedia. 2024, pp. 2301–2310

2024

[38] [38]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. “Shikra: Unleashing multimodal llm’s referential dialogue magic” . In:arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Wavlm: Large- scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. “Wavlm: Large- scale self-supervised pre-training for full stack speech processing” . In: IEEE Jour- nal of Selected Topics in Signal Processing 16.6 (2022), pp. 1505–1518

2022

[40] [40]

System description for voice privacy challenge 2022

Xiaojiao Chen, Guangxing Li, Hao Huang, Wangjin Zhou, Sheng Li, Yang Cao, and Yi Zhao. “System description for voice privacy challenge 2022” . In: Proc. 2nd Symposium on Security and Privacy in Speech Communication . 2022

2022

[41] [41]

Emotion-llama: Multimodal emo- tion recognition and reasoning with instruction tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. “Emotion-llama: Multimodal emo- tion recognition and reasoning with instruction tuning” . In: Advances in Neural Information Processing Systems 37 (2024), pp. 110805–110853. BIBLIOGRAPHY 179

2024

[42] [42]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023” . In: URL https://lmsys. org/blog/2023-03-30-vicuna 3.5 (2023)

2023

[43] [43]

Stargan: Unified generative adversarial networks for multi-domain image-to-image translation

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation” . In:Proceedings of the IEEE conference on computer vision and pattern recognition . 2018, pp. 8789–8797

2018

[44] [44]

MMA- DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expres- sion Recognition in-the-wild

Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. “MMA- DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expres- sion Recognition in-the-wild” . In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 2024, pp. 4673–4682

2024

[45] [45]

Canonical cor- relation analysis for data fusion and group inferences

Nicolle M Correa, Tulay Adali, Yi-Ou Li, and Vince D Calhoun. “Canonical cor- relation analysis for data fusion and group inferences” . In: IEEE signal processing magazine 27.4 (2010), pp. 39–50

2010

[46] [46]

Facial Emotion Recog- nition and Classification Using the Convolutional Neural Network-10 (CNN- 10)

Emmanuel Gbenga Dada, David Opeoluwa Oyewola, Stephen Bassi Joseph, Onyeka Emebo, and Olugbenga Oluseun Oluwagbemi. “Facial Emotion Recog- nition and Classification Using the Convolutional Neural Network-10 (CNN- 10)” . In: Applied Computational Intelligence and Soft Computing 2023.1 (2023), p. 2457898

2023

[47] [47]

What else does your biometric data reveal? A survey on soft biometrics

Antitza Dantcheva, Petros Elia, and Arun Ross. “What else does your biometric data reveal? A survey on soft biometrics” . In: IEEE Transactions on Information Forensics and Security 11.3 (2015), pp. 441–467

2015

[48] [48]

Detection and analysis of emotion from speech signals

Assel Davletcharova, Sherin Sugathan, Bibia Abraham, and Alex Pappachen James. “Detection and analysis of emotion from speech signals” . In: Procedia Computer Science 58 (2015), pp. 91–96. 180 BIBLIOGRAPHY

2015

[49] [49]

A generalization of Bayesian inference

Arthur P Dempster. “A generalization of Bayesian inference” . In: Journal of the Royal Statistical Society: Series B (Methodological) 30.2 (1968), pp. 205–232

1968

[50] [50]

From individual to group-level emotion recognition: Emotiw 5.0

Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. “From individual to group-level emotion recognition: Emotiw 5.0” . In: Proceedings of the 19th ACM international conference on multimodal interaction . 2017, pp. 524–528

2017

[51] [51]

Finding happiest moments in a social context

Abhinav Dhall, Jyoti Joshi, Ibrahim Radwan, and Roland Goecke. “Finding happiest moments in a social context” . In: Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part II 11 . Springer. 2013, pp. 613–626

2012

[52] [52]

The more the merrier: Analysing the affect of a group of people in images

Abhinav Dhall, Jyoti Joshi, Karan Sikka, Roland Goecke, and Nicu Sebe. “The more the merrier: Analysing the affect of a group of people in images” . In: 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG). Vol. 1. IEEE. 2015, pp. 1–8

2015

[53] [53]

Emotiw 2018: Audio-video, student engagement and group-level affect prediction

Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. “Emotiw 2018: Audio-video, student engagement and group-level affect prediction” . In: Proceed- ings of the 20th ACM International Conference on Multimodal Interaction . 2018, pp. 653–656

2018

[54] [54]

Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges

Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. “Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges” . In: Proceedings of the 2020 International Conference on Mul- timodal Interaction. 2020, pp. 784–789

2020

[55] [55]

EmotiW 2023: Emotion Recognition in the Wild Challenge

Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, and Kazushi Ikeda. “EmotiW 2023: Emotion Recognition in the Wild Challenge” . In: Proceedings of the 25th International Conference on Multi- modal Interaction (ICMI 2023) . 2023. BIBLIOGRAPHY 181

2023

[56] [56]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” . In:CoRR abs/2010.11929 (2020). arXiv: 2010.11929. url: https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2010

[57] [57]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” . In: ICLR (2021)

2021

[58] [58]

Speech emotion recognition based on spiking neural network and convolutional neural network

Chengyan Du, Fu Liu, Bing Kang, and Tao Hou. “Speech emotion recognition based on spiking neural network and convolutional neural network” . In: Engi- neering Applications of Artificial Intelligence 147 (2025), p. 110314

2025

[59] [59]

Training generative neural networks via maximum mean discrepancy optimization

Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. “Training generative neural networks via maximum mean discrepancy optimization” . In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelli- gence. 2015, pp. 258–267

2015

[60] [60]

F ACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation

Natalie C Ebner, Michaela Riediger, and Ulman Lindenberger. “F ACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation” . In: Behavior research methods 42 (2010), pp. 351– 362

2010

[61] [61]

F ACES-a database of facial expressions in young, middle-aged, and older women and men: Develop- ment and validation

Natalie C. Ebner, Michaela Riediger, and Ulman Lindenberger. “F ACES-a database of facial expressions in young, middle-aged, and older women and men: Develop- ment and validation” . In: Behavior Research Methods 42 (1 Feb. 2010), pp. 351–

2010

[62] [62]

doi: 10.3758/BRM.42.1.351

issn: 1554351X. doi: 10.3758/BRM.42.1.351

work page doi:10.3758/brm.42.1.351

[63] [63]

Are there basic emotions?

Paul Ekman. “Are there basic emotions?” In: (1992). 182 BIBLIOGRAPHY

1992

[64] [64]

Emonext: an adapted convnext for facial emotion recognition

Yassine El Boudouri and Amine Bohi. “Emonext: an adapted convnext for facial emotion recognition” . In:2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP) . IEEE. 2023, pp. 1–6

2023

[65] [65]

On the multivariate Laplace distribution

Torbjørn Eltoft, Taesu Kim, and Te-Won Lee. “On the multivariate Laplace distribution” . In: IEEE Signal Processing Letters 13.5 (2006), pp. 300–303

2006

[66] [66]

Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention

Lev Evtodienko. “Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention” . In: CoRR abs/2111.05890 (2021). arXiv: 2111.05890. url: https://arxiv.org/abs/2111.05890

work page arXiv 2021

[67] [67]

Emotion recognition from unimodal to multimodal analysis: A review

Kaouther Ezzameli and Hela Mahersia. “Emotion recognition from unimodal to multimodal analysis: A review” . In: Information Fusion 99 (2023), p. 101847

2023

[68] [68]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. “Eva: Exploring the limits of masked visual representation learning at scale” . In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition . 2023, pp. 19358–19369

2023

[69] [69]

Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition

Pasquale Foggia, Antonio Greco, Alessia Saggese, and Mario Vento. “Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition” . In: Engineering Applications of Artificial Intelligence 118 (2023), p. 105651

2023

[70] [70]

Emoclip: A vision-language method for zero-shot video facial expression recognition

Niki Maria Foteinopoulou and Ioannis Patras. “Emoclip: A vision-language method for zero-shot video facial expression recognition” . In: 2024 IEEE 18th Interna- tional Conference on Automatic Face and Gesture Recognition (FG). IEEE. 2024, pp. 1–10

2024

[71] [71]

Emotion experience

Nico Frijda. “Emotion experience” . In: Cognition and Emotion 19.4 (2005), pp. 473–

2005

[72] [72]

doi: 10.1080/02699930441000346

work page doi:10.1080/02699930441000346

[73] [73]

Percep- tion of expressed emotion among persons with mental illness

Sailaxmi Gandhi, Narayanasamy Padmavathi, Rajil Raveendran, Prabhu Jad- hav, Maya Sahu, Jothimani Gurusamy, and Krishna Prasad Muliyala. “Percep- tion of expressed emotion among persons with mental illness” . In: Journal of Psychosocial Rehabilitation and Mental Health 7 (2020), pp. 121–130. BIBLIOGRAPHY 183

2020

[74] [74]

Multimodal and temporal perception of audio-visual cues for emotion recognition

Esam Ghaleb, Mirela Popa, and Stylianos Asteriadis. “Multimodal and temporal perception of audio-visual cues for emotion recognition” . In: 2019 8th Interna- tional Conference on Affective Computing and Intelligent Interaction (ACII) . IEEE. 2019, pp. 552–558

2019

[75] [75]

Automatic group affect anal- ysis in images via visual attribute and feature networks

Shreya Ghosh, Abhinav Dhall, and Nicu Sebe. “Automatic group affect anal- ysis in images via visual attribute and feature networks” . In: 2018 25th IEEE International Conference on Image Processing (ICIP) . IEEE. 2018, pp. 1967– 1971

2018

[76] [76]

Predicting group cohesiveness in images

Shreya Ghosh, Abhinav Dhall, Nicu Sebe, and Tom Gedeon. “Predicting group cohesiveness in images” . In: 2019 International Joint Conference on Neural Net- works (IJCNN) . IEEE. 2019, pp. 1–8

2019

[77] [77]

Dynamical variational autoencoders: A comprehensive review

Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. “Dynamical variational autoencoders: A comprehensive review” . In: arXiv preprint arXiv:2008.12595 (2020)

work page arXiv 2008

[78] [78]

A Hybrid Fusion Model for Group-Level Emotion Recogni- tion in Complex Scenarios

Wenjuan Gong, Yifan Wang, Yikai Wu, Shuaipeng Gao, Athanasios V Vasilakos, and Peiying Zhang. “A Hybrid Fusion Model for Group-Level Emotion Recogni- tion in Complex Scenarios” . In: Information Sciences (2025), p. 121968

2025

[79] [79]

Generative adver- sarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adver- sarial networks” . In: Communications of the ACM 63.11 (2020), pp. 139–144

2020

[80] [80]

Challenges in representation learning: A report on three machine learning contests

Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. “Challenges in representation learning: A report on three machine learning contests” . In: Neural information processing: 20th international confer- ence, ICONIP 2013, daegu, korea, november 3-7, ...

2013