EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
Pith reviewed 2026-05-23 03:40 UTC · model grok-4.3
The pith
A new benchmark reveals that even leading multimodal AI models score far below humans on emotional intelligence tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmoBench-M evaluates MLLMs across 13 scenarios in three hierarchical dimensions grounded in psychological theories. It reveals a substantial gap relative to human-level competence: even the strongest models, Gemini-3.0-Pro and GPT-5.2, reach only 70.5 and 66.5 points respectively.
What carries the argument
EmoBench-M benchmark of 13 evaluation scenarios spanning foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis.
If this is right
- MLLMs will need additional training focused on dynamic multimodal emotional cues to approach human competence.
- Specialized emotion models show uneven coverage and will require broader integration to achieve consistent results.
- Robotic and interactive AI systems that rely on emotional intelligence remain constrained until the measured gaps close.
- The benchmark supplies concrete metrics that can track incremental progress in model emotional capabilities.
Where Pith is reading between the lines
- Improvements tracked by this benchmark could translate directly into more natural responses during extended human-AI conversations.
- Adding live video and audio streams to future versions of the benchmark would test whether models handle temporal emotional shifts.
- Hybrid training that combines general multimodal pretraining with targeted emotional data may address the uneven performance seen in specialized models.
Load-bearing premise
The chosen 13 scenarios accurately represent the dynamic, context-dependent, and multimodal nature of emotional expressions that real interactions involve.
What would settle it
An MLLM that scores at or above average human performance across all 13 EmoBench-M scenarios while preserving accuracy on unrelated multimodal tasks.
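The settling condition above can be written as a simple predicate. The following is a minimal sketch, not anything taken from the paper: the function name `settles_claim`, the dictionary layout, and the `tolerance` parameter are all hypothetical, and the paper does not specify how "preserving accuracy on unrelated multimodal tasks" should be measured.

```python
from typing import Mapping


def settles_claim(
    model: Mapping[str, float],
    human: Mapping[str, float],
    other_before: Mapping[str, float],
    other_after: Mapping[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True if the model matches humans on every scenario
    and does not regress on unrelated tasks.

    model / human: per-scenario scores keyed by scenario name (hypothetical layout).
    other_before / other_after: scores on unrelated multimodal tasks,
    measured before and after any emotion-focused training (an assumption).
    tolerance: largest drop on an unrelated task still counted as "preserved".
    """
    if set(model) != set(human):
        raise ValueError("model and human scores must cover the same scenarios")
    if set(other_before) != set(other_after):
        raise ValueError("unrelated-task scores must cover the same tasks")

    # The condition is per scenario, not on the average: one weak scenario fails it.
    matches_humans = all(model[s] >= human[s] for s in human)
    preserves_other = all(
        other_after[t] >= other_before[t] - tolerance for t in other_before
    )
    return matches_humans and preserves_other
```

Checking every scenario rather than the mean reflects the wording "across all 13 EmoBench-M scenarios"; a model could exceed the human average while still failing individual scenarios.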
Original abstract
With the integration of multimodal large language models (MLLMs) into robotic systems and AI applications, embedding emotional intelligence (EI) capabilities is essential for enabling these models to perceive, interpret, and respond to human emotions effectively in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real interactions and fail to capture the dynamic, context-dependent nature of emotional expressions, rendering them inadequate for evaluating MLLMs' EI capabilities. To address these limitations, we introduce EmoBench-M, a systematic benchmark grounded in established psychological theories, designed to evaluate MLLMs across 13 evaluation scenarios spanning three hierarchical dimensions: foundational emotion recognition (FER), conversational emotion understanding (CEU), and socially complex emotion analysis (SCEA). Evaluation was conducted on 27 state-of-the-art MLLMs, using both objective task-specific metrics and LLM-based evaluation, revealing a substantial performance gap relative to human-level competence. Even the best performing models, Gemini-3.0-Pro and GPT-5.2, achieve the highest scores on EmoBench-M, 70.5 and 66.5 points respectively. Specialized models such as AffectGPT exhibit uneven performance across EmoBench-M, demonstrating strengths in certain scenarios but generally lacking comprehensive emotional intelligence. By providing a comprehensive, multimodal evaluation framework, EmoBench-M captures both the strengths and weaknesses of current MLLMs across diverse emotional contexts. All benchmark resources, including datasets and code, are publicly available at https://emo-gml.github.io/, facilitating further research and advancement in MLLM emotional intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EmoBench-M, a multimodal benchmark for emotional intelligence in MLLMs. It comprises 13 scenarios across three hierarchical dimensions grounded in psychological theories: foundational emotion recognition (FER), conversational emotion understanding (CEU), and socially complex emotion analysis (SCEA). The authors evaluate 27 state-of-the-art MLLMs using objective task-specific metrics and LLM-based evaluation, report that the top models (Gemini-3.0-Pro at 70.5 and GPT-5.2 at 66.5) lag human performance, and release all datasets and code publicly.
Significance. If the scenarios are shown through validation to capture dynamic, context-dependent multimodal EI beyond existing static benchmarks, the work would offer a useful evaluation framework for MLLM capabilities relevant to real-world applications. The public release of datasets and code is a clear strength that supports reproducibility and community follow-up.
major comments (2)
- [Abstract] The central claim of a substantial performance gap (Gemini-3.0-Pro at 70.5, GPT-5.2 at 66.5 vs. human) depends on the 13 scenarios accurately measuring the 'dynamic, context-dependent nature of emotional expressions' that prior benchmarks miss. The manuscript asserts grounding in psychological theories for the three dimensions but supplies no details on theory-to-scenario mapping, how temporal video dynamics or multimodality are operationalized, inter-annotator agreement for human baselines, or any empirical validation (e.g., correlation with established EI instruments). This validation information is load-bearing for interpreting the reported gap.
- [Evaluation section] The manuscript employs both objective metrics and LLM-based evaluation, yet provides no analysis of how the LLM evaluator was selected or any checks for circularity risk (evaluator models potentially sharing the same EI limitations as the evaluated models). This affects confidence in the aggregate scores used to support the performance-gap conclusion.
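For the inter-annotator agreement raised in the first comment, one widely used statistic for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch of that statistic; the function name and the emotion labels are illustrative placeholders, not data or code from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: product of each annotator's label frequencies, summed.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Placeholder labels for illustration only.
    annotator_1 = ["joy", "joy", "anger", "sad"]
    annotator_2 = ["joy", "anger", "anger", "sad"]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.636
```

Kappa applies to exactly two annotators with categorical labels; with more annotators or free-text answers, a different agreement measure would be needed.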
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive suggestions. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.
Point-by-point responses
- Referee: [Abstract] The central claim of a substantial performance gap (Gemini-3.0-Pro at 70.5, GPT-5.2 at 66.5 vs. human) depends on the 13 scenarios accurately measuring the 'dynamic, context-dependent nature of emotional expressions' that prior benchmarks miss. The manuscript asserts grounding in psychological theories for the three dimensions but supplies no details on theory-to-scenario mapping, how temporal video dynamics or multimodality are operationalized, inter-annotator agreement for human baselines, or any empirical validation (e.g., correlation with established EI instruments). This validation information is load-bearing for interpreting the reported gap.
  Authors: We agree that the manuscript would be strengthened by providing more explicit details on these aspects. In the revised manuscript, we have expanded the 'Benchmark Construction' section to include a detailed mapping of each scenario to the relevant psychological theories, such as linking FER scenarios to Ekman's basic emotion theory and SCEA to more complex models like the component process model. We have also added explanations of how temporal dynamics are operationalized through the use of video sequences that capture emotion evolution over time and how multimodality is incorporated by integrating visual, auditory, and textual modalities in the scenarios. Furthermore, we now report inter-annotator agreement metrics (e.g., Cohen's kappa) for the human baseline annotations. However, we did not perform correlations with established EI instruments like the MSCEIT, as this would necessitate an extensive additional human study not included in the current work. We view the theory grounding and human performance comparisons as sufficient for the benchmark's validity but acknowledge this as a potential area for future work.
  Revision: partial
- Referee: [Evaluation section] The manuscript employs both objective metrics and LLM-based evaluation, yet provides no analysis of how the LLM evaluator was selected or any checks for circularity risk (evaluator models potentially sharing the same EI limitations as the evaluated models). This affects confidence in the aggregate scores used to support the performance-gap conclusion.
  Authors: We thank the referee for highlighting this important methodological point. In the updated manuscript, we have added a subsection under 'Evaluation Methodology' that details the selection criteria for the LLM evaluator, including its superior performance on related emotion understanding tasks and its general capabilities. To address circularity concerns, we conducted additional experiments using two different LLM evaluators and compared their scores with the objective metrics. The results show high consistency across evaluators, and the objective metrics (which do not rely on LLMs) align well with the LLM-based evaluations, supporting the reliability of the aggregate scores. We have included these analyses and a discussion of potential limitations in the revised paper.
  Revision: yes
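The consistency check described in the second response (two LLM evaluators compared with each other and with objective metrics) could be quantified with a rank correlation over per-model scores. The following is a minimal sketch using Spearman's rho; the rebuttal does not say which statistic was used, so the choice of Spearman and the function names are assumptions.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Here `x` and `y` would each hold one score per evaluated model, for example from evaluator A and evaluator B, or from an LLM evaluator and an objective metric. A high rho shows the evaluators rank models similarly, but it does not by itself rule out a bias shared by both evaluators.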
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
Full rationale
The paper presents EmoBench-M as a new multimodal benchmark with 13 scenarios across the FER, CEU, and SCEA dimensions, evaluates 27 MLLMs using objective metrics and LLM-based scoring, and reports empirical results such as Gemini-3.0-Pro at 70.5 and GPT-5.2 at 66.5 versus human baselines. It contains no equations, parameter fittings, uniqueness theorems, or derivation chains that could reduce outputs to inputs by construction. The claimed grounding in psychological theories is descriptive rather than derived, and the work is self-contained as an empirical evaluation with public artifacts. There is no load-bearing self-citation and no smuggled ansatz.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Established psychological theories provide a valid grounding for the three hierarchical dimensions of emotional intelligence.
Forward citations
Cited by 3 Pith papers
- EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
  EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
- EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
  EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
- AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
  AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.