
arxiv: 2502.04424 · v4 · submitted 2025-02-06 · 💻 cs.CL · cs.AI

EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

Pith reviewed 2026-05-23 03:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords emotional intelligence · multimodal large language models · benchmark evaluation · emotion recognition · conversational understanding · social emotion analysis · AI capability assessment

The pith

A new benchmark reveals that even leading multimodal AI models score far below humans on emotional intelligence tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmoBench-M to evaluate how well multimodal large language models perceive, interpret, and respond to human emotions in realistic settings. Prior benchmarks rely on static text or simple image-text pairs that miss the dynamic, context-sensitive, multimodal character of actual interactions. The new benchmark applies 13 scenarios drawn from psychological theories and divides them into three levels: basic emotion recognition, conversational understanding, and complex social analysis. When 27 current models are tested, the two strongest, Gemini-3.0-Pro and GPT-5.2, reach only 70.5 and 66.5 points respectively, well short of human performance. This gap indicates that present systems remain limited for applications that require reliable emotional awareness.

Core claim

EmoBench-M evaluates MLLMs across 13 scenarios in three hierarchical dimensions grounded in psychological theories, revealing a substantial performance gap relative to human-level competence even for the two strongest models, Gemini-3.0-Pro and GPT-5.2, which reach only 70.5 and 66.5 points respectively.

What carries the argument

The EmoBench-M benchmark itself: 13 evaluation scenarios spanning foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis.

If this is right

  • MLLMs will need additional training focused on dynamic multimodal emotional cues to approach human competence.
  • Specialized emotion models show uneven coverage and will require broader integration to achieve consistent results.
  • Robotic and interactive AI systems that rely on emotional intelligence remain constrained until the measured gaps close.
  • The benchmark supplies concrete metrics that can track incremental progress in model emotional capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improvements tracked by this benchmark could translate directly into more natural responses during extended human-AI conversations.
  • Adding live video and audio streams to future versions of the benchmark would test whether models handle temporal emotional shifts.
  • Hybrid training that combines general multimodal pretraining with targeted emotional data may address the uneven performance seen in specialized models.

Load-bearing premise

The chosen 13 scenarios accurately represent the dynamic, context-dependent, and multimodal nature of emotional expressions that real interactions involve.

What would settle it

An MLLM that scores at or above average human performance across all 13 EmoBench-M scenarios while preserving accuracy on unrelated multimodal tasks.

Figures

Figures reproduced from arXiv: 2502.04424 by Fei Ma, Fei Richard Yu, He Hu, Hongbo Xu, Laizhong Cui, Lianzhong You, Qianning Wang, Yucheng Zhou, Zebang Cheng, Zheng Lian.

Figure 1: Taxonomy for Evaluating Emotion Intelligence (EI) Capabilities of Multimodal Large Language Models (MLLMs): The diagram outlines the categories of “Foundational Emotion Recognition”, “Conversational Emotion Understanding”, and “Socially Complex Emotion Analysis” along with their respective evaluation scenarios. It also presents a performance comparison of different methods on the proposed dataset EmoBench-…
Figure 2: Data Filtering and Label Verification Process. Bar charts show original dataset label (red) and label from our reviewers (blue).
Body text extracted with this figure: … students with research experience in affective computing, following a unified annotation guideline. Let s be a sample from an initial dataset D_initial, and let y_orig(s) be its original label. Each reviewer i assigned an emotion label v_i(s) from the set of possible labels L. We f…
Figure 3: Confusion matrices for Gemini-3.0-Pro on each evaluation scenario of EmoBench-M.
Results table fragment extracted with this figure (last row truncated in the source):

    Method                     FER    CEU    SCEA   Avg.
    Open-Source Model
    Random                     23.1   19.8   33.3   25.4
    InternVL2.5-4B             54.5   49.3   49.0   50.9
    Video-LLaMA2-7B            45.4   34.5   61.3   47.1
    Qwen2-Audio-7B-Instruct    59.9   43.3   55.7   53.0
    Video-LLaMA2.1-7B-16F      50.9   46.1   57.5   51.5
    Video-LLaMA2.1-7B-AV       50.4   46.1   49.5   48.7
    LongVA-DPO-7B              45.7   32.1   53.5   43.8
    Qwen2.5-VL-Instruct-7B     51.…
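
The table fragment extracted with Figure 3 lists FER, CEU, SCEA and Avg. columns. A quick check, sketched below in Python, suggests that in the rows that survive intact, Avg. matches the unweighted mean of the three dimension scores to within rounding. This is an observation about the visible rows only, not a statement of how the paper defines its aggregate score; the helper names `mean_of_dimensions` and `max_rounding_gap` are made up for illustration.

```python
# Rows copied from the table fragment extracted with Figure 3:
# (FER, CEU, SCEA, reported Avg.). The truncated last row is left out.
ROWS = {
    "Random": (23.1, 19.8, 33.3, 25.4),
    "InternVL2.5-4B": (54.5, 49.3, 49.0, 50.9),
    "Video-LLaMA2-7B": (45.4, 34.5, 61.3, 47.1),
    "Qwen2-Audio-7B-Instruct": (59.9, 43.3, 55.7, 53.0),
    "Video-LLaMA2.1-7B-16F": (50.9, 46.1, 57.5, 51.5),
    "Video-LLaMA2.1-7B-AV": (50.4, 46.1, 49.5, 48.7),
    "LongVA-DPO-7B": (45.7, 32.1, 53.5, 43.8),
}


def mean_of_dimensions(fer, ceu, scea):
    """Unweighted mean of the three dimension scores (an assumed definition)."""
    return (fer + ceu + scea) / 3


def max_rounding_gap(rows):
    """Largest absolute difference between the computed mean and the reported Avg."""
    return max(
        abs(mean_of_dimensions(fer, ceu, scea) - reported)
        for fer, ceu, scea, reported in rows.values()
    )


for name, (fer, ceu, scea, reported) in ROWS.items():
    computed = mean_of_dimensions(fer, ceu, scea)
    print(f"{name:26s} computed={computed:6.2f} reported={reported:5.1f}")

print(f"largest gap: {max_rounding_gap(ROWS):.3f}")
```

For the seven complete rows the largest gap stays below 0.05, which is what one-decimal rounding of a simple mean would produce.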
Figure 4: Example of CH-SIMS dataset.
Question (Speech): The person in video says: I think it's really good. Determine the emotion conveyed. Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. neutral B. negative C. positive.
Figure 5: Example of CMU-MOSEI dataset.
Question (Speech): The person in video says: IM JUST KINDA LIKE NO THIS IS GOING TO FAR. Determine the emotion conveyed. Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. neutral B. negative C. positive.
Figure 6: Example of CMU-MOSI dataset.
Figure 7: Example of FMSA-SC dataset.
Question (Dialogue): The person in video says: How did you even get in, you weirdo? -Yeah, yeah, really? Analyze the emotion and intent. Choose one emotion: A. happy B. surprise C. sad D. disgust E. anger F. fear G. neutral. Choose one intent: A. questioning B. agreeing C. acknowledging D. encouraging E. consoling F. suggesting G. wishing H. neutral.
Figure 8: Example of MC-EIU dataset.
Question (Dialogue): The person in video says: Who told you that? Please watch the provided video and determine the emotion it conveys. Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. neutral B. surprise C. fear D. sadness E. joy F. disgust G. anger.
Figure 9: Example of MELD dataset.
Figure 10: Example of MER2023 dataset.
Question (Dialogue): The person in the video says: There a new girlfriend in there? Cause you might need one. Does this statement express sarcasm? Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. true B. false.
Figure 11: Example of MUStARD dataset.
Question (Song): Please watch the provided video and determine the emotion it conveys. Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. neutral B. calm C. happy D. sad E. angry F. fearful.
Figure 12: Example of RAVDSS-song dataset.
Figure 13: Example of CH-SIMSv2 dataset.
Question (Speech): Please watch the provided video and determine the emotion it conveys. Do not provide any additional explanations or extra content. Choose one of the following labels as your final answer: A. neutral B. calm C. happy D. sad E. angry F. fearful G. surprised H. disgust.
Figure 14: Example of RAVDSS-speech dataset.
The context sentences in the video is: that makes a good cartoon, but what are you going to do with a flat track pad those square things there's nothing i can do as a cartoonist, well i know the world is flat now, that's true, and the internet has reached every corner of the world the poorest the remotest places. The punchline sentence in the video is: every village in af…
Figure 15: Example of UR-FUNNY dataset.
Figure 16: Example of SMILE dataset.
Figure 17: Confusion matrices for GPT-5.2 on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 18: Confusion matrices for Gemini-3.0-Flash on each evaluation scenario of EmoBench-M.
Figure 19: Confusion matrices for Gemini-2.0-Flash on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 20: Confusion matrices for Gemini-1.5-Flash on each evaluation scenario of EmoBench-M.
Figure 21: Confusion matrices for Gemini-2.0-Flash-Thinking on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 22: Confusion matrices for GLM-4V-PLUS on each evaluation scenario of EmoBench-M.
Figure 23: Confusion matrices for InternVL2.5-4B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 24: Confusion matrices for InternVL2.5-8B on each evaluation scenario of EmoBench-M.
Figure 25: Confusion matrices for InternVL2.5-38B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 26: Confusion matrices for InternVL2.5-78B on each evaluation scenario of EmoBench-M.
Figure 27: Confusion matrices for InternVideo2-Chat-8B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 28: Confusion matrices for LongVA-7B-DPO on each evaluation scenario of EmoBench-M.
Figure 29: Confusion matrices for MiniCPM-V-2.6-8B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 30: Confusion matrices for Qwen2-Audio-7B-Instruct on each evaluation scenario of EmoBench-M.
Figure 31: Confusion matrices for VideoLLaMA2-7B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 32: Confusion matrices for VideoLLaMA2-7B-16F on each evaluation scenario of EmoBench-M.
Figure 33: Confusion matrices for VideoLLaMA2-72B on each evaluation scenario of EmoBench-M.
[Confusion-matrix image text; panel (a) SOER, classes: angry, calm, fearful, happy, neutral, sad.]
Figure 34: Confusion matrices for VideoLLaMA2.1-7B-16F on each evaluation scenario of EmoBench-M.
Figure 35: Confusion matrices for VideoLLaMA2.1-7B-AV on each evaluation scenario of EmoBench-M.
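
To show how the confusion-matrix figures can be read, the sketch below uses the one 6×6 block of cell values that was extracted in text form next to the Figure 17 caption (panel (a) SOER). Two assumptions are made and are not confirmed by the page: that rows are true labels and columns are predictions, and that the block belongs to the figure it was extracted beside. The rows summing to roughly 1 is consistent with row normalisation but does not prove the orientation. The function names are illustrative only.

```python
# Class order as it appears in the extracted panel (a) SOER axis labels.
LABELS = ["angry", "calm", "fearful", "happy", "neutral", "sad"]

# Cell values as extracted; ASSUMED layout: rows = true label, columns = predicted.
MATRIX = [
    [0.75, 0.12, 0.07, 0.00, 0.04, 0.02],
    [0.00, 0.73, 0.00, 0.14, 0.04, 0.08],
    [0.10, 0.10, 0.41, 0.04, 0.05, 0.31],
    [0.00, 0.08, 0.01, 0.90, 0.00, 0.00],
    [0.00, 0.75, 0.02, 0.01, 0.09, 0.12],
    [0.00, 0.30, 0.02, 0.02, 0.01, 0.64],
]


def per_class_recall(matrix, labels):
    """In a row-normalised matrix, the diagonal entry is the recall of that class."""
    return {label: matrix[i][i] for i, label in enumerate(labels)}


def top_confusions(matrix, labels):
    """For each true class, return the most frequent wrong prediction and its rate."""
    result = {}
    for i, row in enumerate(matrix):
        off_diagonal = [(value, j) for j, value in enumerate(row) if j != i]
        value, j = max(off_diagonal)
        result[labels[i]] = (labels[j], value)
    return result


recall = per_class_recall(MATRIX, LABELS)
confusions = top_confusions(MATRIX, LABELS)
for label in LABELS:
    wrong_label, rate = confusions[label]
    print(f"{label:8s} recall={recall[label]:.2f} most confused with {wrong_label} ({rate:.2f})")
```

Under these assumptions the neutral row stands out: 0.75 of neutral samples are predicted as calm and only 0.09 are predicted as neutral.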
Original abstract

With the integration of multimodal large language models (MLLMs) into robotic systems and AI applications, embedding emotional intelligence (EI) capabilities is essential for enabling these models to perceive, interpret, and respond to human emotions effectively in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real interactions and fail to capture the dynamic, context-dependent nature of emotional expressions, rendering them inadequate for evaluating MLLMs' EI capabilities. To address these limitations, we introduce EmoBench-M, a systematic benchmark grounded in established psychological theories, designed to evaluate MLLMs across 13 evaluation scenarios spanning three hierarchical dimensions: foundational emotion recognition (FER), conversational emotion understanding (CEU), and socially complex emotion analysis (SCEA). Evaluation was conducted on 27 state-of-the-art MLLMs, using both objective task-specific metrics and LLM-based evaluation, revealing a substantial performance gap relative to human-level competence. Even the best performing models, Gemini-3.0-Pro and GPT-5.2, achieve the highest scores on EmoBench-M, 70.5 and 66.5 points respectively. Specialized models such as AffectGPT exhibit uneven performance across EmoBench-M, demonstrating strengths in certain scenarios but generally lacking comprehensive emotional intelligence. By providing a comprehensive, multimodal evaluation framework, EmoBench-M captures both the strengths and weaknesses of current MLLMs across diverse emotional contexts. All benchmark resources, including datasets and code, are publicly available at https://emo-gml.github.io/, facilitating further research and advancement in MLLM emotional intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces EmoBench-M, a multimodal benchmark for emotional intelligence in MLLMs. It comprises 13 scenarios across three hierarchical dimensions grounded in psychological theories: foundational emotion recognition (FER), conversational emotion understanding (CEU), and socially complex emotion analysis (SCEA). The authors evaluate 27 state-of-the-art MLLMs using objective task-specific metrics and LLM-based evaluation, report that the top models (Gemini-3.0-Pro at 70.5 and GPT-5.2 at 66.5) lag human performance, and release all datasets and code publicly.

Significance. If the scenarios are shown through validation to capture dynamic, context-dependent multimodal EI beyond existing static benchmarks, the work would offer a useful evaluation framework for MLLM capabilities relevant to real-world applications. The public release of datasets and code is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [Abstract] The central claim of a substantial performance gap (Gemini-3.0-Pro at 70.5, GPT-5.2 at 66.5 vs. human) depends on the 13 scenarios accurately measuring the 'dynamic, context-dependent nature of emotional expressions' that prior benchmarks miss. The manuscript asserts grounding in psychological theories for the three dimensions but supplies no details on theory-to-scenario mapping, how temporal video dynamics or multimodality are operationalized, inter-annotator agreement for human baselines, or any empirical validation (e.g., correlation with established EI instruments). This validation information is load-bearing for interpreting the reported gap.
  2. [Evaluation section] The manuscript employs both objective metrics and LLM-based evaluation, yet provides no analysis of how the LLM evaluator was selected or any checks for circularity risk (evaluator models potentially sharing the same EI limitations as the evaluated models). This affects confidence in the aggregate scores used to support the performance-gap conclusion.
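
The first comment asks for inter-annotator agreement on the human baselines. As a minimal sketch of what such a number looks like, the Python below computes Cohen's kappa for two annotators from scratch. The label sequences are invented for illustration and are not from the paper; with more than two annotators a multi-rater statistic would be needed instead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of samples given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labelled independently at their own label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for eight clips (not data from the paper).
annotator_1 = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg"]
annotator_2 = ["pos", "neg", "neg", "neu", "neg", "pos", "pos", "neg"]

print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Here the annotators agree on 6 of 8 clips (0.75 raw agreement), but after correcting for chance the kappa is about 0.61, which is why raw agreement alone can overstate label reliability.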

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a substantial performance gap (Gemini-3.0-Pro at 70.5, GPT-5.2 at 66.5 vs. human) depends on the 13 scenarios accurately measuring the 'dynamic, context-dependent nature of emotional expressions' that prior benchmarks miss. The manuscript asserts grounding in psychological theories for the three dimensions but supplies no details on theory-to-scenario mapping, how temporal video dynamics or multimodality are operationalized, inter-annotator agreement for human baselines, or any empirical validation (e.g., correlation with established EI instruments). This validation information is load-bearing for interpreting the reported gap.

    Authors: We agree that the manuscript would be strengthened by providing more explicit details on these aspects. In the revised manuscript, we have expanded the 'Benchmark Construction' section to include a detailed mapping of each scenario to the relevant psychological theories, such as linking FER scenarios to Ekman's basic emotion theory and SCEA to more complex models like the component process model. We have also added explanations of how temporal dynamics are operationalized through the use of video sequences that capture emotion evolution over time and how multimodality is incorporated by integrating visual, auditory, and textual modalities in the scenarios. Furthermore, we now report inter-annotator agreement metrics (e.g., Cohen's kappa) for the human baseline annotations. However, we did not perform correlations with established EI instruments like the MSCEIT, as this would necessitate an extensive additional human study not included in the current work. We view the theory grounding and human performance comparisons as sufficient for the benchmark's validity but acknowledge this as a potential area for future work. revision: partial

  2. Referee: [Evaluation section] The manuscript employs both objective metrics and LLM-based evaluation, yet provides no analysis of how the LLM evaluator was selected or any checks for circularity risk (evaluator models potentially sharing the same EI limitations as the evaluated models). This affects confidence in the aggregate scores used to support the performance-gap conclusion.

    Authors: We thank the referee for highlighting this important methodological point. In the updated manuscript, we have added a subsection under 'Evaluation Methodology' that details the selection criteria for the LLM evaluator, including its superior performance on related emotion understanding tasks and its general capabilities. To address circularity concerns, we conducted additional experiments using two different LLM evaluators and compared their scores with the objective metrics. The results show high consistency across evaluators, and the objective metrics (which do not rely on LLMs) align well with the LLM-based evaluations, supporting the reliability of the aggregate scores. We have included these analyses and a discussion of potential limitations in the revised paper. revision: yes
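
The consistency check described in the second response can be pictured as a correlation between score vectors. The sketch below assumes Pearson correlation and uses invented per-model scores for two hypothetical LLM evaluators and one objective metric; this page does not say which statistic, evaluators, or scales the authors actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five evaluated models (illustrative values only).
evaluator_a = [3.2, 4.1, 2.5, 3.8, 4.6]       # LLM judge A, 1-5 scale
evaluator_b = [3.0, 4.3, 2.7, 3.6, 4.4]       # LLM judge B, 1-5 scale
objective = [55.0, 68.0, 47.0, 61.0, 70.5]    # LLM-free task metric

print(f"judge A vs judge B:   r = {pearson(evaluator_a, evaluator_b):.3f}")
print(f"judge A vs objective: r = {pearson(evaluator_a, objective):.3f}")
print(f"judge B vs objective: r = {pearson(evaluator_b, objective):.3f}")
```

High agreement between two LLM judges alone would not rule out shared blind spots, which is why comparing each judge against an LLM-free metric, as the authors describe, is the more informative part of the check.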

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

Full rationale

The paper presents EmoBench-M as a new multimodal benchmark with 13 scenarios across the FER, CEU, and SCEA dimensions, evaluates 27 MLLMs using objective metrics and LLM-based scoring, and reports empirical results such as Gemini-3.0-Pro at 70.5 and GPT-5.2 at 66.5 versus human baselines. The paper contains no equations, parameter fits, uniqueness theorems, or derivation chains that could reduce outputs to inputs by construction. Its assertions of grounding in psychological theories are descriptive rather than derived, and the work is self-contained as an empirical evaluation with public artifacts; no load-bearing self-citation or smuggled ansatz is present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that psychological theories validly define the three dimensions and that the chosen scenarios measure them; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Established psychological theories provide a valid grounding for the three hierarchical dimensions of emotional intelligence.
    The benchmark is explicitly designed around these theories as stated in the abstract.

pith-pipeline@v0.9.0 · 5850 in / 1041 out tokens · 33577 ms · 2026-05-23T03:40:27.750471+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

  2. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.

  3. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
