
arxiv: 2607.02089 · v1 · pith:CQJOFFPJnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

ESC: Emotional Self-Correction for Reliable Vision-Language Models

Pith reviewed 2026-07-03 21:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.MM
keywords vision-language models · self-correction · emotional cues · training-free methods · model reliability · hallucination mitigation

The pith

Emotional signals trigger self-correction in vision-language models without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that emotional cues can activate latent self-correction behaviors already present in VLMs. An external verifier spots likely errors in an initial response and feeds back emotional language that prompts the model to reflect and revise. This produces more reliable outputs on safety, hallucination, perception, and reasoning tasks while keeping overall performance intact and requiring no extra training or parameter changes.

Core claim

Emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning; the resulting ESC framework uses an external verifier to detect incorrect initial responses and injects emotional feedback so the VLM produces a better revised answer without additional training.

What carries the argument

ESC (Emotional Self-Correction) framework: an external verifier detects potentially incorrect responses and injects emotional feedback to prompt reflection and revision.
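
The verify-then-revise loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `esc_respond`, the `generate` and `looks_incorrect` callables, the cue wording, and the revision prompt template are all hypothetical, and the image input is omitted for brevity.

```python
from typing import Callable

# Hypothetical cue wording; the paper's actual emotional prompts are not
# reproduced in this summary.
EMOTIONAL_CUE = (
    "I feel worried that this answer may be wrong. "
    "Please reflect carefully and revise it."
)


def esc_respond(
    question: str,
    generate: Callable[[str], str],
    looks_incorrect: Callable[[str, str], bool],
    cue: str = EMOTIONAL_CUE,
) -> str:
    """Return the initial answer, or a revised one if the verifier flags it."""
    initial = generate(question)
    if not looks_incorrect(question, initial):
        return initial  # verifier accepts the answer: no revision pass
    # Assumed template: emotional cue first, then the question and first answer.
    revision_prompt = (
        f"{cue}\n\nQuestion: {question}\n"
        f"Your previous answer: {initial}\nRevised answer:"
    )
    return generate(revision_prompt)


# Toy stand-ins for the VLM and the external verifier (illustration only).
def toy_generate(prompt: str) -> str:
    return "4" if "Revised answer:" in prompt else "5"


def toy_verifier(question: str, answer: str) -> bool:
    return answer != "4"


print(esc_respond("What is 2 + 2?", toy_generate, toy_verifier))
```

No model weights are touched anywhere in this loop, which is the sense in which the approach is training-free: the only intervention is a second generation call with a modified prompt.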

If this is right

  • VLMs gain reliability on safety, hallucination, and reasoning benchmarks without any retraining or added parameters.
  • Emotion functions as a practical control signal that scales self-correction across multiple VLM tasks.
  • Model utility stays intact while error rates drop, showing the method does not trade one capability for another.
  • The approach opens a training-free route to more cautious reasoning in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same emotional-trigger idea could be tested on language-only models to check whether the effect depends on vision input.
  • Different emotional tones (calm versus urgent) might produce measurably different revision quality; this remains untested in the paper.
  • If the verifier itself is a smaller model, the whole pipeline could run locally and reduce reliance on large external judges.

Load-bearing premise

An external verifier can accurately detect potentially incorrect initial responses, and injecting emotional feedback will reliably cause the VLM to produce a better revised response.

What would settle it

Run the same initial responses through ESC but replace emotional feedback with neutral or factual prompts and measure whether accuracy gains disappear or shrink substantially.
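
One way to score such a comparison is sketched below, assuming per-item correctness flags have already been collected for the same benchmark items under both feedback conditions. The function names and the toy flags are hypothetical placeholders, not results from the paper.

```python
from typing import Sequence


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one item")
    return sum(correct_flags) / len(correct_flags)


def feedback_gap(
    emotional_correct: Sequence[bool],
    neutral_correct: Sequence[bool],
) -> float:
    """Accuracy under emotional feedback minus accuracy under neutral feedback.

    Both sequences must cover the same items in the same order.
    """
    if len(emotional_correct) != len(neutral_correct):
        raise ValueError("conditions must be scored on the same items")
    return accuracy(emotional_correct) - accuracy(neutral_correct)


# Placeholder flags for four items (illustration only, not paper data).
emotional = [True, True, True, False]
neutral = [True, False, True, False]
print(feedback_gap(emotional, neutral))
```

A gap near zero would suggest that the revision pass itself, rather than the emotional wording, accounts for the gains; a clearly positive gap would support the paper's claim.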

Figures

Figures reproduced from arXiv: 2607.02089 by Cuong Tuan Nguyen, Dat Nguyen, Hoang M. Le, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Minh-Nhat Nguyen, Min Xu, Nguyen Nhat Huy, Phat Kim Huynh, Thanh-Huy Nguyen, Tien-Huy Nguyen, Ulas Bagci.

Figure 1: Comparison of ESC against VLMs [50, 92] across diverse benchmarks.
Figure 2: Overview of ESC and comparison with existing self-correction
Figure 3: Emotional context reduces ASR across all 5 VLMs. Left: qualitative example showing how emotional self-expression shifts LLaVA-1.5-7B from compliance to refusal. Right: ASR comparison between neutral and emotionally-cued queries on VLSafe [10]. 3.2 How do different emotional states shape VLMs’ behavior more broadly? Having established that emotional context affects VLM safety, we adopt Russell’s Circumplex…
Figure 4: Effect of emotional states on ASR across five VLMs on VLSafe [10] Benchmark. All four emotional quadrants [67, 71] reduce ASR relative to the neutral baseline, with negative-valence prompts yielding consistently larger reductions.
Figure 5: Scenario-wise ASR on MMSafetyBench (averaged over SD, SD+Typo, and Typo image types) and overall ASR on VLSafe. ESC reduces ASR across all scenarios on both benchmarks. Green annotations indicate absolute percentage-point reductions.
Figure 6: Qualitative comparison of VLM responses w/ and w/o ESC. Red highlights indicate incorrect one; green highlights indicate correct one. ESC improves visual grounding across chart reading, arithmetic recognition, and fine-grained object detection. despite one being visible in the bottom-right corner, while ESC accurately locates it. Across all three cases, baseline errors stem from weak visual grounding, t…
Figure 7: Ablation on (a) the choice of verifier and (b) the emotion type in the ESC pipeline, evaluated on VLSafe [10]. Verifier model selection. We vary the verifier across four VLMs: Gemma3-12B [82], Pixtral-12B [1], InternVL2.5-8B [11], and LLaVA-1.5-7B [50]. When LLaVA-1.5-7B acts as both the target model and the verifier, ASR remains at 50.3%, as the model fails to critically evaluate its own outputs. It is k…
Figure 8: Ablation on (a) the insertion location of the emotional cue and (b) the number of emotions in the ESC pipeline, evaluated on VLSafe [10].
Figure 9: Example of the ESC self-correction workflow.
Figure 10: Evaluation prompt for MM-SafetyBench [53] scenarios 01–07 and 09.
Figure 11: Evaluation prompt for MM-SafetyBench [53] scenarios 08 (Political Lobbying) and 13 (Government Decision).
Figure 12: Evaluation prompt for MM-SafetyBench [53] scenarios 10 (Legal Opinion), 11 (Financial Advice), and 12 (Health Consultation).
Figure 13: Evaluation prompt for VLSafe [10].
Figure 14: GPT-4o [63] judge prompt for HallusionBench [27] evaluation.
Figure 15: GPT-4o [63] scoring prompt for MM-Vet [97] evaluation (with official few-shot examples).
Figure 16: Gemma-4-26B [16] scoring prompt for cautiousness evaluation of thinking traces (1–5 scale), used in Tab. 5.
Figure 17: MMSafetyBench [53] ASR across 13 safety scenarios (SD image type).
Figure 18: MMSafetyBench [53] ASR across 13 safety scenarios (SD+Typo image type).
Figure 19: MMSafetyBench [53] ASR across 13 safety scenarios (Typo image type).
Figure 20: Ablation on Qwen2-VL-7B [92] (VLSafe [10]). (a) Choice of Verifier: Gemma3-12B [82] achieves the lowest ASR (11.5%). (b) Emotion type: Negative-Low emotions yield the best safety (9.9% ASR).
Figure 21: Ablation on Qwen2-VL-7B [92] (VLSafe [10]). (a) Insertion location: beginning placement reduces ASR by 10.1 pp over the baseline. (b) Number of emotions: a single emotion is optimal (9.9% ASR). Summary. Across all four ablation dimensions, the optimal ESC configuration on Qwen2-VL-7B [92] matches the one identified on LLaVA-1.5-7B [50]: Gemma3-12B [82] as the Verifier, a single Negative-Low emotion, inser…
Figure 22: Iterative self-correction convergence on VLSafe [10]. (a) LLaVA-1.5-7B and (b) Qwen2-VL-7B. Red: ASR (↓ better); green: safe rate (↑ better).
Figure 23: A correct example from MMStar benchmark.
Figure 24: An error example from MMStar benchmark.
Figure 25: An example where LLaVA performs better from MMStar benchmark.
Figure 26: An example where Qwen performs better from MMStar benchmark.
Figure 27: A correct example from MathVista benchmark.
Figure 28: A correct example (2) from MathVista benchmark.
Figure 29: An error example from MathVista benchmark.
Figure 30: An example where Qwen performs better from MathVista benchmark.
Figure 31: An example where Qwen performs better (2) from MathVista benchmark.
Figure 32: A correct example from MMVP benchmark.
Figure 33: A correct example (2) from MMVP benchmark.
Figure 34: A correct example (3) from MMVP benchmark.
Figure 35: An error example from MMVP benchmark.
Figure 36: An error example (2) from MMVP benchmark.
Figure 37: A safe example from MMSafety benchmark.
Figure 38: A safe example from MMSafety benchmark.
Figure 39: An unsafe example from MMSafety benchmark.
Figure 40: An unsafe example from MMSafety benchmark.
Figure 41: A safe example from VLSafe benchmark.
Figure 42: A safe example from VLSafe benchmark.
Figure 43: A safe example from VLSafe benchmark.
Figure 44: An unsafe example from VLSafe benchmark.
Figure 45: A correct example from RWQA benchmark.
Figure 46: A correct example (2) from RWQA benchmark.
Figure 47: A correct example (3) from RWQA benchmark.
Figure 48: A correct example (4) from RWQA benchmark.
Figure 49: An error example from RWQA benchmark.
Figure 50: A correct example from MMVet benchmark.
Figure 51: A correct example (2) from MMVet benchmark.
Figure 52: An error example from MMVet benchmark.
Figure 53: An error example (2) from MMVet benchmark.
Figure 54: A correct example from HallusionBench benchmark.
Figure 55: A correct example (2) from HallusionBench benchmark.
Figure 56: An error example from HallusionBench benchmark.
Figure 57: An error example (2) from HallusionBench benchmark.
Figure 58: An example where Qwen performs better from HallusionBench benchmark.
Figure 59: A correct example from POPE benchmark.
Figure 60: A correct example (2) from POPE benchmark.
Figure 61: A correct example (3) from POPE benchmark.
Figure 62: An error example from POPE benchmark.
Figure 63: A correct example from BLINK benchmark.
Figure 64: A correct example (2) from BLINK benchmark.
Figure 65: An error example from BLINK benchmark.

Original abstract

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning. Motivated by this finding, we propose ESC (Emotional Self-Correction), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage model to reflect, and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction. Our project is publicly available at https://genai4e.github.io/ESC/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that emotional signals can serve as an effective trigger for self-correction in vision-language models (VLMs) without additional training. It introduces ESC, a training-free framework that deploys an external verifier to detect potentially incorrect initial responses and injects emotional feedback to encourage reflection and yield improved revised outputs. Extensive experiments on safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks are said to demonstrate consistent gains in reliability while preserving overall model utility, positioning emotion as a practical control signal for scalable self-correction.

Significance. If the empirical claims hold after verification of the verifier and ablation details, the work would offer a low-cost, training-free route to more reliable VLMs by repurposing emotional language as a control signal. This could open a distinct research direction focused on emotion-integrated mechanisms rather than post-training or engineered feedback, with potential for broader applicability if the emotional cue proves additive beyond generic revision prompts.

major comments (2)
  1. [Abstract] Abstract: the central claim that ESC improves reliability via emotional self-correction rests on two unverified preconditions—an external verifier that reliably flags incorrect outputs and emotional feedback that measurably outperforms neutral revision instructions—yet the abstract supplies no precision/recall figures for the verifier, no ablation replacing emotional cues with neutral “reconsider” prompts, and no oracle-verifier upper-bound experiment.
  2. [Method (implied by abstract description)] The method description states that the verifier “detects potentially incorrect initial responses and injects emotional feedback,” but provides no quantitative assessment of verifier error rates; if those rates are high, observed benchmark gains could be artifacts of selective revision rather than emotion-driven reflection.
minor comments (1)
  1. [Abstract] The project URL is rendered in red text; this should be corrected to standard formatting.
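
The verifier precision and recall figures requested in the major comments could be computed along the following lines. This is a sketch under stated assumptions: `verifier_precision_recall` is a hypothetical helper, "positive" means the verifier flagged a response as incorrect, and the toy labels are placeholders rather than data from the paper.

```python
from typing import Sequence, Tuple


def verifier_precision_recall(
    flagged: Sequence[bool],
    actually_wrong: Sequence[bool],
) -> Tuple[float, float]:
    """Precision and recall of a verifier's 'incorrect' flags.

    flagged[i] is True if the verifier flagged response i;
    actually_wrong[i] is True if response i is wrong per ground truth.
    """
    if len(flagged) != len(actually_wrong):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(flagged, actually_wrong))
    tp = sum(1 for f, w in pairs if f and w)        # flagged and wrong
    fp = sum(1 for f, w in pairs if f and not w)    # flagged but correct
    fn = sum(1 for f, w in pairs if not f and w)    # missed errors
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Placeholder labels for five responses (illustration only).
flags = [True, True, False, False, True]
truth = [True, False, True, False, True]
print(verifier_precision_recall(flags, truth))
```

Low precision would mean many correct answers are sent for needless revision; low recall would mean many errors never reach the emotional-feedback step. Either case bears on the selective-revision concern raised in major comment 2.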

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency on the verifier and the specific contribution of emotional cues. We address each major comment below and will revise the manuscript to incorporate additional details and experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that ESC improves reliability via emotional self-correction rests on two unverified preconditions—an external verifier that reliably flags incorrect outputs and emotional feedback that measurably outperforms neutral revision instructions—yet the abstract supplies no precision/recall figures for the verifier, no ablation replacing emotional cues with neutral “reconsider” prompts, and no oracle-verifier upper-bound experiment.

    Authors: We agree that the abstract, due to length constraints, does not detail verifier metrics or ablations. The full manuscript reports consistent benchmark gains from ESC, but we acknowledge that explicitly addressing the preconditions would strengthen the abstract. In revision we will add a concise statement on verifier effectiveness and the role of emotional feedback, include a neutral-prompt ablation, and report an oracle-verifier upper bound to quantify the headroom. revision: yes

  2. Referee: [Method (implied by abstract description)] The method description states that the verifier “detects potentially incorrect initial responses and injects emotional feedback,” but provides no quantitative assessment of verifier error rates; if those rates are high, observed benchmark gains could be artifacts of selective revision rather than emotion-driven reflection.

    Authors: The concern is valid: without reported verifier error rates it is difficult to fully exclude selective-revision artifacts. While end-to-end gains across diverse benchmarks support that emotional feedback drives reflection, we will add a quantitative analysis of the verifier’s precision and recall on a held-out subset in the revised manuscript. This will allow readers to assess whether gains arise primarily from emotion-triggered correction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential predictions

full rationale

The paper proposes ESC as a training-free method relying on an external verifier and emotional feedback prompts. No equations, first-principles derivations, or fitted parameters are described in the provided text. Claims rest on experimental benchmarks rather than any reduction of outputs to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central mechanism (verifier + emotional injection) is presented as a design choice validated by results, not derived tautologically. This is a standard empirical contribution with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5845 in / 949 out tokens · 26069 ms · 2026-07-03T21:18:19.603855+00:00 · methodology


Reference graph

Works this paper leans on

108 extracted references · 50 canonical work pages · 21 internal anchors

  1. [1]

    Agrawal, P., Antoniak, S., Hanna, E.B., Bout, B., Chaplot, D., Chudnovsky, J., Costa, D., Monicault, B.D., Garg, S., Gervet, T., Ghosh, S., Héliou, A., Jacob, P., Jiang, A.Q., Khandelwal, K., Lacroix, T., Lample, G., Casas, D.L., Lavril, T., Scao, T.L., Lo, A., Marshall, W., Martin, L., Mensch, A., Muddireddy, P., Nemychnikova, V., Pellat, M., Platen, P...

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  3. [3]

    Qwen2.5-VL Technical Report

    Bai, S., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    Barrett, L.F.: Solving the emotion paradox: Categorization and the experience of emotion. Personality and Social Psychology Review 10(1), 20–46 (2006)

  5. [5]

    Barrett, L.F.: The theory of constructed emotion: An active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience 12(1), 1–23 (2017)

  6. [6]

    Barrett, L.F., Russell, J.A.: The structure of current affect: Controversies and emerging consensus. Current Directions in Psychological Science 8(1), 10–14 (1999)

  7. [7]

    Bhattacharyya, S., Wang, J.Z.: Evaluating vision-language models for emotion recognition (2025), https://arxiv.org/abs/2502.05660

  8. [8]

    Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? (2024)

  9. [9]

    Chen, P., Lou, Y., Cao, S., Guo, J., Fan, L., Wu, Y., Yang, L., Ma, L., Ye, J.: Sd-vlm: Spatial measuring and understanding with depth-encoded vision-language models (2025), https://arxiv.org/abs/2509.17664

  10. [10]

    Chen, Y., Sikka, K., Cogswell, M., Ji, H., Divakaran, A.: Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081 (2023)

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  12. [12]

    Cheng, K., YanTao, L., Xu, F., Zhang, J., Zhou, H., Liu, Y.: Vision-language models can self-improve reasoning via reflection (2025)

  13. [13]

    Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning

    Cheng, Z., Cheng, Z.Q., He, J.Y., Sun, J., Wang, K., Lin, Y., Lian, Z., Peng, X., Hauptmann, A.: Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning (2024), https://arxiv.org/abs/2406.11161

  14. [14]

    Choi, D., Son, G., Kim, S.Y., Paik, G., Hong, S.: Improving fine-grained visual understanding in vlms through text-only training (2024), https://arxiv.org/abs/2412.12940

  15. [15]

    Cowen, A.S., Keltner, D.: Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences 114(38), E7900–E7909 (2017)

  16. [16]

    DeepMind, G., Ballantyne, I., Cameron, G., Cruz, M., Lacombe, O., Quan, K., Sanseviero, O.: Gemma 4 model card (2026), https://ai.google.dev/gemma/docs/core/model_card_4

  17. [17]

    Deng, S., Zhao, W., Li, Y.J., Wan, K., Miranda, D., Kale, A., Tian, Y.: Efficient self-improvement in multimodal large language models: A model-level judge-free approach (2024)

  18. [18]

    Deng, Y., Chen, G., Gu, T., Kong, L., Li, Y., Tang, Z., Zhang, K.: Towards self-refinement of vision-language models with triangular consistency (2025)

  19. [19]

    Ding, Y., Qiu, Z., Li, B., Zhang, R.: Learning self-correction in vision-language models via rollout augmentation (2026)

  20. [20]

    Ding, Y., Zhang, R.: Sherlock: Self-correcting reasoning in vision-language models (2025)

  21. [21]

    Duan, C., Sun, K., Fang, R., Zhang, M., Feng, Y., Luo, Y., Liu, Y., Wang, K., Pei, P., Cai, X., Li, H., Ma, Y., Liu, X.: Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images (2025), https://arxiv.org/abs/2510.11718

  22. [22]

    Ekman, P.: An argument for basic emotions. Cognition & Emotion 6(3–4), 169–200 (1992)

  23. [23]

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive (2024)

  24. [24]

    Gao, G.J., Li, T., Shi, J., Li, Y., Zhang, Z., Figueroa, N., Jayaraman, D.: Vlmgineer: Vision language models as robotic toolsmiths (2025), https://arxiv.org/abs/2507.12644

  25. [25]

    Gariboldi, C., Tokida, H., Kinjo, K., Asada, Y., Carballo, A.: Vlad: A vlm-augmented autonomous driving framework with hierarchical planning and interpretable decision process (2025), https://arxiv.org/abs/2507.01284

  26. [26]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Cauchete...

  27. [27]

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models (2024)

  28. [28]

    Guo, Q., Mello, S.D., Yin, H., Byeon, W., Cheung, K.C., Yu, Y., Luo, P., Liu, S.: Regiongpt: Towards region understanding vision language model (2024), https://arxiv.org/abs/2403.02330

  29. [29]

    He, C., Zhu, S., Liu, H., Gao, F., Jia, Y., Zan, H., Peng, M.: DialogueMMT: Dialogue scenes understanding enhanced multi-modal multi-task tuning for emotion recognition in conversations. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Proceedings of the 31st International Conference on Computational Lingu...

  30. [30]

    He, J., Lin, H., Wang, Q., Fung, Y.R., Ji, H.: Self-correction is more than refinement: A learning framework for visual and language reasoning tasks (2025)

  31. [31]

    Hu, H., Zhou, Y., You, L., Xu, H., Wang, Q., Lian, Z., Yu, F.R., Ma, F., Cui, L.: Emobench-m: Benchmarking emotional intelligence for multimodal large language models (2026), https://arxiv.org/abs/2502.04424

  32. [32]

    Huang, J.-t., Lam, M.H., Li, E.J., Ren, S., Wang, W., Jiao, W., Tu, Z., Lyu, M.R.: Emotionally numb or empathetic? evaluating how llms feel using emotionbench (2024), https://arxiv.org/abs/2308.03656

  33. [33]

    Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., Zhou, D.: Large language models cannot self-correct reasoning yet (2023)

  34. [34]

    Jian, P., Wu, J., Sun, W., Wang, C., Ren, S., Zhang, J.: Look again, think slowly: Enhancing visual reflection in vision-language models (2025)

  35. [35]

    Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: European Conference on Computer Vision (ECCV). pp. 235–251 (2016)

  36. [36]

    Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners (2023), https://arxiv.org/abs/2205.11916

  37. [37]

    Laurençon, H., Tronchon, L., Cord, M., Sanh, V.: What matters when building vision-language models? (2024), https://arxiv.org/abs/2405.02246

  38. [38]

    Le, T.Q.K., Vu, N.L.V., Pham, H.H., Huynh, X.L., Nguyen, T.H., Le, M.H.N., Nguyen, Q., Nguyen, H.D.: Hdc: Hierarchical distillation for multi-level noisy consistency in semi-supervised fetal ultrasound segmentation (2025), https://arxiv.org/abs/2504.09876

  39. [39]

    Lee, S., Park, S.H., Jo, Y., Seo, M.: Volcano: mitigating multimodal hallucination through self-feedback guided revision (2024)

  40. [40]

    Lerner, J.S., Li, Y., Valdesolo, P., Kassam, K.S.: Emotion and decision making. Annual Review of Psychology 66, 799–823 (2015)

  41. [41]

    Li, B., Zhang, P., Yang, J., Zhang, Y., Pu, F., Liu, Z.: Otterhd: A high-resolution multi-modality model (2023), https://arxiv.org/abs/2311.04219

  42. [42]

    Li, C., Wang, J., Zhang, Y., Zhu, K., Hou, W., Lian, J., Luo, F., Yang, Q., Xie, X.: Large language models understand and can be enhanced by emotional stimuli (2023)

  43. [43]

    Li, C., Wang, J., Zhang, Y., Zhu, K., Wang, X., Hou, W., Lian, J., Luo, F., Yang, Q., Xie, X.: The good, the bad, and why: Unveiling emotions in generative ai (2023)

  44. [44]

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models (2023)

  45. [45]

    Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi-modal models (2024), https://arxiv.org/abs/2311.06607

  46. [46]

    Liang, Z., Guo, K., Liu, G., Guo, T., Zhou, Y., Yang, T., Jiao, J., Pi, R., Zhang, J., Zhang, X.: Scemqa: A scientific college entrance level multimodal question answering benchmark (2024), https://arxiv.org/abs/2402.05138

  47. [47]

    Liao, Y.H., Mahmood, R., Fidler, S., Acuna, D.: Can large vision-language models correct semantic grounding errors by themselves? (2025)

  48. [48]

    Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models (2024), https://arxiv.org/abs/2312.07533

  49. [49]

    Liu, C., Xie, Z., Zhao, S., Zhou, J., Xu, T., Li, M., Chen, E.: Speak from heart: An emotion-guided llm-based multimodal method for emotional dialogue generation. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 533–542. ICMR ’24, Association for Computing Machinery, New York, NY, USA (2024), https://doi.org/10.1145/3652583.3658104

  50. [50]

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2024)

  51. [51]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llavanext: Improved reasoning, ocr, and world knowledge (2024)

  52. [52]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

  53. [53]

    Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., Qiao, Y.: Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In: European Conference on Computer Vision (ECCV) (2024)

  54. [54]

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)

  55. [55]

    Man, F., Chen, X., Wang, H., Zhao, B., Li, H., Chen, X.: Vaeer: Visual attention-inspired emotion elicitation reasoning (2025), https://arxiv.org/abs/2505.24342

  56. [56]

    Mehrabian, A., Russell, J.A.: An Approach to Environmental Psychology. MIT Press (1974)

  57. [57]

    Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., Gao, J.: Large language models: A survey (2025), https://arxiv.org/abs/2402.06196

  58. [58]

    Mozikov, M., Severin, N., Bodishtianu, V., Glushanina, M., Nasonov, I., Orekhov, D., Pekhotin, V., Makovetskiy, I., Baklashkin, M., Lavrentyev, V., et al.: Eai: Emotional decision-making of llms in strategic games and ethical dilemmas. Advances in Neural Information Processing Systems 37, 53969–54002 (2024)

  59. [59]

    Nguyen, T.H., Tran, H.L., Ngo, T.D.: Itself: Attention guided fine-grained alignment for vision-language retrieval (2026), https://arxiv.org/abs/2601.01024

  60. [60]

    Nguyen, T.H., Tran, H.L., Phan-Nguyen, H.P., Dinh, Q.V.: Hybrid, unified and iterative: A novel framework for text-based person anomaly retrieval (2025), https://arxiv.org/abs/2511.22470

  61. [61]

    Nguyen, T.H., Tran, Q.K., Quang-Hoang, A.T.: Improving generalization in visual reasoning via self-ensemble (2024), https://arxiv.org/abs/2410.20883

  62. [62]

    Nguyen-Nhu, T.A., Minh, T.D.H., To-Thanh, D., Le-Gia, P., Vo-Lan, T., Nguyen, T.H.: Ster-vlm: Spatio-temporal with enhanced reference vision-language models (2025), https://arxiv.org/abs/2508.13470

  63. [63]

    OpenAI, Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., Mądry, A., Baker-Whitcomb, A., Beutel, A., Borzunov, A., Carney, A., Chow, A., Kirillov, A., Nichol, A., Paino, A., Renzin, A., Passos, A.T., Kirillov, A., Christakis, A., Conneau, A., Kamali, A., Jabri, A., Moyer, A., Tam, A.,...

  64. [64]

    GPT-4 Technical Report

    OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L., Brockman, ...

  65. [65]

    Peng, S., Fu, D., Gao, L., Zhong, X., Fu, H., Tang, Z.: Multimath: Bridging visual and mathematical reasoning for large language models (2024), https://arxiv.org/abs/2409.00147

  66. [66]

    Plutchik, R.: Emotion: A Psychoevolutionary Synthesis. Harper & Row (1980)

  67. [67]

    Posner, J., Russell, J.A., Peterson, B.S.: The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and psychopathology 17(3), 715–734 (2005)

  68. [68]

    Qiu, X., Jia, H., Zeng, Z., Shen, S., Meng, C., Yang, Y., Zhu, L.: Unified generation and self-verification for vision-language models via advantage decoupled preference optimization. arXiv preprint arXiv:2601.01483 (2026)

  69. [69]

    Qu, M., Hu, Y., Han, K., Wei, Y., Zhao, Y.: Recot: Reflective self-correction training for mitigating confirmation bias in large vision-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9147–9157 (2025)

  70. [70]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  71. [71]

    Russell, J.A.: A circumplex model of affect. Journal of personality and social psychology 39(6), 1161 (1980)

  72. [72]

    Russell, J.A.: Culture and the categorization of emotions. Psychological Bulletin 110(3), 426–450 (1991)

  73. [73]

    Russell, J.A.: Core affect and the psychological construction of emotion. Psychological Review 110(1), 145–172 (2003)

  74. [74]

    Sabour, S., Liu, S., Zhang, Z., Liu, J.M., Zhou, J., Sunaryo, A.S., Li, J., Lee, T.M.C., Mihalcea, R., Huang, M.: Emobench: Evaluating the emotional intelligence of large language models (2024), https://arxiv.org/abs/2402.12071

  75. [75]

    Scherer, K.R.: Appraisal considered as a process of multilevel sequential checking. In: Scherer, K.R., Schorr, A., Johnstone, T. (eds.) Appraisal Processes in Emotion: Theory, Methods, Research, pp. 92–120. Oxford University Press (2001)

  76. [76]

    Scherer, K.R.: What are emotions? and how can they be measured? Social Science Information 44(4), 695–729 (2005)

  77. [77]

    Scherer, K.R.: The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion 23(7), 1307–1351 (2009)

  78. [78]

    Schwarz, N.: Feelings-as-information theory. In: Handbook of Theories of Social Psychology, pp. 289–308. Sage (2012)

  79. [79]

    Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning (2024), https://arxiv.org/abs/2403.16999

  80. [80]

    Shao, R., Li, W., Zhang, L., Zhang, R., Liu, Z., Chen, R., Nie, L.: Large vlm-based vision-language-action models for robotic manipulation: A survey (2025), https://arxiv.org/abs/2508.13073
