ESC: Emotional Self-Correction for Reliable Vision-Language Models

Cuong Tuan Nguyen; Dat Nguyen; Hoang M. Le; Hung Viet Nguyen; Huy Nguyen Minh Nhat; Minh-Nhat Nguyen; Min Xu; Nguyen Nhat Huy; Phat Kim Huynh; Thanh-Huy Nguyen

arXiv: 2607.02089 · v1 · pith:CQJOFFPJnew · submitted 2026-07-01 · cs.CV · cs.AI · cs.CL · cs.LG · cs.MM


Tien-Huy Nguyen, Minh-Nhat Nguyen, Nguyen Nhat Huy, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Thanh-Huy Nguyen, Cuong Tuan Nguyen, Hoang M. Le, Dat Nguyen, Phat Kim Huynh, Min Xu, Ulas Bagci

Pith reviewed 2026-07-03 21:18 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

keywords vision-language models · self-correction · emotional cues · training-free methods · model reliability · hallucination mitigation


The pith

Emotional signals trigger self-correction in vision-language models without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that emotional cues can activate latent self-correction behaviors already present in VLMs. An external verifier spots likely errors in an initial response and feeds back emotional language that prompts the model to reflect and revise. This produces more reliable outputs on safety, hallucination, perception, and reasoning tasks while keeping overall performance intact and requiring no extra training or parameter changes.

Core claim

Emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning; the resulting ESC framework uses an external verifier to detect incorrect initial responses and injects emotional feedback so the VLM produces a better revised answer without additional training.

What carries the argument

ESC (Emotional Self-Correction) framework: an external verifier detects potentially incorrect responses and injects emotional feedback to prompt reflection and revision.
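
The verify-then-revise loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `verify`, `esc_answer`, and the wording of `EMOTIONAL_CUE` are all hypothetical names and text introduced here, and image inputs are omitted for brevity.

```python
from typing import Callable

# Hypothetical interfaces; these names are illustrative and do not come from the paper.
Generate = Callable[[str], str]      # prompt -> model response (image input omitted)
Verify = Callable[[str, str], bool]  # (question, response) -> True if the response looks incorrect

# Assumed wording; the paper's actual emotional prompts are not reproduced here.
EMOTIONAL_CUE = (
    "I feel worried that this answer may be wrong. "
    "Please reflect carefully and revise it."
)


def esc_answer(question: str, generate: Generate, verify: Verify,
               cue: str = EMOTIONAL_CUE) -> str:
    """Return the initial response, or a revised one if the verifier flags it."""
    initial = generate(question)
    if not verify(question, initial):
        return initial  # verifier accepts the response: no second pass
    # The emotional cue goes first, followed by the question and the flagged answer.
    revision_prompt = (
        f"{cue}\n\nQuestion: {question}\n"
        f"Previous answer: {initial}\nRevised answer:"
    )
    return generate(revision_prompt)
```

The sketch makes a single revision pass and leaves the model untouched, which matches the training-free framing; whether one pass is enough, and how the cue should be worded, are choices the paper's ablations address and this sketch does not.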

If this is right

VLMs gain reliability on safety, hallucination, and reasoning benchmarks without any retraining or added parameters.
Emotion functions as a practical control signal that scales self-correction across multiple VLM tasks.
Model utility stays intact while error rates drop, showing the method does not trade one capability for another.
The approach opens a training-free route to more cautious reasoning in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same emotional-trigger idea could be tested on language-only models to check whether the effect depends on vision input.
Different emotional tones (calm versus urgent) might produce measurably different revision quality; this remains untested in the paper.
If the verifier itself is a smaller model, the whole pipeline could run locally and reduce reliance on large external judges.

Load-bearing premise

An external verifier can accurately detect potentially incorrect initial responses, and injecting emotional feedback will reliably cause the VLM to produce a better revised response.

What would settle it

Run the same initial responses through ESC but replace emotional feedback with neutral or factual prompts and measure whether accuracy gains disappear or shrink substantially.
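
One way to set up that comparison is sketched below. It is a hedged illustration only: `ablate_feedback`, the `Example` tuple layout, the `revise` callable, and exact-match scoring are assumptions made here, not details from the paper. The key point is that the initial responses and the verifier flags are held fixed, so only the feedback wording varies between conditions.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str, str]           # (question, initial_answer, gold_answer)
Revise = Callable[[str, str, str], str]  # (question, initial_answer, cue) -> revised answer


def ablate_feedback(examples: Sequence[Example], flagged: Sequence[bool],
                    revise: Revise, cues: Dict[str, str]) -> Dict[str, float]:
    """Accuracy per feedback condition, with initial answers and flags held fixed."""
    if not examples or len(examples) != len(flagged):
        raise ValueError("examples and flagged must be non-empty and equally long")
    scores: Dict[str, float] = {}
    for name, cue in cues.items():
        correct = 0
        for (question, initial, gold), is_flagged in zip(examples, flagged):
            # Only flagged responses are revised; the rest pass through unchanged.
            final = revise(question, initial, cue) if is_flagged else initial
            correct += int(final == gold)  # exact match is an assumed scoring rule
        scores[name] = correct / len(examples)
    return scores
```

Calling it with two entries in `cues`, for instance one emotional and one neutral "please reconsider" prompt, gives a direct accuracy gap between the conditions; a gap near zero would suggest that any revision signal works equally well.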

Figures

Figures reproduced from arXiv: 2607.02089 by Cuong Tuan Nguyen, Dat Nguyen, Hoang M. Le, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Minh-Nhat Nguyen, Min Xu, Nguyen Nhat Huy, Phat Kim Huynh, Thanh-Huy Nguyen, Tien-Huy Nguyen, Ulas Bagci.

**Figure 1.** Comparison of ESC against VLMs [50, 92] across diverse benchmarks.

**Figure 2.** Overview of ESC and comparison with existing self-correction

**Figure 3.** Emotional context reduces ASR across all 5 VLMs. Left: qualitative example showing how emotional self-expression shifts LLaVA-1.5-7B from compliance to refusal. Right: ASR comparison between neutral and emotionally-cued queries on VLSafe [10]. 3.2 How do different emotional states shape VLMs’ behavior more broadly? Having established that emotional context affects VLM safety, we adopt Russell’s Circumplex…

**Figure 4.** Effect of emotional states on ASR across five VLMs on VLSafe [10] Benchmark. All four emotional quadrants [67, 71] reduce ASR relative to the neutral baseline, with negative-valence prompts yielding consistently larger reductions

**Figure 5.** Scenario-wise ASR on MMSafetyBench (averaged over SD, SD+Typo, and Typo image types) and overall ASR on VLSafe. ESC reduces ASR across all scenarios on both benchmarks. Green annotations indicate absolute percentage-point reductions

**Figure 6.** Qualitative comparison of VLM responses w/ and w/o ESC. Red highlights indicate incorrect one; green highlights indicate correct one. ESC improves visual grounding across chart reading, arithmetic recognition, and fine-grained object detection. … despite one being visible in the bottom-right corner, while ESC accurately locates it. Across all three cases, baseline errors stem from weak visual grounding, t…

**Figure 7.** Ablation on (a) the choice of verifier and (b) the emotion type in the ESC pipeline, evaluated on VLSafe [10]. Verifier model selection. We vary the verifier across four VLMs: Gemma3-12B [82], Pixtral-12B [1], InternVL2.5-8B [11], and LLaVA-1.5-7B [50]. When LLaVA-1.5-7B acts as both the target model and the verifier, ASR remains at 50.3%, as the model fails to critically evaluate its own outputs. It is k…

**Figure 8.** Ablation on (a) the insertion location of the emotional cue and (b) the number of emotions in the ESC pipeline, evaluated on VLSafe [10]

**Figure 9.** Example of the ESC self-correction workflow.

**Figure 10.** Evaluation prompt for MM-SafetyBench [53] scenarios 01–07 and 09

**Figure 11.** Evaluation prompt for MM-SafetyBench [53] scenarios 08 (Political Lobbying) and 13 (Government Decision)

**Figure 12.** Evaluation prompt for MM-SafetyBench [53] scenarios 10 (Legal Opinion), 11 (Financial Advice), and 12 (Health Consultation)

**Figure 13.** Evaluation prompt for VLSafe [10]

**Figure 14.** GPT-4o [63] judge prompt for HallusionBench [27] evaluation

**Figure 15.** GPT-4o [63] scoring prompt for MM-Vet [97] evaluation (with official few-shot examples)

**Figure 16.** Gemma-4-26B [16] scoring prompt for cautiousness evaluation of thinking traces (1–5 scale), used in Tab. 5

**Figure 17.** MMSafetyBench [53] ASR across 13 safety scenarios (SD image type)

**Figure 18.** MMSafetyBench [53] ASR across 13 safety scenarios (SD+Typo image type)

**Figure 19.** MMSafetyBench [53] ASR across 13 safety scenarios (Typo image type)

**Figure 20.** Ablation on Qwen2-VL-7B [92] (VLSafe [10]). (a) Choice of Verifier: Gemma3-12B [82] achieves the lowest ASR (11.5%). (b) Emotion type: Negative-Low emotions yield the best safety (9.9% ASR)

**Figure 21.** Ablation on Qwen2-VL-7B [92] (VLSafe [10]). (a) Insertion location: beginning placement reduces ASR by 10.1 pp over the baseline. (b) Number of emotions: a single emotion is optimal (9.9% ASR). Summary. Across all four ablation dimensions, the optimal ESC configuration on Qwen2-VL-7B [92] matches the one identified on LLaVA-1.5-7B [50]: Gemma3-12B [82] as the Verifier, a single Negative-Low emotion, inser…

**Figure 22.** Iterative self-correction convergence on VLSafe [10]. (a) LLaVA-1.5-7B and (b) Qwen2-VL-7B. Red: ASR (↓ better); green: safe rate (↑ better)

**Figure 23.** A correct example from MMStar benchmark.

**Figure 24.** An error example from MMStar benchmark.

**Figure 25.** An example where LLaVA performs better from MMStar benchmark.

**Figure 26.** An example where Qwen performs better from MMStar benchmark.

**Figure 27.** A correct example from MathVista benchmark.

**Figure 28.** A correct example (2) from MathVista benchmark.

**Figure 29.** An error example from MathVista benchmark.

**Figure 30.** An example where Qwen performs better from MathVista benchmark.

**Figure 31.** An example where Qwen performs better (2) from MathVista benchmark.

**Figure 32.** A correct example from MMVP benchmark.

**Figure 33.** A correct example (2) from MMVP benchmark.

**Figure 34.** A correct example (3) from MMVP benchmark.

**Figure 35.** An error example from MMVP benchmark.

**Figure 36.** An error example (2) from MMVP benchmark.

**Figure 37.** A safe example from MMSafety benchmark.

**Figure 38.** A safe example from MMSafety benchmark.

**Figure 39.** An unsafe example from MMSafety benchmark.

**Figure 40.** An unsafe example from MMSafety benchmark.

**Figure 41.** A safe example from VLSafe benchmark.

**Figure 42.** A safe example from VLSafe benchmark.

**Figure 43.** A safe example from VLSafe benchmark.

**Figure 44.** An unsafe example from VLSafe benchmark.

**Figure 45.** A correct example from RWQA benchmark.

**Figure 46.** A correct example (2) from RWQA benchmark.

**Figure 47.** A correct example (3) from RWQA benchmark.

**Figure 48.** A correct example (4) from RWQA benchmark.

**Figure 49.** An error example from RWQA benchmark.

**Figure 50.** A correct example from MMVet benchmark.

**Figure 51.** A correct example (2) from MMVet benchmark.

**Figure 52.** An error example from MMVet benchmark.

**Figure 53.** An error example (2) from MMVet benchmark.

**Figure 54.** A correct example from HallusionBench benchmark.

**Figure 55.** A correct example (2) from HallusionBench benchmark.

**Figure 56.** An error example from HallusionBench benchmark.

**Figure 57.** An error example (2) from HallusionBench benchmark.

**Figure 58.** An example where Qwen performs better from HallusionBench benchmark.

**Figure 59.** A correct example from POPE benchmark.

**Figure 60.** A correct example (2) from POPE benchmark.

**Figure 61.** A correct example (3) from POPE benchmark.

**Figure 62.** An error example from POPE benchmark.

**Figure 63.** A correct example from BLINK benchmark.

**Figure 64.** A correct example (2) from BLINK benchmark.

**Figure 65.** An error example from BLINK benchmark.

Original abstract

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. **We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning.** Motivated by this finding, we propose ESC (Emotional Self-Correction), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage model to reflect, and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. **We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction.** Our project is publicly available at https://genai4e.github.io/ESC/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ESC frames emotional cues as a training-free trigger for VLM self-correction via an external verifier, but the absence of verifier metrics and emotion-specific ablations leaves the mechanism under-supported.


ESC claims that emotional signals can trigger self-correction in vision-language models without any training. An external verifier flags potentially wrong initial outputs and injects emotional feedback to prompt a revised response. The authors report gains on safety, hallucination, perception, and reasoning benchmarks while holding overall utility steady.

The distinct element is positioning emotion as a control signal rather than just a recognition task. The evaluation spans multiple benchmark categories and the project page is public, which makes the setup easier to inspect or extend.

The soft spots sit with the verifier and the role of emotion itself. The text gives no precision or recall numbers for how accurately the verifier identifies bad answers. There is also no ablation that replaces the emotional language with neutral revision prompts to check whether the emotion component adds anything measurable. If verifier errors drive the results or if any revision signal works equally well, the emotional framing loses force. The stress-test note on verifier accuracy matches what is shown.

This paper targets researchers working on reliable multimodal models and self-correction methods. Readers already exploring post-hoc fixes for VLMs could extract the core idea and test it with tighter controls. It has enough novelty in framing and breadth in experiments to merit sending out for peer review, though the authors should expect direct questions on the verifier implementation and the necessity of the emotional component.

Referee Report

2 major / 1 minor

Summary. The paper claims that emotional signals can serve as an effective trigger for self-correction in vision-language models (VLMs) without additional training. It introduces ESC, a training-free framework that deploys an external verifier to detect potentially incorrect initial responses and injects emotional feedback to encourage reflection and yield improved revised outputs. Extensive experiments on safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks are said to demonstrate consistent gains in reliability while preserving overall model utility, positioning emotion as a practical control signal for scalable self-correction.

Significance. If the empirical claims hold after verification of the verifier and ablation details, the work would offer a low-cost, training-free route to more reliable VLMs by repurposing emotional language as a control signal. This could open a distinct research direction focused on emotion-integrated mechanisms rather than post-training or engineered feedback, with potential for broader applicability if the emotional cue proves additive beyond generic revision prompts.

major comments (2)

[Abstract] The central claim that ESC improves reliability via emotional self-correction rests on two unverified preconditions: an external verifier that reliably flags incorrect outputs, and emotional feedback that measurably outperforms neutral revision instructions. Yet the abstract supplies no precision/recall figures for the verifier, no ablation replacing emotional cues with neutral “reconsider” prompts, and no oracle-verifier upper-bound experiment.
[Method (implied by abstract description)] The method description states that the verifier “detects potentially incorrect initial responses and injects emotional feedback,” but provides no quantitative assessment of verifier error rates; if those rates are high, observed benchmark gains could be artifacts of selective revision rather than emotion-driven reflection.
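
The verifier error rates requested here could be reported with a short computation like the one below. It is a sketch under assumptions: `verifier_precision_recall` is a hypothetical helper, and it presumes a labeled subset where each initial response is already marked as incorrect or not.

```python
from typing import Sequence, Tuple


def verifier_precision_recall(flagged: Sequence[bool],
                              incorrect: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of verifier flags against ground-truth error labels."""
    if len(flagged) != len(incorrect):
        raise ValueError("flagged and incorrect must have the same length")
    pairs = list(zip(flagged, incorrect))
    tp = sum(1 for f, i in pairs if f and i)      # flagged and truly wrong
    fp = sum(1 for f, i in pairs if f and not i)  # flagged but actually right
    fn = sum(1 for f, i in pairs if not f and i)  # wrong but missed
    # Return 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Low precision would mean many correct answers are being sent for revision, and low recall would mean many errors never reach the emotional feedback step; either case changes how the end-to-end gains should be read.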

minor comments (1)

[Abstract] The project URL is rendered in red text; this should be corrected to standard formatting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency on the verifier and the specific contribution of emotional cues. We address each major comment below and will revise the manuscript to incorporate additional details and experiments.


Referee: [Abstract] The central claim that ESC improves reliability via emotional self-correction rests on two unverified preconditions: an external verifier that reliably flags incorrect outputs, and emotional feedback that measurably outperforms neutral revision instructions. Yet the abstract supplies no precision/recall figures for the verifier, no ablation replacing emotional cues with neutral “reconsider” prompts, and no oracle-verifier upper-bound experiment.

Authors: We agree that the abstract, due to length constraints, does not detail verifier metrics or ablations. The full manuscript reports consistent benchmark gains from ESC, but we acknowledge that explicitly addressing the preconditions would strengthen the abstract. In revision we will add a concise statement on verifier effectiveness and the role of emotional feedback, include a neutral-prompt ablation, and report an oracle-verifier upper bound to quantify the headroom.

Revision: yes

Referee: [Method (implied by abstract description)] The method description states that the verifier “detects potentially incorrect initial responses and injects emotional feedback,” but provides no quantitative assessment of verifier error rates; if those rates are high, observed benchmark gains could be artifacts of selective revision rather than emotion-driven reflection.

Authors: The concern is valid: without reported verifier error rates it is difficult to fully exclude selective-revision artifacts. While end-to-end gains across diverse benchmarks support that emotional feedback drives reflection, we will add a quantitative analysis of the verifier’s precision and recall on a held-out subset in the revised manuscript. This will allow readers to assess whether gains arise primarily from emotion-triggered correction.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential predictions


The paper proposes ESC as a training-free method relying on an external verifier and emotional feedback prompts. No equations, first-principles derivations, or fitted parameters are described in the provided text. Claims rest on experimental benchmarks rather than any reduction of outputs to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central mechanism (verifier + emotional injection) is presented as a design choice validated by results, not derived tautologically. This is a standard empirical contribution with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty by necessity.



Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · 21 internal anchors

[1] Agrawal, P., Antoniak, S., Hanna, E.B., Bout, B., Chaplot, D., Chudnovsky, J., Costa, D., Monicault, B.D., Garg, S., Gervet, T., Ghosh, S., Héliou, A., Jacob, P., Jiang, A.Q., Khandelwal, K., Lacroix, T., Lample, G., Casas, D.L., Lavril, T., Scao, T.L., Lo, A., Marshall, W., Martin, L., Mensch, A., Muddireddy, P., Nemychnikova, V., Pellat, M., Platen, P... (arXiv, 2024)

[2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ... Qwen3-VL Technical Report (arXiv, 2025)

[3] Bai, S., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

[4] Barrett, L.F.: Solving the emotion paradox: Categorization and the experience of emotion. Personality and Social Psychology Review 10(1), 20–46 (2006)

[5] Barrett, L.F.: The theory of constructed emotion: An active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience 12(1), 1–23 (2017)

[6] Barrett, L.F., Russell, J.A.: The structure of current affect: Controversies and emerging consensus. Current Directions in Psychological Science 8(1), 10–14 (1999)

[7] Bhattacharyya, S., Wang, J.Z.: Evaluating vision-language models for emotion recognition (2025), https://arxiv.org/abs/2502.05660

[8] Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? (2024)

[9] Chen, P., Lou, Y., Cao, S., Guo, J., Fan, L., Wu, Y., Yang, L., Ma, L., Ye, J.: Sd-vlm: Spatial measuring and understanding with depth-encoded vision-language models (2025), https://arxiv.org/abs/2509.17664

[10] Chen, Y., Sikka, K., Cogswell, M., Ji, H., Divakaran, A.: Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081 (2023)

[11] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
[12] Cheng, K., YanTao, L., Xu, F., Zhang, J., Zhou, H., Liu, Y.: Vision-language models can self-improve reasoning via reflection (2025)

[13] Cheng, Z., Cheng, Z.Q., He, J.Y., Sun, J., Wang, K., Lin, Y., Lian, Z., Peng, X., Hauptmann, A.: Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning (2024), https://arxiv.org/abs/2406.11161

[14] Choi, D., Son, G., Kim, S.Y., Paik, G., Hong, S.: Improving fine-grained visual understanding in vlms through text-only training (2024), https://arxiv.org/abs/2412.12940

[15] Cowen, A.S., Keltner, D.: Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences 114(38), E7900–E7909 (2017)

[16] DeepMind, G., Ballantyne, I., Cameron, G., Cruz, M., Lacombe, O., Quan, K., Sanseviero, O.: Gemma 4 model card (2026), https://ai.google.dev/gemma/docs/core/model_card_4

[17] Deng, S., Zhao, W., Li, Y.J., Wan, K., Miranda, D., Kale, A., Tian, Y.: Efficient self-improvement in multimodal large language models: A model-level judge-free approach (2024)

[18] Deng, Y., Chen, G., Gu, T., Kong, L., Li, Y., Tang, Z., Zhang, K.: Towards self-refinement of vision-language models with triangular consistency (2025)

[19] Ding, Y., Qiu, Z., Li, B., Zhang, R.: Learning self-correction in vision-language models via rollout augmentation (2026)

[20] Ding, Y., Zhang, R.: Sherlock: Self-correcting reasoning in vision-language models (2025)

[21] Duan, C., Sun, K., Fang, R., Zhang, M., Feng, Y., Luo, Y., Liu, Y., Wang, K., Pei, P., Cai, X., Li, H., Ma, Y., Liu, X.: Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images (2025), https://arxiv.org/abs/2510.11718

[22] Ekman, P.: An argument for basic emotions. Cognition & Emotion 6(3–4), 169–200 (1992)

[23] Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive (2024)

[24] Gao, G.J., Li, T., Shi, J., Li, Y., Zhang, Z., Figueroa, N., Jayaraman, D.: Vlmgineer: Vision language models as robotic toolsmiths (2025), https://arxiv.org/abs/2507.12644

[25] Gariboldi, C., Tokida, H., Kinjo, K., Asada, Y., Carballo, A.: Vlad: A vlm-augmented autonomous driving framework with hierarchical planning and interpretable decision process (2025), https://arxiv.org/abs/2507.01284

[26] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Cauchete... The Llama 3 Herd of Models (arXiv, 2024)
[27]

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models (2024)

[28]

Guo, Q., Mello, S.D., Yin, H., Byeon, W., Cheung, K.C., Yu, Y., Luo, P., Liu, S.: Regiongpt: Towards region understanding vision language model (2024), https://arxiv.org/abs/2403.02330

[29]

He, C., Zhu, S., Liu, H., Gao, F., Jia, Y., Zan, H., Peng, M.: DialogueMMT: Dialogue scenes understanding enhanced multi-modal multi-task tuning for emotion recognition in conversations. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Proceedings of the 31st International Conference on Computational Lingu... (2025)

[30]

He, J., Lin, H., Wang, Q., Fung, Y.R., Ji, H.: Self-correction is more than refinement: A learning framework for visual and language reasoning tasks (2025)

[31]

Hu, H., Zhou, Y., You, L., Xu, H., Wang, Q., Lian, Z., Yu, F.R., Ma, F., Cui, L.: Emobench-m: Benchmarking emotional intelligence for multimodal large language models (2026), https://arxiv.org/abs/2502.04424

[32]

tse Huang, J., Lam, M.H., Li, E.J., Ren, S., Wang, W., Jiao, W., Tu, Z., Lyu, M.R.: Emotionally numb or empathetic? Evaluating how llms feel using emotionbench (2024), https://arxiv.org/abs/2308.03656

[33]

Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., Zhou, D.: Large language models cannot self-correct reasoning yet (2023)

[34]

Jian, P., Wu, J., Sun, W., Wang, C., Ren, S., Zhang, J.: Look again, think slowly: Enhancing visual reflection in vision-language models (2025)

[35]

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: European Conference on Computer Vision (ECCV). pp. 235–251 (2016)

[36]

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners (2023), https://arxiv.org/abs/2205.11916

[37]

Laurençon, H., Tronchon, L., Cord, M., Sanh, V.: What matters when building vision-language models? (2024), https://arxiv.org/abs/2405.02246

[38]

Le, T.Q.K., Vu, N.L.V., Pham, H.H., Huynh, X.L., Nguyen, T.H., Le, M.H.N., Nguyen, Q., Nguyen, H.D.: Hdc: Hierarchical distillation for multi-level noisy consistency in semi-supervised fetal ultrasound segmentation (2025), https://arxiv.org/abs/2504.09876

[39]

Lee, S., Park, S.H., Jo, Y., Seo, M.: Volcano: mitigating multimodal hallucination through self-feedback guided revision (2024)

[40]

Lerner, J.S., Li, Y., Valdesolo, P., Kassam, K.S.: Emotion and decision making. Annual Review of Psychology 66, 799–823 (2015)

[41]

Li, B., Zhang, P., Yang, J., Zhang, Y., Pu, F., Liu, Z.: Otterhd: A high-resolution multi-modality model (2023), https://arxiv.org/abs/2311.04219

[42]

Li, C., Wang, J., Zhang, Y., Zhu, K., Hou, W., Lian, J., Luo, F., Yang, Q., Xie, X.: Large language models understand and can be enhanced by emotional stimuli (2023)

[43]

Li, C., Wang, J., Zhang, Y., Zhu, K., Wang, X., Hou, W., Lian, J., Luo, F., Yang, Q., Xie, X.: The good, the bad, and why: Unveiling emotions in generative ai (2023)

[44]

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models (2023)

[45]

Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi-modal models (2024), https://arxiv.org/abs/2311.06607

[46]

Liang, Z., Guo, K., Liu, G., Guo, T., Zhou, Y., Yang, T., Jiao, J., Pi, R., Zhang, J., Zhang, X.: Scemqa: A scientific college entrance level multimodal question answering benchmark (2024), https://arxiv.org/abs/2402.05138

[47]

Liao, Y.H., Mahmood, R., Fidler, S., Acuna, D.: Can large vision-language models correct semantic grounding errors by themselves? (2025)

[48]

Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models (2024), https://arxiv.org/abs/2312.07533

[49]

Liu, C., Xie, Z., Zhao, S., Zhou, J., Xu, T., Li, M., Chen, E.: Speak from heart: An emotion-guided llm-based multimodal method for emotional dialogue generation. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 533–542. ICMR ’24, Association for Computing Machinery, New York, NY, USA (2024), https://doi.org/10.1145/3652583.3658104

[50]

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2024)

[51]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llavanext: Improved reasoning, ocr, and world knowledge (2024)

[52]

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

[53]

Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., Qiao, Y.: Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In: European Conference on Computer Vision (ECCV) (2024)

[54]

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)

[55]

Man, F., Chen, X., Wang, H., Zhao, B., Li, H., Chen, X.: Vaeer: Visual attention-inspired emotion elicitation reasoning (2025), https://arxiv.org/abs/2505.24342

[56]

Mehrabian, A., Russell, J.A.: An Approach to Environmental Psychology. MIT Press (1974)

[57]

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., Gao, J.: Large language models: A survey (2025), https://arxiv.org/abs/2402.06196

[58]

Mozikov, M., Severin, N., Bodishtianu, V., Glushanina, M., Nasonov, I., Orekhov, D., Pekhotin, V., Makovetskiy, I., Baklashkin, M., Lavrentyev, V., et al.: Eai: Emotional decision-making of llms in strategic games and ethical dilemmas. Advances in Neural Information Processing Systems 37, 53969–54002 (2024)

[59]

Nguyen, T.H., Tran, H.L., Ngo, T.D.: Itself: Attention guided fine-grained alignment for vision-language retrieval (2026), https://arxiv.org/abs/2601.01024

[60]

Nguyen, T.H., Tran, H.L., Phan-Nguyen, H.P., Dinh, Q.V.: Hybrid, unified and iterative: A novel framework for text-based person anomaly retrieval (2025), https://arxiv.org/abs/2511.22470

[61]

Nguyen, T.H., Tran, Q.K., Quang-Hoang, A.T.: Improving generalization in visual reasoning via self-ensemble (2024), https://arxiv.org/abs/2410.20883

[62]

Nguyen-Nhu, T.A., Minh, T.D.H., To-Thanh, D., Le-Gia, P., Vo-Lan, T., Nguyen, T.H.: Ster-vlm: Spatio-temporal with enhanced reference vision-language models (2025), https://arxiv.org/abs/2508.13470

[63]

OpenAI, Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., Mądry, A., Baker-Whitcomb, A., Beutel, A., Borzunov, A., Carney, A., Chow, A., Kirillov, A., Nichol, A., Paino, A., Renzin, A., Passos, A.T., Kirillov, A., Christakis, A., Conneau, A., Kamali, A., Jabri, A., Moyer, A., Tam, A., ... (2024)

[64]

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L., Brockman, ...: GPT-4 Technical Report (2024)

[65]

Peng, S., Fu, D., Gao, L., Zhong, X., Fu, H., Tang, Z.: Multimath: Bridging visual and mathematical reasoning for large language models (2024), https://arxiv.org/abs/2409.00147

[66]

Plutchik, R.: Emotion: A Psychoevolutionary Synthesis. Harper & Row (1980)

[67]

Posner, J., Russell, J.A., Peterson, B.S.: The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology 17(3), 715–734 (2005)

[68]

Qiu, X., Jia, H., Zeng, Z., Shen, S., Meng, C., Yang, Y., Zhu, L.: Unified generation and self-verification for vision-language models via advantage decoupled preference optimization. arXiv preprint arXiv:2601.01483 (2026)

[69]

Qu, M., Hu, Y., Han, K., Wei, Y., Zhao, Y.: Recot: Reflective self-correction training for mitigating confirmation bias in large vision-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9147–9157 (2025)

[70]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

[71]

Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology 39(6), 1161 (1980)

[72]

Russell, J.A.: Culture and the categorization of emotions. Psychological Bulletin 110(3), 426–450 (1991)

[73]

Russell, J.A.: Core affect and the psychological construction of emotion. Psychological Review 110(1), 145–172 (2003)

[74]

Sabour, S., Liu, S., Zhang, Z., Liu, J.M., Zhou, J., Sunaryo, A.S., Li, J., Lee, T.M.C., Mihalcea, R., Huang, M.: Emobench: Evaluating the emotional intelligence of large language models (2024), https://arxiv.org/abs/2402.12071

[75]

Scherer, K.R.: Appraisal considered as a process of multilevel sequential checking. In: Scherer, K.R., Schorr, A., Johnstone, T. (eds.) Appraisal Processes in Emotion: Theory, Methods, Research, pp. 92–120. Oxford University Press (2001)

[76]

Scherer, K.R.: What are emotions? And how can they be measured? Social Science Information 44(4), 695–729 (2005)

[77]

Scherer, K.R.: The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion 23(7), 1307–1351 (2009)

[78]

Schwarz, N.: Feelings-as-information theory. In: Handbook of Theories of Social Psychology, pp. 289–308. Sage (2012)

[79]

Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning (2024), https://arxiv.org/abs/2403.16999

[80]

Shao, R., Li, W., Zhang, L., Zhang, R., Liu, Z., Chen, R., Nie, L.: Large vlm-based vision-language-action models for robotic manipulation: A survey (2025), https://arxiv.org/abs/2508.13073

Showing first 80 references.
