pith. machine review for the scientific record.

arxiv: 2604.05900 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords Vision-Language Models · Affective Image Analysis · Emotion Reasoning · Benchmarking · Prompting Methods · Multimodal Understanding

The pith

Vision-language models show weak calibration of emotional intensity and produce shallow descriptions in affective image analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes AICA-Bench as a new evaluation framework for vision-language models on integrated affective tasks spanning understanding, reasoning, and generation. Evaluation of 23 models highlights two limitations: weak calibration of emotional intensity and shallow open-ended descriptions. Grounded Affective Tree Prompting is introduced as a training-free method that combines visual cues with step-by-step reasoning, yielding measurable reductions in intensity errors and gains in response quality. A sympathetic reader would care because accurate affective understanding is key to safer and more intuitive AI interactions with visual content.

Core claim

VLMs demonstrate strong perception but lag in holistic Affective Image Content Analysis. AICA-Bench with its three tasks reveals weak intensity calibration and shallow open-ended descriptions across 23 models. GAT Prompting, using visual scaffolding and hierarchical reasoning, addresses these by reducing intensity errors and enhancing descriptive depth.
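To make "weak intensity calibration" concrete, here is a minimal sketch of one way an evaluation could separate valence errors from intensity errors, in the spirit of the Figure 4 breakdown. The 1-to-5 intensity scale, the tolerance, and the EUPrediction structure are illustrative assumptions, not the benchmark's published metric.

```python
# Minimal sketch: splitting EU errors into valence vs. intensity errors.
# The label set, 1-5 intensity scale, and tolerance are assumptions.
from dataclasses import dataclass

@dataclass
class EUPrediction:
    valence: str        # e.g. "positive" / "negative" / "neutral"
    intensity: float    # e.g. rating on a 1-5 scale

def error_breakdown(preds, golds, intensity_tol=0.5):
    """Count valence errors, intensity errors, and agreements over paired items."""
    valence_errors = intensity_errors = correct = 0
    for p, g in zip(preds, golds):
        if p.valence != g.valence:
            valence_errors += 1          # wrong emotional polarity
        elif abs(p.intensity - g.intensity) > intensity_tol:
            intensity_errors += 1        # right polarity, miscalibrated strength
        else:
            correct += 1
    total = len(golds)
    return {
        "valence_error_rate": valence_errors / total,
        "intensity_error_rate": intensity_errors / total,
        "accuracy": correct / total,
    }

# An "arousal bottleneck" would show up as a high intensity_error_rate
# even when valence_error_rate stays low.
preds = [EUPrediction("positive", 2.0), EUPrediction("negative", 4.5)]
golds = [EUPrediction("positive", 4.0), EUPrediction("negative", 4.0)]
print(error_breakdown(preds, golds))
```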

What carries the argument

Grounded Affective Tree (GAT) Prompting, which integrates visual scaffolding with hierarchical reasoning in a training-free manner.
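A minimal sketch of what a GAT-style prompt could look like, assuming the visual scaffolding comes from Felzenszwalb graph-based segmentation (here via scikit-image) and the hierarchical reasoning is expressed as region-to-scene-to-emotion steps. The function name, segmentation parameters, and prompt wording are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical GAT-style prompt construction: segmentation anchors + tree reasoning.
from skimage.io import imread
from skimage.segmentation import felzenszwalb
from skimage.measure import regionprops

def build_gat_prompt(image_path: str, min_region_frac: float = 0.02) -> str:
    image = imread(image_path)
    # Efficient graph-based segmentation; large contiguous regions act as anchors.
    labels = felzenszwalb(image, scale=200, sigma=0.8, min_size=500)
    h, w = labels.shape
    anchors = []
    for rp in regionprops(labels + 1):            # +1 so label 0 is not dropped as background
        if rp.area / (h * w) < min_region_frac:   # keep only large regions
            continue
        top, left, bottom, right = rp.bbox
        anchors.append(f"region {len(anchors) + 1}: bbox=({left},{top})-({right},{bottom})")
    anchor_block = "\n".join(anchors)
    # Hierarchical (tree-style) reasoning instructions, paraphrased rather than verbatim.
    return (
        "You are analysing the affective content of the attached image.\n"
        f"Visual anchors (large segmented regions):\n{anchor_block}\n\n"
        "Step 1: Describe the salient cues inside each anchor region.\n"
        "Step 2: Combine the regional cues into a scene-level interpretation.\n"
        "Step 3: Output the evoked emotion, its valence, and an intensity rating "
        "on a 1-5 scale, citing the regions that support your judgement."
    )
```

The design intent, as the paper describes it, is that explicit region anchors keep the model's attention on holistic visual evidence rather than shortcuts such as faces, while the stepwise structure forces intensity to be argued from cited cues.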

If this is right

  • GAT Prompting lowers errors in emotional intensity estimation.
  • It increases the depth and quality of open-ended affective descriptions.
  • The benchmark serves as a standard for assessing VLM affective capabilities.
  • Future work can build on GAT as a baseline for affective multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The prompting method may apply to non-affective visual reasoning tasks requiring calibration.
  • Expanding the benchmark could reveal if these limitations are specific to emotion or general to subjective judgments.
  • Improved affective analysis could benefit applications like automated content filtering for emotional impact.

Load-bearing premise

That the three tasks in AICA-Bench holistically capture affective image content analysis capabilities, and that the GAT improvements generalize beyond the tested models.

What would settle it

A new VLM achieving strong intensity calibration and deep descriptions without GAT would undercut the claimed limitations; GAT showing no improvement on additional test images would undercut the proposed solution.

Figures

Figures reproduced from arXiv: 2604.05900 by Dong She, Jinghe Yu, Liqun Chen, Xianrong Yao, Yang Gao, Zhanpeng Jin.

Figure 1: Illustration of the three affective tasks in the …
Figure 2: The instruction curation pipeline of the AICA-Bench benchmark. It consists of two stages: (1) image …
Figure 3: (a) Performance gains from model scaling diminish beyond 7B parameters. (b) Models consistently struggle with abstract art compared to realistic photos. (c) Masking facial cues causes an 11.1% performance drop, revealing the models' heavy reliance on visual shortcuts (faces) over holistic context. Gemini-2.5-Pro, Gemini-2.5-Flash, Gemini-2.0-Flash, GPT-4o, GPT-4o-Mini, Qwen-VL-Max, Qwen-VL-Plus; open-source …
Figure 4: (a) EU: the dominance of intensity errors (blue) over valence errors (red) reveals an arousal bottleneck. (b)-(c) ER & EGCG: across both tasks, models achieve high emotion alignment but consistently lack descriptive depth.
Figure 5: Representative failure cases of EU, ER, and EGCG tasks.
Figure 6: The GAT Prompting framework. The visual scaffolding is generated using an efficient graph-based image segmentation method (Chakrabarti, 2020), which creates large, contiguous regions that serve as explicit visual anchors in the prompt to guide the VLM's attention (see Appendix C for illustrative examples). Based on these segmented anchors, the prompt instructs the VLM to …
Figure 7: GAT corrects major intensity-confusion errors.
Figure 8: Example of an Evoked Emotion Prediction instruction in the EU Basic and CoT setting.
Figure 9: Example of an Expressed Emotion Recognition instruction in the EU Basic and CoT setting.
Figure 10: Examples of an ER and EGCG …
Figure 11: Human Annotation Interface. Each entry includes a "Question" prompt that instructs the expert evaluator on the task (e.g., Emotion Reasoning, Emotion-Guided Content Generation) and the specific criteria for assessment. An "Input Sample" section presents the original image-based prompt and the MLLM's generated "Output" response. Crucially, the "Evaluation Criteria" section lists the specific asp…
Figure 12: Examples of the Instruction-Tuning Dataset Format.
Figure 13: Radar charts comparing performance across tasks for eight major VLM model series (both open-source …
Figure 14: Performance comparison across subtasks (EU, ER, EGCG) and overall average scores for open-source …
Figure 15: Visual Scaffolding examples used in GAT Prompting. The red contours highlight the segmented regions …
Figure 18: A sample error case of EU-Basic.
Figure 19: A sample error case of EU-Basic.
Figure 17: A sample correct case of EU-Basic.
Figure 24: A sample correct case of EU-CoT.
Figure 25: A sample error case of EU-CoT.
Figure 26: A sample error case of EU-CoT.
Figure 27: A sample error case of EU-CoT.
Figure 28: A sample error case of EU-CoT.
Figure 29: Basic Prompting vs. CoT Prompting Case Study.
Figure 31: A sample correct case of EGCG.
Figure 32: A sample correct case of EGCG.
Figure 35: A sample correct case of EGCG.
Figure 36: A sample error case of EGCG.
Original abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AICA-Bench, a benchmark for holistic Affective Image Content Analysis (AICA) in Vision-Language Models (VLMs) consisting of three tasks—Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). It evaluates 23 VLMs, identifies limitations in weak intensity calibration and shallow open-ended descriptions, and proposes the training-free Grounded Affective Tree (GAT) Prompting framework that combines visual scaffolding with hierarchical reasoning. Experiments demonstrate that GAT reduces intensity errors and improves descriptive depth, establishing a baseline for affective multimodal research.

Significance. If the dataset construction, metrics, and statistical analyses hold up under scrutiny, this work offers a timely and useful benchmark in an underexplored area of VLM capabilities. The broad evaluation across 23 models and the practical, training-free GAT approach provide concrete starting points for improving affective perception, reasoning, and generation. The identification of specific, actionable limitations (intensity calibration and description depth) could guide targeted future improvements in multimodal affective AI.

major comments (2)
  1. [§5] §5 (Experiments): The claim that GAT 'reduces intensity errors and improves descriptive depth' is presented without reported variance, statistical significance tests, or per-model breakdowns across the 23 VLMs. This detail is load-bearing for the central claim that GAT reliably mitigates the identified limitations.
  2. [§3] §3 (Benchmark Design): The assertion that the three tasks (EU, ER, EGCG) provide a 'holistic' measure of affective capabilities lacks explicit justification of coverage across affective dimensions or comparison to existing affective benchmarks; this assumption underpins the entire evaluation framework.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., average intensity error reduction) to support the stated improvements.
  2. Ensure consistent expansion of acronyms (EU, ER, EGCG) on first use in the main text and that all figure captions are fully self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review and positive overall assessment of AICA-Bench. We address each major comment below and will revise the manuscript to strengthen the presentation of results and justification of the benchmark design.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments): The claim that GAT 'reduces intensity errors and improves descriptive depth' is presented without reported variance, statistical significance tests, or per-model breakdowns across the 23 VLMs. This detail is load-bearing for the central claim that GAT reliably mitigates the identified limitations.

    Authors: We appreciate this observation. The current manuscript reports aggregate improvements but does not include per-model breakdowns, standard deviations, or formal significance testing. In the revised version we will add (i) per-model performance tables for all 23 VLMs, (ii) standard deviations computed over multiple prompt runs where applicable, and (iii) statistical significance tests (e.g., paired Wilcoxon signed-rank tests) comparing GAT against the baseline prompting strategy; a sketch of such a test appears after these responses. These additions will appear in Section 5 and the supplementary material. revision: yes

  2. Referee: [§3] §3 (Benchmark Design): The assertion that the three tasks (EU, ER, EGCG) provide a 'holistic' measure of affective capabilities lacks explicit justification of coverage across affective dimensions or comparison to existing affective benchmarks; this assumption underpins the entire evaluation framework.

    Authors: We agree that the holistic framing requires more explicit support. In the revision we will expand Section 3 with (i) a mapping of the three tasks onto core affective dimensions (valence, arousal, basic emotions, and complex states), (ii) a comparison table and discussion relative to existing benchmarks (e.g., EmoSet, AffectNet, and emotion-generation datasets), and (iii) an argument that the combination of perception, reasoning, and guided generation fills a gap not covered by prior single-task evaluations. revision: yes
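As a concrete illustration of the significance testing proposed in the first response above, here is a minimal sketch of a paired Wilcoxon signed-rank test over per-model scores. The scores are made-up placeholders; only the procedure is meant to be informative.

```python
# Illustrative paired Wilcoxon signed-rank test: GAT vs. baseline prompting.
# The per-model error values below are hypothetical placeholders.
from scipy.stats import wilcoxon

# Hypothetical per-model intensity-error scores (lower is better) for the
# same models under baseline prompting and under GAT Prompting.
baseline_errors = [0.42, 0.38, 0.51, 0.47, 0.40, 0.55, 0.36, 0.49]
gat_errors      = [0.35, 0.33, 0.44, 0.45, 0.37, 0.46, 0.34, 0.41]

# One-sided test: is the baseline error stochastically greater than the GAT error?
stat, p_value = wilcoxon(baseline_errors, gat_errors, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4f}")
```

A paired, non-parametric test fits this setting because the same models are scored under both prompting strategies and the per-model differences need not be normally distributed.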

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

This is an empirical benchmark paper that introduces AICA-Bench with three tasks (EU, ER, EGCG), evaluates 23 VLMs, identifies limitations, and proposes GAT Prompting as a training-free framework. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The claims rest on standard experimental evaluation and prompting rather than any self-referential construction or tautology. The work is self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The work relies on standard VLM evaluation practices and prompting techniques without detailing any ad-hoc choices or new postulated components.

pith-pipeline@v0.9.0 · 5450 in / 1220 out tokens · 47746 ms · 2026-05-10T19:17:46.177896+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 37 canonical work pages · 10 internal anchors

  3. [3]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millicah, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. Flamingo: a visual language model for few-shot le...

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025. https://arxiv.org/abs/2502.13923 Qwen2.5-vl technical report . Preprint, arXiv:2502.13923

  5. [5]

    Sree Bhattacharyya and James Z. Wang. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.97 Evaluating vision-language models for emotion recognition . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1798--1820, Albuquerque, New Mexico. Association for Computational Linguistics

  6. [6]

    Erik Cambria. 2016. https://doi.org/10.1109/MIS.2016.31 Affective computing and sentiment analysis . IEEE Intelligent Systems, 31(2):102–107

  7. [7]

    Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. https://arxiv.org/abs/2006.14799 Evaluation of text generation: A survey . In NeurIPS

  8. [8]

    Soumik Chakrabarti. 2020. Felzenszwalb segmentation: Python implementation of efficient graph-based image segmentation. https://github.com/soumik12345/felzenszwalb_segmentation. Accessed: 2025-12-10

  9. [9]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, and 23 others. 2025. https://arxiv.org/abs/2412.05271 Expanding performance boundaries of open-source multimodal models with model,...

  10. [10]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2024. https://arxiv.org/abs/2312.14238 Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks . Preprint, arXiv:2312.14238

  11. [11]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2024. https://arxiv.org/abs/2306.13394 Mme: A comprehensive evaluation benchmark for multimodal large language models . Preprint, arXiv:2306.13394

  12. [12]

    Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, and Xiongkuo Min. 2025. https://doi.org/10.1145/3746027.3755777 Eemo-bench: A benchmark for multi-modal large language models on image evoked emotion assessment . In Proceedings of the 33rd ACM International Conference on Multimedia, MM '25, page 7064–7073, New York, NY...

  13. [13]

    He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, and Laizhong Cui. 2025. https://arxiv.org/abs/2502.04424 Emobench-m: Benchmarking emotional intelligence for multimodal large language models . Preprint, arXiv:2502.04424

  14. [14]

    Ronak Kosti, Jose M. Alvarez, Adria Recasens, and Agata Lapedriza. 2017. https://doi.org/10.1109/CVPRW.2017.285 Emotic: Emotions in context dataset . In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2309--2317

  15. [15]

    Ronak Kosti, Jose M. Alvarez, Adria Recasens, and Agata Lapedriza. 2020. https://doi.org/10.1109/TPAMI.2019.2916866 Context based emotion recognition using emotic dataset . IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(11):2755--2766

  16. [16]

    Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. 2025. https://aclanthology.org/2025.coling-main.195/ Scaffolding coordinates to promote vision-language coordination in large multi-modal models . In Proceedings of the 31st International Conference on Computational Linguistics, pages 2886--2903, Abu Dhabi, UAE. Association for Computational Li...

  17. [17]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024. https://arxiv.org/abs/2408.03326 Llava-onevision: Easy visual task transfer . Preprint, arXiv:2408.03326

  18. [18]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C.H. Hoi. 2022. https://doi.org/10.48550/arXiv.2201.12086 Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation . arXiv preprint arXiv:2201.12086

  19. [19]

    Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, and Jianhua Tao. 2025. https://openreview.net/forum?id=xmbdACI0xu AffectGPT: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. In Forty-second International Conference on Mac...

  20. [20]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, and 31 others. 2023 a . https://arxiv.org/abs/2211....

  21. [21]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, and 31 others. 2023 b . https://openreview.net/foru...

  22. [22]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024 a . https://doi.org/10.1109/CVPR52733.2024.02484 Improved baselines with visual instruction tuning . In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286--26296

  23. [23]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024 b . https://llava-vl.github.io/blog/2024-01-30-llava-next/ Llava-next: Improved reasoning, ocr, and world knowledge

  24. [24]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023 a . Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc

  25. [25]

    Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. 2023 b . https://api.semanticscholar.org/CorpusID:270701898 Query-relevant images jailbreak large multi-modal models . ArXiv, abs/2311.17600

  26. [26]

    Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, and Tieniu Tan. 2025. https://doi.org/10.18653/v1/2025.acl-long.1356 Rethinking the role of prompting strategies in LLM test-time scaling: A perspective of probability theory . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27962--27...

  27. [27]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. 2024 c . https://doi.org/10.1007/978-3-031-72658-3_13 Mmbench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024...

  28. [28]

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. 2024. https://arxiv.org/abs/2405.20797 Ovis: Structural embedding alignment for multimodal large language model . Preprint, arXiv:2405.20797

  29. [29]

    Jana Machajdik and Allan Hanbury. 2010. https://doi.org/10.1145/1873951.1873965 Affective image classification using features inspired by psychology and art theory . In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, page 83–92, New York, NY, USA. Association for Computing Machinery

  30. [30]

    Laurent Mertens, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. https://openreview.net/forum?id=1q3b2Z95ec Findingemo: An image dataset for emotion recognition in the wild . In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  31. [31]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. https://doi.org/10.3115/1073083.1073135 BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics

  32. [32]

    Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew Gallagher. 2015. https://doi.org/10.1109/CVPR.2015.7298687 A mixed bag of emotions: Model, predict, and transfer emotion distributions . In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 860--868

  33. [33]

    Rosalind W. Picard. 1997. https://doi.org/10.7551/mitpress/1140.001.0001 Affective Computing . The MIT Press

  34. [34]

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. https://doi.org/10.1007/978-3-031-20074-8_9 A-okvqa: A benchmark for visual question answering using world knowledge . In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII, page 146–162, Berli...

  35. [35]

    João Sedoc, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2017. https://aclanthology.org/E17-2090/ Predicting emotional word ratings using distributional representations and signed clustering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 564--571, Valencia,...

  36. [36]

    Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, and Xiaoguang Mao. 2024. https://arxiv.org/abs/2412.00060 Mosabench: Multi-object sentiment analysis benchmark for evaluating multimodal large language models understanding of complex image . Preprint, arXiv:2412.00060

  37. [37]

    Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.45 Crossmodal-3600: A massively multilingual multimodal evaluation dataset . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715--729, Abu Dhabi, United Arab Emirates. Association for Computat...

  38. [38]

    Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. 2024. https://doi.org/10.1007/978-3-031-72983-6_3 How many are in this image a safety evaluation benchmark for vision llms . In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceed...

  39. [39]

    Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, and Ying Shan. 2023. https://arxiv.org/abs/2305.12223 What makes for good visual tokenizers for large language models? Preprint, arXiv:2305.12223

  40. [40]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. https://arxiv.org/abs/2409.12191 Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution ....

  41. [41]

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. 2025. https://openreview.net/forum?id=K2CckZjNy0 AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning

  42. [42]

    Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2023. https://doi.org/10.1109/ICCV51070.2023.01864 Emoset: A large-scale visual emotion dataset with rich attributes . In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20326--20337

  43. [43]

    Jufeng Yang, Dongyu She, and Ming Sun. 2017. https://doi.org/10.24963/ijcai.2017/456 Joint image emotion classification and distribution learning via deep convolutional neural network . In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17 , pages 3266--3272

  44. [44]

    Qu Yang, Mang Ye, and Bo Du. 2024. https://arxiv.org/abs/2406.16442 Emollm: Multimodal emotional understanding meets large language models . Preprint, arXiv:2406.16442

  45. [45]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. https://openreview.net/forum?id=5Xc1ecxO1h Tree of thoughts: Deliberate problem solving with large language models . In Thirty-seventh Conference on Neural Information Processing Systems

  46. [46]

    Y. Yao, T. Yu, A. Zhang, J. Liu, H. Wang, and X. Chen. 2025. https://doi.org/10.1038/s41467-025-61040-5 Efficient gpt-4v level multimodal large language model for deployment on edge devices . Nature Communications, 16:5509

  47. [47]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, and 4 others. 2024. https://arxiv.org/abs/2408.01800 Minicpm-v: A gpt-4v level mllm on your phone . Preprint, arXiv:2408.01800

  48. [48]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. https://doi.org/10.1093/nsr/nwae403 A survey on multimodal large language models . National Science Review, 11(12)

  49. [49]

    Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016. Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, page 308–314. AAAI Press

  50. [50]

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. 2025. https://doi.org/10.18653/v1/2025.acl-long.736 MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Comp...

  51. [52]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, and 32 others. 2025 b . https://arxiv.org/abs/2504.10479 Internvl3: Exploring advanced training and test-time recipes for open...