AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
Vision-language models show weak calibration of emotional intensity and produce shallow descriptions in affective image analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLMs demonstrate strong perception but lag in holistic Affective Image Content Analysis. AICA-Bench with its three tasks reveals weak intensity calibration and shallow open-ended descriptions across 23 models. GAT Prompting, using visual scaffolding and hierarchical reasoning, addresses these by reducing intensity errors and enhancing descriptive depth.
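For concreteness, one simple way to score the intensity-calibration failure is the mean absolute error between model-predicted and annotator intensity ratings on a shared scale. The paper's exact metric is not given in the excerpt, so the function and example values in this sketch are illustrative assumptions.

```python
# A minimal sketch of an intensity-calibration score, assuming the metric
# is something like mean absolute error (MAE) between predicted and
# annotated intensities on a shared scale. Names and values are hypothetical.

def intensity_calibration_error(predicted: list[float],
                                annotated: list[float]) -> float:
    """Mean absolute error between predicted and gold intensity ratings."""
    assert len(predicted) == len(annotated) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, annotated)) / len(predicted)

# A model that always predicts extreme intensities is poorly calibrated
# even when its emotion labels are correct:
print(intensity_calibration_error([0.9, 0.9, 0.9], [0.6, 0.3, 0.8]))  # ~0.333
```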
What carries the argument
Grounded Affective Tree (GAT) Prompting, which integrates visual scaffolding with hierarchical reasoning in a training-free manner.
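To make the mechanism concrete, here is a hedged sketch of what a GAT-style prompt could look like: visual scaffolding supplies region-level anchors, and the prompt forces hierarchical reasoning from regional cues up to a final affective judgment. The paper's actual template is not reproduced in the excerpt, so every stage name and the rating scale below are assumptions.

```python
# A hedged sketch of a GAT-style prompt, NOT the paper's actual template:
# visual scaffolding is approximated as a list of region descriptions, and
# hierarchical reasoning as an explicit percept -> evidence -> judgment
# ladder. All stage names and the 1-10 scale are illustrative assumptions.

def build_gat_prompt(regions: list[str]) -> str:
    """Assemble a tree-structured prompt over scaffolded image regions."""
    scaffold = "\n".join(f"  Region {i + 1}: {desc}"
                         for i, desc in enumerate(regions))
    return (
        "You are given an image with the following annotated regions:\n"
        f"{scaffold}\n"
        "Reason hierarchically:\n"
        "1. For each region, describe the visual cues (objects, faces, color, light).\n"
        "2. Aggregate the regions into scene-level affective evidence.\n"
        "3. Name the dominant emotion and rate its intensity from 1 to 10,\n"
        "   citing the regions that justify the rating.\n"
    )

prompt = build_gat_prompt(["a child crying in the foreground",
                           "an overcast sky above an empty playground"])
```

Grounding the final rating in enumerated regions is, plausibly, what tempers the extreme-intensity defaults that ungrounded prompting produces.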
If this is right
- GAT Prompting lowers errors in emotional intensity estimation.
- It increases the depth and quality of open-ended affective descriptions.
- The benchmark serves as a standard for assessing VLM affective capabilities.
- Future work can build on GAT as a baseline for affective multimodal systems.
Where Pith is reading between the lines
- The prompting method may apply to non-affective visual reasoning tasks requiring calibration.
- Expanding the benchmark could reveal if these limitations are specific to emotion or general to subjective judgments.
- Improved affective analysis could benefit applications like automated content filtering for emotional impact.
Load-bearing premise
That the three tasks in AICA-Bench holistically capture affective image content analysis capabilities, and that GAT's improvements generalize beyond the tested models.
What would settle it
A new VLM achieving high intensity calibration and deep descriptions without GAT, or GAT failing to improve results on additional test images, would undermine the identified limitations and the proposed solution.
Original abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AICA-Bench, a benchmark for holistic Affective Image Content Analysis (AICA) in Vision-Language Models (VLMs) consisting of three tasks—Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). It evaluates 23 VLMs, identifies limitations in weak intensity calibration and shallow open-ended descriptions, and proposes the training-free Grounded Affective Tree (GAT) Prompting framework that combines visual scaffolding with hierarchical reasoning. Experiments demonstrate that GAT reduces intensity errors and improves descriptive depth, establishing a baseline for affective multimodal research.
Significance. If the dataset construction, metrics, and statistical analyses hold up under scrutiny, this work offers a timely and useful benchmark in an underexplored area of VLM capabilities. The broad evaluation across 23 models and the practical, training-free GAT approach provide concrete starting points for improving affective perception, reasoning, and generation. The identification of specific, actionable limitations (intensity calibration and description depth) could guide targeted future improvements in multimodal affective AI.
Major comments (2)
- [§5, Experiments] The claim that GAT 'reduces intensity errors and improves descriptive depth' is presented without reported variance, statistical significance tests, or per-model breakdowns across the 23 VLMs. This detail is load-bearing for the central claim that GAT reliably mitigates the identified limitations.
- [§3, Benchmark Design] The assertion that the three tasks (EU, ER, EGCG) provide a 'holistic' measure of affective capabilities lacks explicit justification of coverage across affective dimensions or comparison to existing affective benchmarks; this assumption underpins the entire evaluation framework.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., average intensity error reduction) to support the stated improvements.
- Ensure consistent expansion of acronyms (EU, ER, EGCG) on first use in the main text and that all figure captions are fully self-contained.
Simulated Author's Rebuttal
Thank you for your constructive review and positive overall assessment of AICA-Bench. We address each major comment below and will revise the manuscript to strengthen the presentation of results and justification of the benchmark design.
Point-by-point responses
- Referee: [§5, Experiments] The claim that GAT 'reduces intensity errors and improves descriptive depth' is presented without reported variance, statistical significance tests, or per-model breakdowns across the 23 VLMs. This detail is load-bearing for the central claim that GAT reliably mitigates the identified limitations.
Authors: We appreciate this observation. The current manuscript reports aggregate improvements but does not include per-model breakdowns, standard deviations, or formal significance testing. In the revised version we will add (i) per-model performance tables for all 23 VLMs, (ii) standard deviations computed over multiple prompt runs where applicable, and (iii) statistical significance tests (e.g., paired Wilcoxon signed-rank tests) comparing GAT against the baseline prompting strategy. These additions will appear in Section 5 and the supplementary material. Revision: yes.
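For reference, the paired Wilcoxon signed-rank test the authors commit to is straightforward to run over per-model error pairs. The error values in this sketch are placeholders, not results from the paper.

```python
# A minimal sketch of the promised paired Wilcoxon signed-rank test:
# per-model intensity errors under baseline prompting vs. GAT Prompting.
# The numbers below are hypothetical, not results from the paper.
from scipy.stats import wilcoxon

baseline_err = [1.42, 1.10, 0.95, 1.30, 1.21]  # hypothetical per-model MAE
gat_err = [1.05, 0.92, 0.90, 1.01, 1.08]

# One-sided test: are baseline errors systematically larger than GAT errors?
stat, p_value = wilcoxon(baseline_err, gat_err, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")  # p < 0.05 would support a reliable improvement
```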
- Referee: [§3, Benchmark Design] The assertion that the three tasks (EU, ER, EGCG) provide a 'holistic' measure of affective capabilities lacks explicit justification of coverage across affective dimensions or comparison to existing affective benchmarks; this assumption underpins the entire evaluation framework.
Authors: We agree that the holistic framing requires more explicit support. In the revision we will expand Section 3 with (i) a mapping of the three tasks onto core affective dimensions (valence, arousal, basic emotions, and complex states), (ii) a comparison table and discussion relative to existing benchmarks (e.g., EmoSet, AffectNet, and emotion-generation datasets), and (iii) an argument that the combination of perception, reasoning, and guided generation fills a gap not covered by prior single-task evaluations. Revision: yes.
Circularity Check
No significant circularity detected
Full rationale
This is an empirical benchmark paper that introduces AICA-Bench with three tasks (EU, ER, EGCG), evaluates 23 VLMs, identifies limitations, and proposes GAT Prompting as a training-free framework. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The claims rest on standard experimental evaluation and prompting rather than any self-referential construction or tautology. The work is self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We introduce AICA-Bench... three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG)... GAT Prompting... visual scaffolding with hierarchical reasoning"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "models struggle with intensity calibration and suffer from descriptive shallowness"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. Flamingo: a visual language model for few-shot le...
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025. Qwen2.5-VL technical report. Preprint, arXiv:2502.13923. https://arxiv.org/abs/2502.13923
- [5] Sree Bhattacharyya and James Z. Wang. 2025. Evaluating vision-language models for emotion recognition. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1798–1820, Albuquerque, New Mexico. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-naacl.97
- [6] Erik Cambria. 2016. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102–107. https://doi.org/10.1109/MIS.2016.31
- [7]
- [8] Soumik Chakrabarti. 2020. Felzenszwalb segmentation: Python implementation of efficient graph-based image segmentation. https://github.com/soumik12345/felzenszwalb_segmentation. Accessed 2025-12-10.
- [9] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, and 23 others. 2025. Expanding performance boundaries of open-source multimodal models with model,... https://arxiv.org/abs/2412.05271
- [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2024. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. Preprint, arXiv:2312.14238. https://arxiv.org/abs/2312.14238
- [11] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2024. MME: A comprehensive evaluation benchmark for multimodal large language models. Preprint, arXiv:2306.13394. https://arxiv.org/abs/2306.13394
- [12] Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, and Xiongkuo Min. 2025. EEmo-Bench: A benchmark for multi-modal large language models on image evoked emotion assessment. In Proceedings of the 33rd ACM International Conference on Multimedia, MM '25, pages 7064–7073, New York, NY... https://doi.org/10.1145/3746027.3755777
- [13] He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, and Laizhong Cui. 2025. EmoBench-M: Benchmarking emotional intelligence for multimodal large language models. Preprint, arXiv:2502.04424. https://arxiv.org/abs/2502.04424
- [14] Ronak Kosti, Jose M. Alvarez, Adria Recasens, and Agata Lapedriza. 2017. EMOTIC: Emotions in context dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2309–2317. https://doi.org/10.1109/CVPRW.2017.285
- [15] Ronak Kosti, Jose M. Alvarez, Adria Recasens, and Agata Lapedriza. 2020. Context based emotion recognition using EMOTIC dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(11):2755–2766. https://doi.org/10.1109/TPAMI.2019.2916866
- [16] Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. 2025. Scaffolding coordinates to promote vision-language coordination in large multi-modal models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2886–2903, Abu Dhabi, UAE. Association for Computational Li... https://aclanthology.org/2025.coling-main.195/
- [17] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024. LLaVA-OneVision: Easy visual task transfer. Preprint, arXiv:2408.03326. https://arxiv.org/abs/2408.03326
- [18] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C.H. Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086. https://doi.org/10.48550/arXiv.2201.12086
- [19] Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, and Jianhua Tao. 2025. AffectGPT: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. In Forty-second International Conference on Mac... https://openreview.net/forum?id=xmbdACI0xu
- [20] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, and 31 others. 2023a. https://arxiv.org/abs/2211....
- [21] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, and 31 others. 2023b. https://openreview.net/foru...
- [22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024a. Improved baselines with visual instruction tuning. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286–26296. https://doi.org/10.1109/CVPR52733.2024.02484
- [23] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024b. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
- [24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.
- [25]
- [26] Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, and Tieniu Tan. 2025. Rethinking the role of prompting strategies in LLM test-time scaling: A perspective of probability theory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27962–27... https://doi.org/10.18653/v1/2025.acl-long.1356
- [27] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. 2024c. MMBench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024... https://doi.org/10.1007/978-3-031-72658-3_13
- [28]
- [29] Jana Machajdik and Allan Hanbury. 2010. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pages 83–92, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/1873951.1873965
- [30] Laurent Mertens, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. FindingEmo: An image dataset for emotion recognition in the wild. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=1q3b2Z95ec
- [31] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135
- [32] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew Gallagher. 2015. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 860–868. https://doi.org/10.1109/CVPR.2015.7298687
- [33] Rosalind W. Picard. 1997. Affective Computing. The MIT Press. https://doi.org/10.7551/mitpress/1140.001.0001
- [34] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII, pages 146–162, Berli... https://doi.org/10.1007/978-3-031-20074-8_9
- [35] João Sedoc, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2017. Predicting emotional word ratings using distributional representations and signed clustering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 564–571, Valencia,... https://aclanthology.org/E17-2090/
- [36] Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, and Xiaoguang Mao. 2024. MOSABench: Multi-object sentiment analysis benchmark for evaluating multimodal large language models understanding of complex image. Preprint, arXiv:2412.00060. https://arxiv.org/abs/2412.00060
- [37] Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates. Association for Computat... https://doi.org/10.18653/v1/2022.emnlp-main.45
- [38] Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. 2024. How many are in this image? A safety evaluation benchmark for vision LLMs. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceed... https://doi.org/10.1007/978-3-031-72983-6_3
- [39]
- [40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution... https://arxiv.org/abs/2409.12191
- [41] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. 2025. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=K2CckZjNy0
- [42] Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2023. EmoSet: A large-scale visual emotion dataset with rich attributes. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20326–20337. https://doi.org/10.1109/ICCV51070.2023.01864
- [43] Jufeng Yang, Dongyu She, and Ming Sun. 2017. Joint image emotion classification and distribution learning via deep convolutional neural network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3266–3272. https://doi.org/10.24963/ijcai.2017/456
- [44]
- [45] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h
- [46] Y. Yao, T. Yu, A. Zhang, J. Liu, H. Wang, and X. Chen. 2025. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nature Communications, 16:5509. https://doi.org/10.1038/s41467-025-61040-5
- [47] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, and 4 others. 2024. MiniCPM-V: A GPT-4V level MLLM on your phone. Preprint, arXiv:2408.01800. https://arxiv.org/abs/2408.01800
- [48] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. National Science Review, 11(12). https://doi.org/10.1093/nsr/nwae403
- [49] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016. Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 308–314. AAAI Press.
- [50] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. 2025. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Comp... https://doi.org/10.18653/v1/2025.acl-long.736
- [52] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, and 32 others. 2025b. InternVL3: Exploring advanced training and test-time recipes for open... https://arxiv.org/abs/2504.10479