When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
Pith reviewed 2026-05-22 22:35 UTC · model grok-4.3
The pith
Even advanced vision-language models significantly underperform humans when comprehending contradictory humor in comics via comparative reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current vision-language models cannot reliably comprehend contradictory humor in comics because they lack robust comparative reasoning between contradictory narrative elements. On the YesBut (V2) benchmark they consistently underperform humans on both surface-level and deep-reasoning tasks, with recurring errors in visual perception, key element identification, and comparative analysis, as well as hallucinations.
What carries the argument
The YesBut (V2) benchmark of 1,262 comic images with annotations for narrative understanding and four complementary tasks that test progression from surface content to comparative reasoning on contradictions.
If this is right
- Text-based training strategies can raise model performance on contradictory humor tasks.
- Social knowledge augmentation improves model handling of cultural contradictions in comics.
- Persistent model weaknesses in visual perception and comparative analysis limit AI engagement with creative visual narratives.
- Addressing these gaps would support development of context-aware models for deeper narrative understanding.
Where Pith is reading between the lines
- The benchmark results imply that standard VLM training may undervalue relational comparison across image panels.
- Similar evaluation setups could be applied to test model understanding of contradictions in other visual media such as film or illustration sequences.
- Improved comparative reasoning in VLMs might transfer to non-humor tasks that require detecting inconsistencies in visual stories.
Load-bearing premise
The annotations supplied with the YesBut (V2) benchmark accurately and consistently capture the narrative understanding and comparative reasoning demands of contradictory humor in the selected comics.
What would settle it
Demonstrating that one or more state-of-the-art VLMs reach or exceed human accuracy levels across all four tasks on the full YesBut (V2) set would falsify the claim of significant underperformance.
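The settling condition can be written as a simple per-task check. The sketch below is illustrative only: the function name, the task labels, and the accuracy values are hypothetical placeholders, not figures or names from the paper, and a real test would also need uncertainty estimates around each accuracy.

```python
from typing import Dict


def claim_falsified(model_acc: Dict[str, float], human_acc: Dict[str, float]) -> bool:
    """Return True if the model reaches or exceeds human accuracy on every task."""
    if set(model_acc) != set(human_acc):
        raise ValueError("model and human results must cover the same tasks")
    return all(model_acc[task] >= human_acc[task] for task in human_acc)


# Hypothetical placeholder numbers (NOT from the paper); task labels are invented.
human = {"task_1": 0.90, "task_2": 0.85, "task_3": 0.80, "task_4": 0.75}
model = {"task_1": 0.92, "task_2": 0.70, "task_3": 0.60, "task_4": 0.55}

print(claim_falsified(model, human))
```

Requiring the condition to hold on all four tasks mirrors the wording of the claim: a model that beats humans on a single surface-level task would not count as falsifying it.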
Original abstract
Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that even advanced vision-language models significantly underperform humans on understanding contradictory humor in comics, which requires comparative reasoning between juxtaposed panels. It introduces the YesBut (V2) benchmark of 1,262 multilingual comics with annotations for narrative understanding, evaluates VLMs across four tasks from surface comprehension to deep comparative reasoning, documents failures in perception, element identification, and comparison along with hallucinations, and tests text-based training plus social knowledge augmentation as mitigation strategies.
Significance. If the benchmark annotations reliably proxy the targeted reasoning, the work would demonstrate important gaps in VLMs for culturally nuanced narrative tasks and supply a new multilingual resource plus improvement pathways. The multilingual comic collection and explicit investigation of augmentation methods are clear strengths that could support follow-on research in vision-language reasoning.
major comments (2)
- [§3] §3 (YesBut (V2) benchmark construction): the central claim that VLMs underperform due to failures in comparative reasoning depends on the annotations faithfully capturing narrative and comparative demands. The manuscript states that the annotations are 'comprehensive' but supplies no annotation protocol, inter-annotator agreement statistics, adjudication procedure for contradictory elements, or external validation against human reasoning traces. This is load-bearing; without it, the reported model failures cannot be distinguished from possible benchmark artifacts.
- [§4] §4 (experimental evaluation): the reported performance gaps versus humans and the listed failure modes (visual perception, key element identification, comparative analysis, hallucinations) are presented without per-task quantitative breakdowns, statistical significance tests, confidence intervals, or details on human baseline collection. These omissions prevent assessment of whether the underperformance conclusion is robust across the four tasks.
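The inter-annotator agreement statistic requested in the §3 comment is commonly reported as Cohen's kappa for two annotators. The following is a minimal sketch of that computation, assuming categorical labels and exactly two annotators; the function name and the toy labels are hypothetical, and nothing here reflects the paper's actual annotation scheme.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        # Kappa is undefined here; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "y", "y", "y"]
print(cohens_kappa(annotator_1, annotator_2))
```

Free-text annotations such as narrative explanations would need a different agreement measure; kappa applies only where annotators choose among discrete categories.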
minor comments (2)
- [Abstract] The abstract lists four complementary tasks but does not name or briefly characterize them; adding one sentence would improve accessibility.
- Figure and table captions should explicitly link each visual or numeric result to the specific task and metric being measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional details where the concerns are valid.
- Referee: [§3] §3 (YesBut (V2) benchmark construction): the central claim that VLMs underperform due to failures in comparative reasoning depends on the annotations faithfully capturing narrative and comparative demands. The manuscript states that the annotations are 'comprehensive' but supplies no annotation protocol, inter-annotator agreement statistics, adjudication procedure for contradictory elements, or external validation against human reasoning traces. This is load-bearing; without it, the reported model failures cannot be distinguished from possible benchmark artifacts.
  Authors: We agree that the original manuscript would have been strengthened by an explicit description of the annotation process. In the revised version, we have expanded Section 3 with a dedicated subsection on benchmark construction that details the annotation protocol (including guidelines for identifying contradictory narrative elements), reports inter-annotator agreement statistics, describes the adjudication procedure used to resolve disagreements, and presents results from an external validation study comparing the annotations against independent human reasoning traces. These additions directly substantiate the reliability of the benchmark for the comparative-reasoning claims.
  Revision: yes
- Referee: [§4] §4 (experimental evaluation): the reported performance gaps versus humans and the listed failure modes (visual perception, key element identification, comparative analysis, hallucinations) are presented without per-task quantitative breakdowns, statistical significance tests, confidence intervals, or details on human baseline collection. These omissions prevent assessment of whether the underperformance conclusion is robust across the four tasks.
  Authors: We acknowledge that the experimental reporting in the original manuscript lacked sufficient granularity. The revised manuscript now includes per-task quantitative breakdowns for all four tasks, reports statistical significance tests on the performance gaps, provides confidence intervals for key metrics, and adds a detailed account of human baseline collection (participant count, instructions, and inter-human agreement). These changes allow readers to evaluate the robustness of the underperformance findings across tasks.
  Revision: yes
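One way to produce the confidence intervals discussed in the §4 exchange is a paired percentile bootstrap over benchmark items. The sketch below assumes that humans and the model were scored on the same items with 0/1 correctness, which the source does not confirm; the function name, parameters, and toy data are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gap_ci(
    human_correct: Sequence[int],
    model_correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired percentile bootstrap for the human-minus-model accuracy gap.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(human_correct) != len(model_correct) or len(human_correct) == 0:
        raise ValueError("inputs must be non-empty and aligned item by item")
    rng = random.Random(seed)
    n = len(human_correct)
    # Per-item difference keeps the pairing between human and model scores.
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    point = sum(diffs) / n
    resampled_means = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Toy 0/1 correctness vectors, invented for illustration only.
human_scores = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
model_scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
gap, low, high = bootstrap_gap_ci(human_scores, model_scores)
print(gap, low, high)
```

An interval that excludes zero would support a per-task underperformance claim; running the function once per task gives the breakdown the referee asked for.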
Circularity Check
No circularity: empirical benchmark evaluation is self-contained
The paper introduces the new YesBut (V2) benchmark with 1,262 comics and four tasks, then reports empirical VLM vs. human performance on those tasks. No equations, fitted parameters, or derivations are present. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (VLMs underperform on contradictory humor) rests on direct evaluation against the newly collected and annotated data rather than reducing to any prior author work or self-defined quantities by construction. This is the standard non-circular case for a benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the annotations in YesBut (V2) accurately capture narrative understanding and comparative reasoning requirements for contradictory humor.