arxiv: 2405.19088 · v3 · submitted 2024-05-29 · 💻 cs.CL · cs.CV

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Pith reviewed 2026-05-24 00:51 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal models · humor understanding · comic juxtaposition · narrative contradiction · benchmark evaluation · vision-language reasoning · AI limitations

The pith

Even state-of-the-art AI models lag behind humans at understanding humorous contradictions in two-panel comics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the YesBut benchmark to test whether large vision-language models can grasp humor that arises from contradictory narratives across juxtaposed comic panels. It evaluates models on a progression of tasks from describing literal panel content to reasoning about the implied joke. Experiments show current systems remain below human performance even on the easier variants. A reader would care because this form of nonlinear juxtaposition is central to many everyday jokes, and persistent failure here points to a concrete gap in how models process creative human expression.

Core claim

The paper presents the YesBut benchmark of two-panel comics that generate humor through narrative contradiction and demonstrates via systematic testing that recent commercial and open-source multimodal models continue to underperform humans across literal, interpretive, and deep-reasoning subtasks.

What carries the argument

The YesBut benchmark, a collection of contradictory two-panel comics paired with tasks that escalate from surface description to narrative-humor reasoning.

If this is right

  • Models must develop stronger mechanisms for integrating contradictory information across sequential panels to handle narrative humor.
  • Performance gaps on deeper reasoning subtasks indicate that literal content understanding does not automatically yield joke interpretation.
  • Insights from the benchmark can guide targeted data or training adjustments aimed at nonlinear narrative structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on YesBut could transfer to other juxtaposition-based humor such as memes or short video clips that rely on visual contradiction.
  • Persistent shortfalls here may limit AI usefulness in creative writing assistants or conversational agents that need to recognize or generate ironic content.
  • The benchmark could be extended to measure whether scaling alone closes the gap or whether architectural changes are required.

Load-bearing premise

The YesBut comics and task design validly measure the specific capability of understanding humorous contradictions via juxtaposition and nonlinear narratives.

What would settle it

A model that matches or exceeds average human accuracy on the full set of YesBut reasoning tasks while using the same comic set.
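
One possible way to operationalise this test is sketched below in Python. It is a sketch under stated assumptions: the task names, answer letters, and human accuracies are hypothetical placeholders rather than values from the paper, and requiring parity on every task is only one reading of "matches or exceeds".

```python
from typing import Dict, List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of multiple-choice items answered correctly."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def matches_humans(model_acc: Dict[str, float], human_acc: Dict[str, float]) -> bool:
    """True only if the model reaches human accuracy on every task."""
    return all(model_acc[task] >= human_acc[task] for task in human_acc)


# Hypothetical toy data: four items per task, answers given as option letters.
gold = {
    "philosophy_selection": ["A", "C", "B", "D"],
    "title_matching": ["B", "B", "A", "C"],
}
model_preds = {
    "philosophy_selection": ["A", "C", "D", "D"],  # one wrong answer
    "title_matching": ["B", "B", "A", "C"],
}
# Placeholder human baseline, NOT a figure reported in the paper.
human_acc = {"philosophy_selection": 1.0, "title_matching": 1.0}

model_acc = {task: accuracy(model_preds[task], gold[task]) for task in gold}
print(model_acc)
print(matches_humans(model_acc, human_acc))
```

A stricter or looser criterion (for example, comparing a single average across tasks) would change only `matches_humans`.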

Figures

Figures reproduced from arXiv: 2405.19088 by Jing Li, Jing Ma, Tuo Liang, Yiran Qiao, Yiren Lu, Yunlai Zhou, Yu Yin, Zhe Hu.

Figure 1
Figure 1: We introduce YESBUT dataset for comic understanding of juxtaposed comic panels. Given a two-panel comic with a contradictory narrative, we propose several tasks including narrative understanding, underlying philosophy selection and title matching, tackling different levels of comic understanding. (Comic by Anton Gudim). In this work, we examine VLMs’ ability to understand comics, specifically focusing o…
Figure 2
Figure 2: Overview of the data construction pipeline. Pos represents the positive options, and Neg …
Figure 3
Figure 3: Human Evaluation on literal description and contradiction generation.
Figure 5
Figure 5: VLMs with image only input and image + oracle description as inputs. challenging than generating literal descriptions, which requires in-depth reasoning to compare the various aspects of both panels. A comparison of the scores for literal description and contradiction reveals a strong correlation between the two tasks: models that perform well on literal descriptions also tend to achieve good results on c…
Figure 6
Figure 6: Sample outputs of contradiction explanations generated by different vision language models, …
Figure 7
Figure 7: Sample comic with all annotated tasks. Analysis on Data Diversity. In order to show the diversity of our benchmark, we prompt ChatGPT to generate topical keywords for each comic based on its description, and then cluster these keywords. All these scenarios are presented in …
Figure 8
Figure 8: The clusters of comic topics covered by our benchmark.
Figure 9
Figure 9: Prompts for GPT based evaluations. B.4 Human Evaluation Details We present 30 random samples on each task for human evaluation. We anonymize the models and shuffle the outputs to the annotators. Following [44], we include the following aspects: • Correctness: Does the model output correctly convey the narrative of the comic? • Completeness: Does the model output cover all the important elements of the comi…
Figure 10
Figure 10: Sample outputs of model generated literal description and contradiction.
Figure 11
Figure 11: Sample outputs of model generated literal description and contradiction.
Figure 12
Figure 12: Sample outputs of model generated literal description and contradiction.
Figure 13
Figure 13: Prompts for Literal Description Generation in experiments.
Figure 14
Figure 14: Prompts for Contradiction Generation in experiments.
Figure 15
Figure 15: Prompts for Underlying Selection Task in experiments.
Figure 16
Figure 16: Prompts for Title Matching Task in experiments.
Original abstract

Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the YesBut benchmark consisting of two-panel comics that create humorous contradictions through juxtaposition and nonlinear narratives. It defines a graduated task suite ranging from literal content comprehension to narrative reasoning and reports that state-of-the-art vision-language models underperform relative to humans on these tasks.

Significance. If the benchmark and task design are shown to isolate juxtaposition-based humorous contradiction understanding, the work supplies a targeted evaluation resource that exposes a gap in current multimodal models' handling of creative, non-linear reasoning. This could usefully direct future efforts in humor comprehension and narrative inference.

major comments (2)
  1. [Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.
  2. [Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.
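
To make the first comment's request for inter-annotator agreement concrete, here is a minimal sketch of Cohen's kappa for two annotators. The choice of kappa is an assumption for illustration; the text here does not say which agreement statistic, if any, the authors computed, and the labels below are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments: is the annotated contradiction acceptable?
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75) against a chance level of 0.5, which gives kappa = 0.5.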

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important aspects of benchmark validity and statistical reporting. We address each point below and will revise the manuscript to strengthen these elements.

  1. Referee: [Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.

    Authors: We agree that these details are necessary to support the claim that the performance gap reflects juxtaposition understanding rather than other factors. The full manuscript describes the overall task design and data collection at a high level, but does not provide the requested specifics on selection criteria, inter-annotator agreement statistics, or explicit controls for visual style and panel ordering. In the revision we will add a dedicated subsection on benchmark construction that includes: (i) the comic selection criteria and sourcing process, (ii) inter-annotator agreement metrics, and (iii) analyses or controls addressing potential confounders such as visual style and panel ordering. revision: yes

  2. Referee: [Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.

    Authors: We acknowledge that the current results section lacks formal statistical tests, confidence intervals, and systematic error analysis. In the revised manuscript we will add: (i) appropriate statistical tests (e.g., paired t-tests or non-parametric equivalents) comparing model and human performance with reported p-values and confidence intervals, (ii) a breakdown of results by task difficulty level, and (iii) an error analysis that categorizes failure modes across models and tasks to demonstrate that the gaps are not driven by a small subset of items. revision: yes
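
As an illustration of the kind of interval estimate promised in the second response, the sketch below computes a paired percentile-bootstrap confidence interval for the human-minus-model accuracy gap. This is a swapped-in non-parametric technique, not the authors' stated procedure (they mention paired t-tests or non-parametric equivalents), and the per-item correctness vectors are invented.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gap(
    human_correct: List[int],
    model_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gap, CI lower bound, CI upper bound).

    Inputs are 0/1 correctness flags for the SAME items, so resampling
    item indices keeps the human/model pairing intact.
    """
    if not human_correct or len(human_correct) != len(model_correct):
        raise ValueError("inputs must be non-empty and equal length")
    rng = random.Random(seed)
    n = len(human_correct)
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    observed = sum(diffs) / n

    gaps = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()

    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-item results on ten comics (1 = correct, 0 = wrong).
human = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
model = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
gap, lo, hi = paired_bootstrap_gap(human, model)
print(f"gap={gap:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

If the interval excludes zero, the gap is unlikely to be an artifact of which items happened to be sampled; with only a handful of items the interval will be wide, which is exactly the concern the referee raises.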

Circularity Check

0 steps flagged

No significant circularity

The paper introduces the YesBut benchmark and reports empirical evaluations of multimodal LLMs on graduated tasks for humorous contradiction understanding in comics. No equations, derivations, fitted parameters, or predictions appear in the provided text. The central claim (SOTA models lag humans) rests on direct benchmark testing rather than any self-referential construction, self-citation chain, or ansatz. This is a standard empirical benchmark paper with independent, falsifiable results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or free parameters; the work is an empirical benchmark study relying on standard assumptions about model evaluation and human performance baselines.

pith-pipeline@v0.9.0 · 5712 in / 801 out tokens · 19272 ms · 2026-05-24T00:51:20.253611+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 14 internal anchors

  1. [1]

    Digital modernism: Making it new in new media

    Jessica Pressman. Digital modernism: Making it new in new media. Oxford University Press, USA, 2014

  2. [2]

    Understanding comics: The invisible art

    Alan D Manning. Understanding comics: The invisible art. 1998

  3. [3]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  4. [4]

    A Survey on Multimodal Large Language Models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023

  5. [5]

    Multimodal large language models: A survey

    Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S. Yu. Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), pages 2247–2256. IEEE, 2023

  6. [6]

    Language models, agent models, and world models: The law for machine reasoning and planning

    Zhiting Hu and Tianmin Shu. Language models, agent models, and world models: The law for machine reasoning and planning. arXiv preprint arXiv:2312.05230, 2023

  7. [7]

    Do androids laugh at electric sheep? humor “understanding” benchmarks from the new yorker caption contest

    Jack Hessel, Ana Marasovic, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, and Yejin Choi. Do androids laugh at electric sheep? humor “understanding” benchmarks from the new yorker caption contest. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computatio...

  8. [8]

    Artificial general intelligence: Roadmap to achieving human-level capabilities

    Abu Rayhan, Rajan Rayhan, and Swajan Rayhan. Artificial general intelligence: Roadmap to achieving human-level capabilities, 2023

  9. [9]

    The power of comics: History, form and culture

    Randy Duncan and Matthew J Smith. The power of comics: History, form and culture. A&C Black, 2009

  10. [10]

    Can large multimodal models uncover deep semantics behind images?

    Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, and Zhifang Sui. Can large multimodal models uncover deep semantics behind images? arXiv preprint arXiv:2402.11281, 2024

  11. [11]

    Art and knowledge

    James O Young. Art and knowledge. Routledge, 2003

  12. [12]

    Thierry Groensteen. Comics and narration. Univ. Press of Mississippi, 2013

  13. [13]

    Rethinking literacy: Communication, representation and text

    Eve Bearne. Rethinking literacy: Communication, representation and text. Reading, 37(3):98–103, 2003

  14. [14]

    Comic book visualities: a methodological manifesto on geography, montage and narration

    Jason Dittmer. Comic book visualities: a methodological manifesto on geography, montage and narration. Transactions of the Institute of British Geographers, 35(2):222–236, 2010

  15. [15]

    Juxtaposition: A new way to combine logics

    Joshua Schechter. Juxtaposition: A new way to combine logics. The Review of Symbolic Logic, 4(4):560–606, 2011

  16. [16]

    Comics-based research: The affordances of comics for research across disciplines

    Paul J Kuttner, Marcus B Weaver-Hightower, and Nick Sousanis. Comics-based research: The affordances of comics for research across disciplines. Qualitative Research, 21(2):195–214, 2021

  17. [17]

    Eliminating reasoning via inferring with planning: A new framework to guide llms’ non-linear thinking

    Yongqi Tong, Yifan Wang, Dawei Li, Sizhe Wang, Zi Lin, Simeng Han, and Jingbo Shang. Eliminating reasoning via inferring with planning: A new framework to guide llms’ non-linear thinking. arXiv preprint arXiv:2310.12342, 2023

  18. [18]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  19. [19]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  20. [20]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  21. [21]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

  22. [22]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024

  23. [23]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024

  24. [24]

    Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023

  25. [25]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024

  26. [26]

    Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

  27. [27]

    Visit-bench: A benchmark for vision-language instruction following inspired by real-world use

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023

  28. [28]

    Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and compositional images

    Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and compositional images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2616–2627, 2023

  29. [29]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  30. [30]

    Seed-bench-2: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023

  31. [31]

    Taking humour seriously

    Jerry Palmer. Taking humour seriously. Routledge, 2003

  32. [32]

    Predicting Audience's Laughter Using Convolutional Neural Network

    Lei Chen and Chong Min Lee. Predicting audience’s laughter using convolutional neural network. arXiv preprint arXiv:1702.02584, 2017

  33. [33]

    Recognizing humour using word associations and humour anchor extraction

    Andrew Cattle and Xiaojuan Ma. Recognizing humour using word associations and humour anchor extraction. In Proceedings of the 27th international conference on computational linguistics, pages 1849–1858, 2018

  34. [34]

    Humor recognition and humor anchor extraction

    Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. Humor recognition and humor anchor extraction. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 2367–2376, 2015

  35. [35]

    A survey on approaches to computational humor generation

    Miriam Amin and Manuel Burghardt. A survey on approaches to computational humor generation. In Stefania DeGaetano, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, editors, Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 29–41, Online, December 2020. Int...

  36. [36]

    We are humor beings: Understanding and predicting visual humor

    Arjun Chandrasekaran, Ashwin K Vijayakumar, Stanislaw Antol, Mohit Bansal, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. We are humor beings: Understanding and predicting visual humor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4603–4612, 2016

  37. [37]

    Inside jokes: Identifying humorous cartoon captions

    Dafna Shahaf, Eric Horvitz, and Robert Mankoff. Inside jokes: Identifying humorous cartoon captions. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1065–1074, 2015

  38. [38]

    Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest

    Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, et al. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. arXiv preprint arXiv:1506.08126, 2015

  39. [39]

    The laughing machine: Predicting humor in video

    Yuta Kayatani, Zekun Yang, Mayu Otani, Noa Garcia, Chenhui Chu, Yuta Nakashima, and Haruo Takemura. The laughing machine: Predicting humor in video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2073–2082, 2021

  40. [40]

    Comment-aided video-language alignment via contrastive pre-training for short-form video humor detection

    Yang Liu, Tongfei Shen, Dong Zhang, Qingying Sun, Shoushan Li, and Guodong Zhou. Comment-aided video-language alignment via contrastive pre-training for short-form video humor detection. arXiv preprint arXiv:2402.09055, 2024

  41. [41]

    Chatgpt is fun, but it is not funny! humor is still challenging large language models

    Sophie Jentzsch and Kristian Kersting. Chatgpt is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563, 2023

  42. [42]

    Brain networks for visual creativity: a functional connectivity study of planning a visual artwork

    Nicola De Pisapia, Francesca Bacci, Danielle Parrott, and David Melcher. Brain networks for visual creativity: a functional connectivity study of planning a visual artwork. Scientific reports, 6(1):39185, 2016

  43. [43]

    Best humans still outperform artificial intelligence in a creative divergent thinking task

    Mika Koivisto and Simone Grassini. Best humans still outperform artificial intelligence in a creative divergent thinking task. Scientific reports, 13(1):13601, 2023

  44. [44]

    MemeCap: A dataset for captioning and interpreting memes

    EunJeong Hwang and Vered Shwartz. MemeCap: A dataset for captioning and interpreting memes. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1433–1445, Singapore, December 2023. Association for Computational Linguistics

  45. [45]

    Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest

    Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, and Robert Mankoff. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara...

  46. [46]

    Gemini in reasoning: Unveiling commonsense in multimodal large language models

    Yuqing Wang and Yun Zhao. Gemini in reasoning: Unveiling commonsense in multimodal large language models. arXiv preprint arXiv:2312.17661, 2023

  47. [47]

    From recognition to cognition: Visual commonsense reasoning

    Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019

  48. [48]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  49. [49]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

  50. [50]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  51. [51]

    yes, but

    "yes, but" series created by anton gudim. https://twitter.com/_yesbut_ . Accessed: 2024

  52. [52]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  53. [53]

    The claude 3 model family: Opus, sonnet, haiku

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024

  54. [54]

    Llava-next: Improved reasoning, ocr, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  55. [55]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

  56. [56]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  57. [57]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

  58. [58]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024

  59. [59]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  60. [60]

    Evaluation of text generation: A survey

    Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799, 2020

  61. [61]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

  62. [62]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020

  63. [63]

    CLAIR: Evaluating image captions with large language models

    David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. CLAIR: Evaluating image captions with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638–13646, Singapore, December 2023. Association for Computational ...

  64. [64]

    G-eval: NLG evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational...

  65. [65]

    Americano: Argument generation with discourse-driven decomposition and agent interaction

    Zhe Hu, Hou Pong Chan, and Yu Yin. Americano: Argument generation with discourse-driven decomposition and agent interaction. arXiv preprint arXiv:2310.20352, 2023

  66. [66]

    Reasoning with language model prompting: A survey

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368–53...

  67. [67]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

  68. [68]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

  69. [69]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024