Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Jing Li; Jing Ma; Tuo Liang; Yiran Qiao; Yiren Lu; Yunlai Zhou; Yu Yin; Zhe Hu

arxiv: 2405.19088 · v3 · submitted 2024-05-29 · 💻 cs.CL · cs.CV

Zhe Hu, Tuo Liang, Jing Li, Yiren Lu, Yunlai Zhou, Yiran Qiao, Jing Ma, Yu Yin

Pith reviewed 2026-05-24 00:51 UTC · model grok-4.3

keywords: multimodal models · humor understanding · comic juxtaposition · narrative contradiction · benchmark evaluation · vision-language reasoning · AI limitations

The pith

Even state-of-the-art AI models lag behind humans at understanding humorous contradictions in two-panel comics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the YesBut benchmark to test whether large vision-language models can grasp humor that arises from contradictory narratives across juxtaposed comic panels. It evaluates models on a progression of tasks from describing literal panel content to reasoning about the implied joke. Experiments show current systems remain below human performance even on the easier variants. A reader would care because this form of nonlinear juxtaposition is central to many everyday jokes, and persistent failure here points to a concrete gap in how models process creative human expression.

Core claim

The paper presents the YesBut benchmark of two-panel comics that generate humor through narrative contradiction and demonstrates via systematic testing that recent commercial and open-source multimodal models continue to underperform humans across literal, interpretive, and deep-reasoning subtasks.

What carries the argument

The YesBut benchmark, a collection of contradictory two-panel comics paired with tasks that escalate from surface description to narrative-humor reasoning.
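As a rough illustration of how the multiple-choice end of such a task suite might be scored, the sketch below groups items by task and reports per-task accuracy. The `ChoiceItem` schema, its field names, and the `predict` callback are hypothetical and are not taken from the paper or its released data; only the task names echo the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass(frozen=True)
class ChoiceItem:
    """One multiple-choice question about a two-panel comic (hypothetical schema)."""
    comic_id: str
    task: str                # e.g. "philosophy_selection" or "title_matching"
    options: Sequence[str]   # candidate answers shown to the model
    answer_index: int        # index of the correct option in `options`


def accuracy_by_task(
    items: Iterable[ChoiceItem],
    predict: Callable[[ChoiceItem], int],
) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each task."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for item in items:
        totals[item.task] = totals.get(item.task, 0) + 1
        # `predict` stands in for any model wrapper that returns an option index.
        if predict(item) == item.answer_index:
            correct[item.task] = correct.get(item.task, 0) + 1
    return {task: correct.get(task, 0) / n for task, n in totals.items()}
```

Reporting accuracy per task rather than as one pooled number keeps the easier and harder subtasks visible separately, which is the comparison the benchmark's graduated design is meant to support.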

If this is right

Models must develop stronger mechanisms for integrating contradictory information across sequential panels to handle narrative humor.
Performance gaps on deeper reasoning subtasks indicate that literal content understanding does not automatically yield joke interpretation.
Insights from the benchmark can guide targeted data or training adjustments aimed at nonlinear narrative structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on YesBut could transfer to other juxtaposition-based humor such as memes or short video clips that rely on visual contradiction.
Persistent shortfalls here may limit AI usefulness in creative writing assistants or conversational agents that need to recognize or generate ironic content.
The benchmark could be extended to measure whether scaling alone closes the gap or whether architectural changes are required.

Load-bearing premise

The YesBut comics and task design validly measure the specific capability of understanding humorous contradictions via juxtaposition and nonlinear narratives.

What would settle it

A model that matches or exceeds average human accuracy on the full set of YesBut reasoning tasks while using the same comic set.

Figures

Figures reproduced from arXiv: 2405.19088 by Jing Li, Jing Ma, Tuo Liang, Yiran Qiao, Yiren Lu, Yunlai Zhou, Yu Yin, Zhe Hu.

**Figure 1.** We introduce YESBUT dataset for comic understanding of juxtaposed comic panels. Given a two-panel comic with a contradictory narrative, we propose several tasks including narrative understanding, underlying philosophy selection and title matching, tackling different levels of comic understanding. (Comic by Anton Gudim). In this work, we examine VLMs’ ability to understand comics, specifically focusing o…

**Figure 2.** Overview of the data construction pipeline. Pos represents the positive options, and Neg …

**Figure 3.** Human Evaluation on literal description and contradiction generation.

**Figure 5.** VLMs with image only input and image + oracle description as inputs. challenging than generating literal descriptions, which requires in-depth reasoning to compare the various aspects of both panels. A comparison of the scores for literal description and contradiction reveals a strong correlation between the two tasks: models that perform well on literal descriptions also tend to achieve good results on c…

**Figure 6.** Sample outputs of contradiction explanations generated by different vision language models, …

**Figure 7.** Sample comic with all annotated tasks. Analysis on Data Diversity. In order to show the diversity of our benchmark, we prompt ChatGPT to generate topical keywords for each comic based on its description, and then cluster these keywords. All these scenarios are presented in …

**Figure 8.** The clusters of comic topics covered by our benchmark.

**Figure 9.** Prompts for GPT based evaluations. B.4 Human Evaluation Details We present 30 random samples on each task for human evaluation. We anonymize the models and shuffle the outputs to the annotators. Following [44], we include the following aspects: • Correctness: Does the model output correctly convey the narrative of the comic? • Completeness: Does the model output cover all the important elements of the comi…

**Figure 10.** Sample outputs of model generated literal description and contradiction.

**Figure 11.** Sample outputs of model generated literal description and contradiction.

**Figure 12.** Sample outputs of model generated literal description and contradiction.

**Figure 13.** Prompts for Literal Description Generation in experiments.

**Figure 14.** Prompts for Contradiction Generation in experiments.

**Figure 15.** Prompts for Underlying Selection Task in experiments.

**Figure 16.** Prompts for Title Matching Task in experiments.

Original abstract

Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

YesBut introduces a new benchmark for multimodal humor via comic contradictions, but the abstract gives almost no methodological details to assess the results.

The paper's main contribution is the YesBut benchmark, built around two-panel comics that create humorous contradictions through juxtaposition and nonlinear narratives. It sets up a graduated set of tasks from literal panel description up to reasoning about why the contradiction is funny, then reports that current large vision-language models fall short of human performance on the harder levels. That framing is useful because it targets a specific gap in how models handle the kind of implicit contrast that drives many jokes, rather than generic visual QA or captioning. The graduated difficulty is a reasonable design choice for trying to isolate the target capability. The results directionally match what one would expect given known weaknesses in current models on creative or social reasoning. The clear soft spot is the complete absence of information on comic selection or creation criteria, dataset size, inter-annotator agreement, statistical testing, or error analysis. Without those, the performance gap cannot be evaluated for robustness or possible confounds in the data. The abstract alone does not let a reader judge whether the benchmark actually measures what it claims. This work is aimed at researchers building or evaluating multimodal models for tasks that involve humor or narrative inference. Anyone looking for new test sets in that area could get value once the methods are filled in. I would send it to peer review so the full paper can be checked on the missing details; the core idea is worth referee time even if the current write-up is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the YesBut benchmark consisting of two-panel comics that create humorous contradictions through juxtaposition and nonlinear narratives. It defines a graduated task suite ranging from literal content comprehension to narrative reasoning and reports that state-of-the-art vision-language models underperform relative to humans on these tasks.

Significance. If the benchmark and task design are shown to isolate juxtaposition-based humorous contradiction understanding, the work supplies a targeted evaluation resource that exposes a gap in current multimodal models' handling of creative, non-linear reasoning. This could usefully direct future efforts in humor comprehension and narrative inference.

major comments (2)

[Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.
[Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.
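For readers unfamiliar with the agreement statistic the first comment asks for, here is a minimal sketch of Cohen's kappa for two annotators. The text reviewed here does not say which agreement metric the authors used or would use; kappa is shown only as one standard choice, and the function name and label encoding are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well above chance, while a value near 0 means the annotators agree about as often as their label frequencies alone would predict.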

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important aspects of benchmark validity and statistical reporting. We address each point below and will revise the manuscript to strengthen these elements.

Referee: [Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.

Authors: We agree that these details are necessary to support the claim that the performance gap reflects juxtaposition understanding rather than other factors. The full manuscript describes the overall task design and data collection at a high level, but does not provide the requested specifics on selection criteria, inter-annotator agreement statistics, or explicit controls for visual style and panel ordering. In the revision we will add a dedicated subsection on benchmark construction that includes: (i) the comic selection criteria and sourcing process, (ii) inter-annotator agreement metrics, and (iii) analyses or controls addressing potential confounders such as visual style and panel ordering.

Revision: yes

Referee: [Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.

Authors: We acknowledge that the current results section lacks formal statistical tests, confidence intervals, and systematic error analysis. In the revised manuscript we will add: (i) appropriate statistical tests (e.g., paired t-tests or non-parametric equivalents) comparing model and human performance with reported p-values and confidence intervals, (ii) a breakdown of results by task difficulty level, and (iii) an error analysis that categorizes failure modes across models and tasks to demonstrate that the gaps are not driven by a small subset of items.

Revision: yes
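One non-parametric way to attach a confidence interval to a human-model accuracy gap is a paired percentile bootstrap over items, sketched below. This is an illustration of the kind of analysis the response describes, not the authors' procedure; the function name, the 0/1 per-item encoding, and the defaults (`n_resamples`, `alpha`, `seed`) are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_gap(
    human_correct: Sequence[int],
    model_correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (gap, lower, upper) for human minus model accuracy.

    Inputs are per-item 0/1 correctness scores on the same items, in the same order.
    """
    if len(human_correct) != len(model_correct) or not human_correct:
        raise ValueError("inputs must be non-empty and aligned item by item")
    n = len(human_correct)
    rng = random.Random(seed)
    # Per-item differences preserve the pairing between human and model.
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    gap = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return gap, lower, upper
```

If the resulting interval excludes zero, the gap is unlikely to be an artifact of which items happened to be sampled; an interval that collapses when a handful of items are removed would point to the "small number of items" concern the referee raises.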

Circularity Check

0 steps flagged

No significant circularity

The paper introduces the YesBut benchmark and reports empirical evaluations of multimodal LLMs on graduated tasks for humorous contradiction understanding in comics. No equations, derivations, fitted parameters, or predictions appear in the provided text. The central claim (SOTA models lag humans) rests on direct benchmark testing rather than any self-referential construction, self-citation chain, or ansatz. This is a standard empirical benchmark paper with independent, falsifiable results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or free parameters; the work is an empirical benchmark study relying on standard assumptions about model evaluation and human performance baselines.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 14 internal anchors

[1] Jessica Pressman. Digital modernism: Making it new in new media. Oxford University Press, USA, 2014
[2] Alan D Manning. Understanding comics: The invisible art. 1998
[3] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023
[4] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023
[5] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and S Yu Philip. Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), pages 2247–2256. IEEE, 2023
[6] Zhiting Hu and Tianmin Shu. Language models, agent models, and world models: The law for machine reasoning and planning. arXiv preprint arXiv:2312.05230, 2023
[7] Jack Hessel, Ana Marasovic, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, and Yejin Choi. Do androids laugh at electric sheep? humor “understanding” benchmarks from the new yorker caption contest. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computatio...
[8] Abu Rayhan, Rajan Rayhan, and Swajan Rayhan. Artificial general intelligence: Roadmap to achieving human-level capabilities, 2023
[9] Randy Duncan and Matthew J Smith. The power of comics: History, form and culture. A&C Black, 2009
[10] Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, and Zhifang Sui. Can large multimodal models uncover deep semantics behind images? arXiv preprint arXiv:2402.11281, 2024
[11] James O Young. Art and knowledge. Routledge, 2003
[12] Thierry Groensteen. Comics and narration. Univ. Press of Mississippi, 2013
[13] Eve Bearne. Rethinking literacy: Communication, representation and text. Reading, 37(3):98–103, 2003
[14] Jason Dittmer. Comic book visualities: a methodological manifesto on geography, montage and narration. Transactions of the Institute of British Geographers, 35(2):222–236, 2010
[15] Joshua Schechter. Juxtaposition: A new way to combine logics. The Review of Symbolic Logic, 4(4):560–606, 2011
[16] Paul J Kuttner, Marcus B Weaver-Hightower, and Nick Sousanis. Comics-based research: The affordances of comics for research across disciplines. Qualitative Research, 21(2):195–214, 2021
[17] Yongqi Tong, Yifan Wang, Dawei Li, Sizhe Wang, Zi Lin, Simeng Han, and Jingbo Shang. Eliminating reasoning via inferring with planning: A new framework to guide llms’ non-linear thinking. arXiv preprint arXiv:2310.12342, 2023
[18] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023
[19] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022
[20] AI@Meta. Llama 3 model card. 2024
[21] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024
[22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024
[23] Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024
[24] Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023
[25] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024
[26] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024
[27] Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023
[28] Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and compositional images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2616–2627, 2023
[29] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023
[30] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023
[31] Jerry Palmer. Taking humour seriously. Routledge, 2003
[32] Lei Chen and Chong Min Lee. Predicting audience’s laughter using convolutional neural network. arXiv preprint arXiv:1702.02584, 2017
[33] Andrew Cattle and Xiaojuan Ma. Recognizing humour using word associations and humour anchor extraction. In Proceedings of the 27th international conference on computational linguistics, pages 1849–1858, 2018
[34] Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. Humor recognition and humor anchor extraction. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 2367–2376, 2015
[35] Miriam Amin and Manuel Burghardt. A survey on approaches to computational humor generation. In Stefania DeGaetano, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, editors, Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 29–41, Online, December 2020. Int...
[36] Arjun Chandrasekaran, Ashwin K Vijayakumar, Stanislaw Antol, Mohit Bansal, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. We are humor beings: Understanding and predicting visual humor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4603–4612, 2016
[37] Dafna Shahaf, Eric Horvitz, and Robert Mankoff. Inside jokes: Identifying humorous cartoon captions. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1065–1074, 2015
[38] Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, et al. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. arXiv preprint arXiv:1506.08126, 2015
[39] Yuta Kayatani, Zekun Yang, Mayu Otani, Noa Garcia, Chenhui Chu, Yuta Nakashima, and Haruo Takemura. The laughing machine: Predicting humor in video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2073–2082, 2021
[40] Yang Liu, Tongfei Shen, Dong Zhang, Qingying Sun, Shoushan Li, and Guodong Zhou. Comment-aided video-language alignment via contrastive pre-training for short-form video humor detection. arXiv preprint arXiv:2402.09055, 2024
[41] Sophie Jentzsch and Kristian Kersting. Chatgpt is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563, 2023
[42] Nicola De Pisapia, Francesca Bacci, Danielle Parrott, and David Melcher. Brain networks for visual creativity: a functional connectivity study of planning a visual artwork. Scientific reports, 6(1):39185, 2016
[43] Mika Koivisto and Simone Grassini. Best humans still outperform artificial intelligence in a creative divergent thinking task. Scientific reports, 13(1):13601, 2023
[44] EunJeong Hwang and Vered Shwartz. MemeCap: A dataset for captioning and interpreting memes. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1433–1445, Singapore, December 2023. Association for Computational Linguistics
[45] Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, and Robert Mankoff. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara...
[46] Yuqing Wang and Yun Zhao. Gemini in reasoning: Unveiling commonsense in multimodal large language models. arXiv preprint arXiv:2312.17661, 2023
[47] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019
[48] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019
[49] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022
[50] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022
[51]

yes, but

"yes, but" series created by anton gudim. https://twitter.com/_yesbut_ . Accessed: 2024

work page 2024
[52]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

work page 2024
[54]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

[55]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

[56]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

[57]

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

[58]

InstructBLIP: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024

[59]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

[60]

Evaluation of text generation: A survey

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799, 2020

[61]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

[62]

BERTScore: Evaluating text generation with BERT

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020

[63]

CLAIR: Evaluating image captions with large language models

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. CLAIR: Evaluating image captions with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638–13646, Singapore, December 2023. Association for Computational ...

[64]

G-Eval: NLG evaluation using GPT-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational...

[65]

Americano: Argument generation with discourse-driven decomposition and agent interaction

Zhe Hu, Hou Pong Chan, and Yu Yin. Americano: Argument generation with discourse-driven decomposition and agent interaction. arXiv preprint arXiv:2310.20352, 2023

[66]

Reasoning with language model prompting: A survey

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368–53...

[67]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

[68]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

[69]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

A Data Annotation Details

Considering the workload of manually writing all components from scratch, we leverage an AI-human collaborative pipeline for annotation. Th...

