Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Pith reviewed 2026-05-24 00:51 UTC · model grok-4.3
The pith
Even state-of-the-art AI models lag behind humans at understanding humorous contradictions in two-panel comics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents the YesBut benchmark of two-panel comics that generate humor through narrative contradiction and demonstrates via systematic testing that recent commercial and open-source multimodal models continue to underperform humans across literal, interpretive, and deep-reasoning subtasks.
What carries the argument
The YesBut benchmark, a collection of contradictory two-panel comics paired with tasks that escalate from surface description to narrative-humor reasoning.
If this is right
- Models must develop stronger mechanisms for integrating contradictory information across sequential panels to handle narrative humor.
- Performance gaps on deeper reasoning subtasks indicate that literal content understanding does not automatically yield joke interpretation.
- Insights from the benchmark can guide targeted data or training adjustments aimed at nonlinear narrative structures.
Where Pith is reading between the lines
- Success on YesBut could transfer to other juxtaposition-based humor such as memes or short video clips that rely on visual contradiction.
- Persistent shortfalls here may limit the usefulness of AI in creative writing assistants or conversational agents that need to recognize or generate ironic content.
- The benchmark could be extended to measure whether scaling alone closes the gap or whether architectural changes are required.
Load-bearing premise
The YesBut comics and task design validly measure the specific capability of understanding humorous contradictions via juxtaposition and nonlinear narratives.
What would settle it
A model that matches or exceeds average human accuracy on the full set of YesBut reasoning tasks while using the same comic set.
Original abstract
Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the YesBut benchmark consisting of two-panel comics that create humorous contradictions through juxtaposition and nonlinear narratives. It defines a graduated task suite ranging from literal content comprehension to narrative reasoning and reports that state-of-the-art vision-language models underperform relative to humans on these tasks.
Significance. If the benchmark and task design are shown to isolate juxtaposition-based humorous contradiction understanding, the work supplies a targeted evaluation resource that exposes a gap in current multimodal models' handling of creative, non-linear reasoning. This could usefully direct future efforts in humor comprehension and narrative inference.
Major comments (2)
- [Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.
- [Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.
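The inter-annotator agreement requested in the first comment is commonly reported as Cohen's kappa. The following is a minimal sketch of that statistic for two annotators, not the authors' procedure: the function name `cohen_kappa` and the yes/no labels are hypothetical, and the labels are invented for illustration rather than taken from YesBut.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical judgments (illustration only): does a candidate explanation
# capture the contradiction between the two panels?
annotator_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when one label dominates. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.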
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which highlight important aspects of benchmark validity and statistical reporting. We address each point below and will revise the manuscript to strengthen these elements.
Point-by-point responses
- Referee: [Abstract] Abstract and experimental section: the central claim that SOTA models lag humans rests on the YesBut benchmark validly measuring the target capability, yet the provided text supplies no information on comic selection criteria, inter-annotator agreement, or controls for confounding factors such as visual style or panel ordering; without these, the performance gap cannot be attributed specifically to juxtaposition understanding.
  Authors: We agree that these details are necessary to support the claim that the performance gap reflects juxtaposition understanding rather than other factors. The full manuscript describes the overall task design and data collection at a high level, but does not provide the requested specifics on selection criteria, inter-annotator agreement statistics, or explicit controls for visual style and panel ordering. In the revision we will add a dedicated subsection on benchmark construction that includes: (i) the comic selection criteria and sourcing process, (ii) inter-annotator agreement metrics, and (iii) analyses or controls addressing potential confounders such as visual style and panel ordering.
  Revision: yes
- Referee: [Abstract] Results and analysis: the manuscript reports model-human gaps but omits statistical tests, confidence intervals, or error analysis (e.g., breakdown by task difficulty or failure modes), making it impossible to determine whether the observed differences are reliable or driven by a small number of items.
  Authors: We acknowledge that the current results section lacks formal statistical tests, confidence intervals, and systematic error analysis. In the revised manuscript we will add: (i) appropriate statistical tests (e.g., paired t-tests or non-parametric equivalents) comparing model and human performance with reported p-values and confidence intervals, (ii) a breakdown of results by task difficulty level, and (iii) an error analysis that categorizes failure modes across models and tasks to demonstrate that the gaps are not driven by a small subset of items.
  Revision: yes
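One non-parametric way to attach a confidence interval to a model-human gap is a paired bootstrap over items. The sketch below is an illustration under stated assumptions, not the analysis the authors commit to: the function name `paired_bootstrap_gap`, its parameters, and the per-comic 0/1 correctness scores are all hypothetical, and the numbers are invented rather than drawn from the paper.

```python
import random


def paired_bootstrap_gap(human_correct, model_correct,
                         n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed_gap, ci_low, ci_high) for human minus model accuracy."""
    if len(human_correct) != len(model_correct) or not human_correct:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(human_correct)
    # Per-item differences preserve the pairing: both scores refer to the same comic.
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the mean gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()
    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return observed, gaps[lo_idx], gaps[hi_idx]


# Hypothetical 0/1 correctness on 20 comics (illustration only, not paper data).
human_correct = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
model_correct = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
observed, ci_low, ci_high = paired_bootstrap_gap(human_correct, model_correct)
print(f"gap = {observed:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole items rather than individual scores keeps each human-model pair together, which matters when some comics are hard for both. An interval that excludes zero would suggest the gap is not driven by a handful of items; a paired t-test or a sign-flip permutation test would be reasonable alternatives.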
Circularity Check
No significant circularity
The paper introduces the YesBut benchmark and reports empirical evaluations of multimodal LLMs on graduated tasks for humorous contradiction understanding in comics. No equations, derivations, fitted parameters, or predictions appear in the provided text. The central claim (SOTA models lag humans) rests on direct benchmark testing rather than any self-referential construction, self-citation chain, or ansatz. This is a standard empirical benchmark paper with independent, falsifiable results.
Paper appendix A, Data Annotation Details (excerpt)
Considering the workload of manually writing all components from scratch, we leverage an AI-human collaborative pipeline for annotation. Th...