arxiv: 2506.06211 · v2 · submitted 2025-06-06 · 💻 cs.CL · cs.AI · cs.CV

PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Pith reviewed 2026-05-19 10:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords puzzlehunt · multimodal reasoning · open-ended reasoning · AI benchmark · reasoning traces · visual reasoning · stepwise accuracy

The pith

The best state-of-the-art model solves only 18 percent of puzzlehunt problems and reaches 40 percent stepwise accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt problems that lack clear problem definitions and instead require models to discover structure from multimodal clues through iterative steps. It reports that most leading models achieve only 1-4 percent final answer accuracy, with the strongest solving 18 percent of puzzles and reaching 40 percent accuracy on individual steps, levels that align with human novices but fall well short of enthusiasts. The work shows that fine-tuning a small model on the provided reasoning traces lifts stepwise accuracy from 4 percent to 11 percent and produces gains on separate visual reasoning tests. These results matter because the puzzles are presented as proxies for the open-ended discovery needed in scientific investigation and data analysis, where problems are not handed to the solver in advance.
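
The two headline metrics can be illustrated with a minimal Python sketch. The paper's exact scoring procedure is not reproduced on this page, so the definitions here are assumptions: stepwise accuracy is taken to be the share of annotated reasoning steps covered by the longest matching prefix of the model's steps (in line with the Figure 7 remark that a model which misses the first step scores 0), and `step_matches`, `exact_match`, and `normalize` are hypothetical helpers standing in for whatever judge and answer normalization the authors actually use. The example steps and answer are made up.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Uppercase and keep only letters and digits (hypothetical normalization)."""
    return "".join(ch for ch in answer.upper() if ch.isalnum())


def final_answer_correct(predicted: str, reference: str) -> bool:
    """Final-answer accuracy for one puzzle: match after normalization."""
    return normalize(predicted) == normalize(reference)


def stepwise_accuracy(
    predicted_steps: Sequence[str],
    reference_steps: Sequence[str],
    step_matches: Callable[[str, str], bool],
) -> float:
    """Fraction of reference steps covered by the longest matching prefix.

    Assumption: scoring stops at the first step the model gets wrong.
    """
    if not reference_steps:
        return 0.0
    matched = 0
    for pred, ref in zip(predicted_steps, reference_steps):
        if not step_matches(pred, ref):
            break
        matched += 1
    return matched / len(reference_steps)


def exact_match(pred: str, ref: str) -> bool:
    """Toy judge: case-insensitive string equality."""
    return pred.strip().lower() == ref.strip().lower()


# Made-up example trace, not taken from the dataset.
reference = ["identify the cipher", "decode each clue", "read first letters"]
predicted = ["Identify the cipher", "sort clues alphabetically", "read first letters"]

score = stepwise_accuracy(predicted, reference, exact_match)  # one of three steps
solved = final_answer_correct("hop-scotch", "HOPSCOTCH")
```

Under this prefix-based reading, a wrong second step caps the score at one third even though the third step is correct, which is one way a benchmark could penalize the myopic reasoning the paper describes.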

Core claim

We present PuzzleWorld, a benchmark of 667 puzzlehunt-style problems each supplied with final solutions, detailed reasoning traces, and cognitive skill labels. State-of-the-art models achieve only 1-4 percent final answer accuracy. The best model solves 18 percent of the puzzles and attains 40 percent stepwise accuracy, matching novice human solvers but significantly behind enthusiasts. Fine-tuning a small model on the reasoning traces improves stepwise accuracy from 4 percent to 11 percent, with gains that transfer to downstream visual reasoning tasks. Error analysis shows models suffer from myopic reasoning, limits of language-based inference, and insufficient sketching for visual and sp

What carries the argument

PuzzleWorld benchmark of 667 annotated puzzlehunt problems that require discovering underlying problem structure from multimodal evidence without predefined instructions.

If this is right

  • Fine-tuning on detailed reasoning traces raises stepwise accuracy from 4 percent to 11 percent and transfers to other visual reasoning tasks.
  • Current models are limited by myopic reasoning and by the absence of sketching abilities needed for visual and spatial problems.
  • The performance gap between models and puzzle enthusiasts points to the need for systems that can handle open-ended structure discovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks built around iterative clue interpretation could be adapted to measure progress toward AI systems that assist in exploratory data analysis.
  • Equipping models with external sketching or diagram tools might directly address one of the reported bottlenecks in visual reasoning.
  • The novice-to-enthusiast performance difference suggests that targeted training on creative, multi-step traces could close part of the observed gap.

Load-bearing premise

The selected puzzlehunt problems and their annotations serve as a valid proxy for open-ended reasoning challenges in domains such as scientific discovery and investigative problem-solving, with human baselines accurately reflecting novice and enthusiast performance.

What would settle it

A model that reaches 50 percent solve rate on PuzzleWorld yet shows no corresponding gains on independent tests of scientific discovery or investigative problem-solving would indicate the benchmark does not measure the intended general capability.

Figures

Figures reproduced from arXiv: 2506.06211 by Adithya Balachandran, Alexander Naehu, Brendon Jiang, Chanakya Ekbote, Hengzhi Li, Justin Zhang, Megan Tjandrasuwita, Paul Pu Liang, Rebecca Chang, Regan Song, Steven-Shine Chen, Wei Dai.

Figure 1. Overview of PUZZLEWORLD: PUZZLEWORLD is a dataset of complex puzzles that lack explicit instructions, requiring solvers to deduce the final answer from nuanced, multimodal cues from the puzzle content as well as external domain-specific knowledge. The raw puzzles and solutions are sourced from PuzzledPint, and the solutions, which are PNG images, are transcribed into a sequence of reasoning steps by human …
Figure 2. Overview of samples from PUZZLEWORLD. Left: To gain a deeper understanding of model performance on PUZZLEWORLD, each puzzle is annotated with the input modalities of the puzzle content, the reasoning skills required to solve the puzzle, and step-by-step reasoning steps. Right: Example modality and reasoning skill annotations on three puzzles. High-resolution puzzle images are in Appendix B. [3, 18, 19]. The…
Figure 3. Dataset construction procedure and statistics. Left: First, we source raw puzzles and solutions from PuzzledPint. As the PuzzledPint solutions are in PDF format and are often not correctly parsed by OCR (for example, some solutions consist of annotated figures rather than a textual description), the metadata and reasoning steps for each puzzle are human-annotated. We use GPT-4 to automatically flag puzzles…
Figure 4. Illustration of metadata schema: All puzzles are annotated with their accompanying metadata, which includes the puzzle title, flavor text, difficulty, final answer, reasoning steps, input modalities, reasoning skills, and the PuzzledPint link to the puzzle. Each puzzle is annotated using a structured JSON schema comprising several key fields: a title that serves as a descriptive identifier; optional flav…
Figure 5. PUZZLEWORLD dataset statistics. Distributions of modalities and reasoning skills are balanced across different puzzles. While the majority of puzzles are classified as medium difficulty, there are a significant number of easy and hard puzzles. The number of reasoning steps follows a long-tail distribution, with many puzzle solutions requiring more than 5 steps and some hard puzzles requiring up to 30 steps…
Figure 6. Example puzzle errors. Left: (myopic reasoning) The model outputs an incorrect plan to solve the puzzle and is unable to successfully backtrack when it hits a dead end. Middle: (language bottleneck/lack of visual understanding) The model misinterprets the visual contents of the puzzle due to inherent limitations in language. Right: (sketching errors) While the model may produce a plausible plan, it fails t…
Figure 7. Stepwise accuracy distribution of GPT-o3. GPT-o3 receives stepwise accuracy of 0 for most puzzles, indicating the model frequently fails to identify even the first step of the correct reasoning trace. This highlights GPT-o3’s tendency toward myopic reasoning and its inability to recover or backtrack once committed to an incorrect path. Strong models exhibit myopic reasoning. Despite strong performance on …
Figure 8. Example failure case where GPT-o3 fails to convert a complex structured puzzle into text. While GPT-o3 correctly solves the word clues within each loop, it fails to capture the global layout when converting the puzzle into text, as shown in …
Figure 9. Reasoning skills of failed steps. We annotated the reasoning steps responsible for 30 puzzle errors with their corresponding reasoning skills. To better understand the role of sketching in model performance, we manually analyzed 30 puzzles where GPT-o3 produced incorrect answers. For each failure case, we annotated the reasoning step responsible for the error with its corresponding reasoning skill categor…
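
The metadata fields listed in the Figure 4 caption can be pictured as a small record type. The following is only a sketch: the attribute names, the validation rules, the decision to make flavor text optional, and every example value are assumptions made for illustration, and the released JSON files may name or nest these fields differently. The three difficulty levels are the ones named in the Figure 5 caption, and the URL is the PuzzledPint home page from the reference list rather than a link to a specific puzzle.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# Difficulty levels named in the Figure 5 caption.
DIFFICULTIES = ("easy", "medium", "hard")


@dataclass
class PuzzleRecord:
    """Hypothetical in-memory view of one annotated puzzle."""

    title: str
    final_answer: str
    difficulty: str
    reasoning_steps: List[str]
    modalities: List[str]
    reasoning_skills: List[str]
    source_url: str
    flavor_text: Optional[str] = None  # treated as optional in this sketch

    def __post_init__(self) -> None:
        # Illustrative sanity checks, not rules stated by the paper.
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if not self.reasoning_steps:
            raise ValueError("a record needs at least one reasoning step")


# Placeholder values, not an actual dataset entry.
record = PuzzleRecord(
    title="Placeholder puzzle",
    final_answer="EXAMPLE",
    difficulty="medium",
    reasoning_steps=["Notice the grid hides a pattern", "Read the marked letters"],
    modalities=["text", "visual"],
    reasoning_skills=["spatial", "wordplay"],
    source_url="https://puzzledpint.org/",
)

payload = json.dumps(asdict(record), indent=2)   # serialize to JSON text
restored = PuzzleRecord(**json.loads(payload))   # round-trip back to a record
```

Keeping the reasoning steps as an ordered list is what makes step-level scoring possible; a single free-text solution field would not support it.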
Original abstract

Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt-style problems annotated with final solutions, detailed reasoning traces, and cognitive skill labels to evaluate multimodal open-ended reasoning. It reports that most state-of-the-art models achieve only 1-4% final answer accuracy, with the best model solving 18% of puzzles and reaching 40% stepwise accuracy (matching human novices but lagging enthusiasts), and shows that fine-tuning a small model on the reasoning traces boosts stepwise accuracy from 4% to 11% with transfer to other visual reasoning tasks. An error analysis identifies limitations in myopic reasoning, language-based inference, and sketching.

Significance. If the human baselines and puzzle selection criteria are properly documented and validated, this work would be a meaningful contribution by providing a challenging testbed for open-ended multimodal reasoning that mirrors real-world domains like scientific discovery. The public release of the dataset and annotations, the fine-tuning experiment demonstrating the utility of the traces, and the error analysis highlighting specific model weaknesses (e.g., lack of sketching) are clear strengths that support future research on more general reasoning systems.

major comments (2)
  1. [Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.
  2. [Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.
minor comments (2)
  1. [Abstract] Abstract: The statement that 'most state-of-the-art models achieve only 1-4% final answer accuracy' would benefit from specifying the exact models evaluated and including error bars or variance measures for all reported metrics.
  2. [Error Analysis] Error Analysis: Consider adding more quantitative breakdowns or concrete examples to support claims about myopic reasoning and lack of sketching capabilities.
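
The error bars requested in minor comment 1 could be as simple as a binomial confidence interval on each reported rate. Here is a minimal sketch using the Wilson score interval. It assumes, purely for illustration, that the 18 percent solve rate was measured over all 667 puzzles (about 120 solved) and that puzzle outcomes are independent; the page does not state the actual evaluation subset or counts, and `wilson_interval` is a name introduced here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return centre - half_width, centre + half_width


# Hypothetical counts: 120 of 667 puzzles solved (about 18 percent).
low, high = wilson_interval(successes=120, trials=667)
```

Under these assumed counts the interval runs from roughly 15 to 21 percent, so the gap to the 1-4 percent range reported for most models would remain clear, while smaller differences between models might not.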

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of PuzzleWorld as a challenging testbed for open-ended multimodal reasoning. We address each major comment below and will incorporate the requested documentation into the revised manuscript.

Point-by-point responses
  1. Referee: [Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.

    Authors: We agree that the current Human Baselines section lacks sufficient quantitative detail to fully support the reported comparisons. In the revision we will expand this section with explicit information on recruitment methods for the novice and enthusiast cohorts, the number of participants assigned to each puzzle, the precise instructions and time limits given to solvers, and inter-rater reliability statistics for the reasoning-trace annotations. These additions will directly substantiate the claim that the best model reaches approximately 40% stepwise accuracy, comparable to novices yet below enthusiasts.
    revision: yes

  2. Referee: [Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.

    Authors: We acknowledge the need for greater transparency on dataset construction. The revised manuscript will include quantitative metrics on puzzle sourcing (including original sources and selection criteria), diversity statistics across puzzle types, themes, and cognitive-skill labels, and a dedicated validation subsection that explains how the chosen problems function as proxies for open-ended reasoning in scientific discovery and investigative problem-solving. These additions will reduce concerns about selection bias and strengthen the benchmark's claimed relevance.
    revision: yes
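
The inter-rater reliability statistics promised in response 1 are not specified further. One common choice for two annotators assigning categorical labels is Cohen's kappa, sketched here. Everything specific in the sketch is an assumption: that there are exactly two annotators, that each item receives a single skill label, and the label names themselves, which are invented. The authors may report a different statistic.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six reasoning steps, one label per step per annotator.
annotator_1 = ["spatial", "logic", "wordplay", "logic", "spatial", "logic"]
annotator_2 = ["spatial", "logic", "logic", "logic", "spatial", "wordplay"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six items, but once chance agreement is subtracted kappa is only about 0.45, which illustrates why raw agreement alone would not answer the referee's concern.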

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no self-referential derivations

Full rationale

The paper introduces PuzzleWorld as a new benchmark of 667 puzzles with annotations for solutions, reasoning traces, and skill labels. All reported results (model accuracies of 1-4% final answer, 18% puzzle solve rate, 40% stepwise accuracy; fine-tuning boost from 4% to 11%; error analysis on myopic reasoning and sketching limitations) are direct empirical measurements obtained by evaluating models on the released dataset and comparing against separately collected human baselines. No equations, fitted parameters, predictions derived from the same data, or self-citation chains are used to justify the central claims. The derivation chain consists solely of standard benchmark evaluation procedures that remain independent of the reported outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on domain assumptions about what constitutes open-ended reasoning rather than new mathematical axioms, free parameters, or invented entities.

axioms (1)
  • domain assumption: Puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning.
    Invoked in the abstract to contrast with conventional benchmarks and to motivate the benchmark design.

pith-pipeline@v0.9.0 · 5855 in / 1370 out tokens · 43211 ms · 2026-05-19T10:38:58.386089+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  2. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp ...

  3. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers · 11 internal anchors

  1. [1]

    The Claude 3 model family: Opus, Sonnet, Haiku

    The claude 3 model family: Opus, sonnet, haiku. URL https://api.semanticscholar.org/CorpusID:268232499

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Large language models for mathematical reasoning: Progresses and challenges, 2024

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges, 2024. URL https://arxiv.org/abs/2402.00157

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

  6. [6]

    Interactive sketchpad: A multimodal tutoring system for collaborative, visual problem-solving

    Steven-Shine Chen, Jimin Lee, and Paul Pu Liang. Interactive sketchpad: A multimodal tutoring system for collaborative, visual problem-solving. arXiv preprint arXiv:2503.16434, 2025

  7. [7]

    Modeling: A novel dataset for testing linguistic reasoning in language models

    Nathan A Chi, Teodor Malchev, Riley Kong, Ryan A Chi, Lucas Huang, Ethan A Chi, R Thomas McCoy, and Dragomir Radev. Modeling: A novel dataset for testing linguistic reasoning in language models. arXiv preprint arXiv:2406.17038, 2024

  8. [8]

    PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

    Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns. URL http://arxiv.org/abs/2403.13315

  9. [9]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019

  10. [10]

    Faithful reasoning using large language models

    Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

  11. [11]

    PUZZLES: A Benchmark for Neural Algorithmic Reasoning

    Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. PUZZLES: A Benchmark for Neural Algorithmic Reasoning. Advances in Neural Information Processing Systems, 37:127059–127098, December 2024

  12. [12]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  13. [13]

    Are language models puzzle prodigies? algorithmic puzzles unveil serious challenges in multimodal reasoning

    Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, and Soujanya Poria. Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning. URL http://arxiv.org/abs/2403.03864

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  15. [15]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

  16. [16]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

  17. [17]

    Identifying and mitigating vulnerabilities in llm-integrated applications

    Fengqing Jiang. Identifying and mitigating vulnerabilities in llm-integrated applications. Master’s thesis, University of Washington, 2024

  18. [18]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation, 2024. URL https://arxiv.org/abs/2406.00515

  19. [19]

    A closer look at logical reasoning with llms: The choice of tool matters, 2024

    Long Hei Matthew Lam, Ramya Keerthy Thatikonda, and Ehsan Shareghi. A closer look at logical reasoning with llms: The choice of tool matters, 2024. URL https://arxiv.org/abs/2406.00284.

  20. [20]

    Mimeqa: Towards socially-intelligent nonverbal foundation models

    Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models. arXiv preprint arXiv:2502.16671, 2025

  21. [21]

    Hemm: Holistic evaluation of multimodal foundation models

    Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Hemm: Holistic evaluation of multimodal foundation models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  22. [22]

    Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

    Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42, 2024

  23. [23]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024

  24. [24]

    Reasoning on graphs: Faithful and interpretable large language model reasoning

    Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. arXiv preprint arXiv:2310.01061, 2023

  25. [25]

    Social genome: Grounded social reasoning abilities of multimodal models

    Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. arXiv preprint arXiv:2502.15109, 2025

  26. [26]

    Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey

    Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024

  27. [27]

    The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

    Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain. 2023. doi: 10.48550/ARXIV.2305.07141

  28. [28]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025. Accessed: 2025-05-16

  29. [29]

    Puzzled pint

    Puzzled Pint. Puzzled pint. https://puzzledpint.org/, 2025. CC BY-NC-SA Intl. 4.0

  30. [30]

    Machine translation using deep learning: An overview

    Shashi Pal Singh, Ajai Kumar, Hemant Darbari, Lenali Singh, Anshika Rastogi, and Shikha Jain. Machine translation using deep learning: An overview. In 2017 international conference on computer, communications and electronics (comptelix), pages 162–167. IEEE, 2017

  31. [31]

    A literature review on question answering techniques, paradigms and systems

    Marco Antonio Calijorne Soares and Fernando Silva Parreiras. A literature review on question answering techniques, paradigms and systems. Journal of King Saud University-Computer and Information Sciences, 32(6):635–646, 2020

  32. [32]

    Llms achieve adult human performance on higher-order theory of mind tasks

    Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, and Robin I. M. Dunbar. Llms achieve adult human performance on higher-order theory of mind tasks, 2024. URL https://arxiv.org/abs/2405.18870

  33. [33]

    A benchmark for learning to translate a new language from one grammar book

    Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to translate a new language from one grammar book. arXiv preprint arXiv:2309.16575, 2023

  34. [34]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  35. [35]

    Qvq: To see the world with wisdom, December 2024

    Qwen Team. Qvq: To see the world with wisdom, December 2024. URL https://qwenlm.github.io/blog/qvq-72b-preview/

  36. [36]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  37. [37]

    Enigmaeval: A benchmark of long multimodal reasoning challenges

    Clinton J Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, and Dan Hendrycks. Enigmaeval: A benchmark of long multimodal reasoning challenges. arXiv preprint arXiv:2502.08859, 2025

  38. [38]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024

  39. [39]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In Proceedings of the Forty-First International Conference on Machine Learning, 2024.

  40. [40]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  41. [41]

    Review of automatic text summarization techniques & methods

    Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi Noersasongko, Abdul Syukur, Affandy Affandy, and De Rosal Ignatius Moses Setiadi. Review of automatic text summarization techniques & methods. Journal of King Saud University-Computer and Information Sciences, 34(4):1029–1046, 2022

  42. [42]

    Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts

    Tongshuang Wu, Michael Terry, and Carrie Jun Cai. Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts. In Proceedings of the 2022 CHI conference on human factors in computing systems, pages 1–22, 2022

  43. [43]

    Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  44. [44]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

  45. [45]

    Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi, 2024

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...

  46. [46]

    Kiva: Kid-inspired visual analogies for testing large multimodal models

    Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. Kiva: Kid-inspired visual analogies for testing large multimodal models. arXiv preprint arXiv:2407.17773, 2024

  47. [47]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  48. [48]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  49. [49]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  50. [50]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lingui...

  51. [51]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.