PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Adithya Balachandran; Alexander Naehu; Brendon Jiang; Chanakya Ekbote; Hengzhi Li; Justin Zhang; Megan Tjandrasuwita; Paul Pu Liang; Rebecca Chang; Regan Song

arXiv: 2506.06211 · v2 · submitted 2025-06-06 · cs.CL · cs.AI · cs.CV

Hengzhi Li, Justin Zhang, Brendon Jiang, Alexander Naehu, Regan Song, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

Pith reviewed 2026-05-19 10:38 UTC · model grok-4.3

keywords: puzzlehunt, multimodal reasoning, open-ended reasoning, AI benchmark, reasoning traces, visual reasoning, stepwise accuracy

The pith

The best state-of-the-art model solves only 18 percent of puzzlehunt problems and reaches 40 percent stepwise accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt problems that lack clear problem definitions and instead require models to discover structure from multimodal clues through iterative steps. It reports that most leading models achieve only 1-4 percent final answer accuracy, while the strongest solves 18 percent of puzzles and gets 40 percent of individual steps right, levels that match human novices but fall well short of enthusiasts. The work shows that fine-tuning a small model on the provided reasoning traces lifts stepwise accuracy from 4 percent to 11 percent and produces gains on separate visual reasoning tests. These results matter because the puzzles are presented as proxies for the open-ended discovery needed in scientific investigation and data analysis, where problems are not handed to the solver in advance.

Core claim

We present PuzzleWorld, a benchmark of 667 puzzlehunt-style problems, each supplied with final solutions, detailed reasoning traces, and cognitive skill labels. Most state-of-the-art models achieve only 1-4 percent final answer accuracy. The best model solves 18 percent of the puzzles and attains 40 percent stepwise accuracy, matching novice human solvers but significantly behind enthusiasts. Fine-tuning a small model on the reasoning traces improves stepwise accuracy from 4 percent to 11 percent, with gains that transfer to downstream visual reasoning tasks. Error analysis shows models suffer from myopic reasoning, limits of language-based inference, and insufficient sketching for visual and spatial reasoning.
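To make the two headline metrics concrete, here is a minimal sketch of how final answer accuracy and stepwise accuracy could be computed. This summary does not spell out the paper's scoring rule, so the definition of stepwise accuracy below (the fraction of annotated reasoning steps completed before the first miss, averaged over puzzles) is an assumption. The `PuzzleResult` record, its field names, the answer normalization, and the sample data are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PuzzleResult:
    """Hypothetical per-puzzle record; field names are illustrative only."""
    predicted_answer: str
    gold_answer: str
    steps_correct: List[bool]  # one flag per annotated reasoning step, in order


def normalize(answer: str) -> str:
    # Assumed normalization: ignore case and any non-alphanumeric characters.
    return "".join(ch for ch in answer.upper() if ch.isalnum())


def final_answer_accuracy(results: List[PuzzleResult]) -> float:
    # Share of puzzles whose normalized prediction equals the normalized gold answer.
    solved = sum(
        normalize(r.predicted_answer) == normalize(r.gold_answer) for r in results
    )
    return solved / len(results)


def stepwise_accuracy(results: List[PuzzleResult]) -> float:
    # Assumed definition: per puzzle, the fraction of steps completed before
    # the first incorrect step; then the mean over all puzzles.
    def prefix_fraction(flags: List[bool]) -> float:
        completed = 0
        for ok in flags:
            if not ok:
                break
            completed += 1
        return completed / len(flags) if flags else 0.0

    return sum(prefix_fraction(r.steps_correct) for r in results) / len(results)


# Made-up results for three puzzles, purely for illustration.
results = [
    PuzzleResult("Open Sesame", "OPENSESAME", [True, True, True, True]),
    PuzzleResult("unknown", "LANTERN", [True, True, False, False]),
    PuzzleResult("unknown", "COMPASS", [False, False, False]),
]
print(final_answer_accuracy(results))
print(stepwise_accuracy(results))
```

Under this assumed definition a model can earn partial stepwise credit on a puzzle it never solves, which is one way a 40 percent stepwise score can sit alongside an 18 percent solve rate.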

What carries the argument

PuzzleWorld benchmark of 667 annotated puzzlehunt problems that require discovering underlying problem structure from multimodal evidence without predefined instructions.

If this is right

Fine-tuning on detailed reasoning traces raises stepwise accuracy from 4 percent to 11 percent and transfers to other visual reasoning tasks.
Current models are limited by myopic reasoning and by the absence of sketching abilities needed for visual and spatial problems.
The performance gap between models and puzzle enthusiasts points to the need for systems that can handle open-ended structure discovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks built around iterative clue interpretation could be adapted to measure progress toward AI systems that assist in exploratory data analysis.
Equipping models with external sketching or diagram tools might directly address one of the reported bottlenecks in visual reasoning.
The novice-to-enthusiast performance difference suggests that targeted training on creative, multi-step traces could close part of the observed gap.

Load-bearing premise

The selected puzzlehunt problems and their annotations serve as a valid proxy for open-ended reasoning challenges in domains such as scientific discovery and investigative problem-solving, with human baselines accurately reflecting novice and enthusiast performance.

What would settle it

A model that reaches 50 percent solve rate on PuzzleWorld yet shows no corresponding gains on independent tests of scientific discovery or investigative problem-solving would indicate the benchmark does not measure the intended general capability.

Figures

Figures reproduced from arXiv: 2506.06211 by Adithya Balachandran, Alexander Naehu, Brendon Jiang, Chanakya Ekbote, Hengzhi Li, Justin Zhang, Megan Tjandrasuwita, Paul Pu Liang, Rebecca Chang, Regan Song, Steven-Shine Chen, Wei Dai.

**Figure 1.** Overview of PUZZLEWORLD: PUZZLEWORLD is a dataset of complex puzzles that lack explicit instructions, requiring solvers to deduce the final answer from nuanced, multimodal cues from the puzzle content as well as external domain-specific knowledge. The raw puzzles and solutions are sourced from PuzzledPint, and the solutions, which are PNG images, are transcribed into a sequence of reasoning steps by human …

**Figure 2.** Overview of samples from PUZZLEWORLD. Left: To gain a deeper understanding of model performance on PUZZLEWORLD, each puzzle is annotated with the input modalities of the puzzle content, the reasoning skills required to solve the puzzle, and step-by-step reasoning steps. Right: Example modality and reasoning skill annotations on three puzzles. High-resolution puzzle images are in Appendix B.

**Figure 3.** Dataset construction procedure and statistics. Left: First, we source raw puzzles and solutions from PuzzledPint. As the PuzzledPint solutions are in PDF format and are often not correctly parsed by OCR (for example, some solutions consist of annotated figures rather than a textual description), the metadata and reasoning steps for each puzzle are human-annotated. We use GPT-4 to automatically flag puzzles…

**Figure 4.** Illustration of metadata schema: All puzzles are annotated with their accompanying metadata, which includes the puzzle title, flavor text, difficulty, final answer, reasoning steps, input modalities, reasoning skills, and the PuzzledPint link to the puzzle. Each puzzle is annotated using a structured JSON schema comprising several key fields: a title that serves as a descriptive identifier; optional flav…
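The fields listed in the Figure 4 caption can be pictured as a small record type. The sketch below is an illustration only: the caption names the kinds of fields but not their JSON keys, so every field name here (`final_answer`, `reasoning_steps`, `source_url`, and so on) is hypothetical, and all values are placeholders. The three difficulty levels are taken from the Figure 5 caption.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class PuzzleMetadata:
    # Hypothetical field names; only the kinds of fields come from the caption.
    title: str
    difficulty: str                    # "easy", "medium", or "hard"
    final_answer: str
    reasoning_steps: List[str]         # ordered, human-annotated steps
    modalities: List[str]              # input modalities of the puzzle content
    reasoning_skills: List[str]        # skills needed to solve the puzzle
    source_url: str                    # link back to the PuzzledPint puzzle
    flavor_text: Optional[str] = None  # described as optional in the caption

    def __post_init__(self) -> None:
        # Light validation so malformed records fail early.
        if self.difficulty not in {"easy", "medium", "hard"}:
            raise ValueError(f"unexpected difficulty: {self.difficulty!r}")
        if not self.reasoning_steps:
            raise ValueError("a puzzle needs at least one reasoning step")


# Placeholder record; none of these values come from the dataset.
record = PuzzleMetadata(
    title="Example Puzzle",
    difficulty="medium",
    final_answer="PLACEHOLDER",
    reasoning_steps=["Placeholder step 1", "Placeholder step 2"],
    modalities=["placeholder-modality"],
    reasoning_skills=["placeholder-skill"],
    source_url="https://puzzledpint.org/",
)

# Round-trip through JSON to show the record serializes cleanly.
serialized = json.dumps(asdict(record), indent=2)
restored = PuzzleMetadata(**json.loads(serialized))
print(restored == record)
```

Keeping the reasoning steps as an ordered list is what makes step-level scoring possible; the released files at the repository linked in the abstract are the authority on the actual key names.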

**Figure 5.** PUZZLEWORLD dataset statistics. Distributions of modalities and reasoning skills are balanced across different puzzles. While the majority of puzzles are classified as medium difficulty, there are a significant number of easy and hard puzzles. The number of reasoning steps follows a long-tail distribution, with many puzzle solutions requiring more than 5 steps and some hard puzzles requiring up to 30 steps…

**Figure 6.** Example puzzle errors. Left: (myopic reasoning) The model outputs an incorrect plan to solve the puzzle and is unable to successfully backtrack when it hits a dead end. Middle: (language bottleneck/lack of visual understanding) The model misinterprets the visual contents of the puzzle due to inherent limitations in language. Right: (sketching errors) While the model may produce a plausible plan, it fails t…

**Figure 7.** Stepwise accuracy distribution of GPT-o3. GPT-o3 receives stepwise accuracy of 0 for most puzzles, indicating the model frequently fails to identify even the first step of the correct reasoning trace. This highlights GPT-o3’s tendency toward myopic reasoning and its inability to recover or backtrack once committed to an incorrect path.

**Figure 8.** Example failure case where GPT-o3 fails to convert a complex structured puzzle into text. While GPT-o3 correctly solves the word clues within each loop, it fails to capture the global layout when converting the puzzle into text.

**Figure 9.** Reasoning skills of failed steps. We annotated the reasoning steps responsible for 30 puzzle errors with their corresponding reasoning skills. To better understand the role of sketching in model performance, we manually analyzed 30 puzzles where GPT-o3 produced incorrect answers.

Original abstract

Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PuzzleWorld is a new benchmark showing models struggle with open-ended multimodal puzzles, but the human baseline details are too thin to back the performance gap claims strongly.

This paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt problems for testing open-ended multimodal reasoning. The key result is that top models only reach 18% solve rate and 40% stepwise accuracy, close to human novices but well behind enthusiasts. They also show a small fine-tuning boost from the reasoning annotations.

What the paper does well is release a dataset with full annotations for solutions, traces, and cognitive skills. This supports the error analysis that calls out myopic reasoning, language bottlenecks, and weak sketching. The fine-tuning experiment is simple but shows the annotations have value beyond the benchmark itself. Putting the whole thing on GitHub is helpful for follow-up work.

The soft spots are in the human side and the proxy claim. Details on recruiting the novice and enthusiast groups, how many people solved each puzzle, the instructions given, time limits, and reliability of the trace annotations are missing. That leaves the performance gap open to questions about bias or inconsistency. Puzzle selection and diversity metrics are also not reported at a level that strongly supports using these as a stand-in for scientific discovery or investigative tasks.

This is for researchers focused on AI reasoning benchmarks and multimodal systems. People who run evaluations or want data for training on step-by-step reasoning will find it practical. It deserves a serious referee because the benchmark is new, the numbers are specific, and the experiments are reproducible with the release. I would send it to peer review, with a note to strengthen the human data section and the justification for the proxy.

Referee Report

2 major / 2 minor

Summary. The paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt-style problems annotated with final solutions, detailed reasoning traces, and cognitive skill labels to evaluate multimodal open-ended reasoning. It reports that most state-of-the-art models achieve only 1-4% final answer accuracy, with the best model solving 18% of puzzles and reaching 40% stepwise accuracy (matching human novices but lagging enthusiasts), and shows that fine-tuning a small model on the reasoning traces boosts stepwise accuracy from 4% to 11% with transfer to other visual reasoning tasks. An error analysis identifies limitations in myopic reasoning, language-based inference, and sketching.

Significance. If the human baselines and puzzle selection criteria are properly documented and validated, this work would be a meaningful contribution by providing a challenging testbed for open-ended multimodal reasoning that mirrors real-world domains like scientific discovery. The public release of the dataset and annotations, the fine-tuning experiment demonstrating the utility of the traces, and the error analysis highlighting specific model weaknesses (e.g., lack of sketching) are clear strengths that support future research on more general reasoning systems.

major comments (2)

[Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.
[Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.
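As an illustration of the inter-rater reliability statistic requested in the first major comment, the sketch below computes Cohen's kappa for two annotators who label the same items. It is a generic textbook formulation, not the authors' procedure; the label values and the two annotation lists are hypothetical, and the sketch assumes exactly one categorical label per item.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2

    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical skill labels from two annotators on four reasoning steps.
annotator_1 = ["skill-x", "skill-x", "skill-y", "skill-y"]
annotator_2 = ["skill-x", "skill-y", "skill-y", "skill-y"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels dominate the annotation set.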

minor comments (2)

[Abstract] Abstract: The statement that 'most state-of-the-art models achieve only 1-4% final answer accuracy' would benefit from specifying the exact models evaluated and including error bars or variance measures for all reported metrics.
[Error Analysis] Error Analysis: Consider adding more quantitative breakdowns or concrete examples to support claims about myopic reasoning and lack of sketching capabilities.
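For the error bars requested in the first minor comment, one common choice for a solve rate is the Wilson score interval for a binomial proportion. The sketch below is a generic implementation, not something taken from the paper. The count of 120 solved puzzles is an illustrative stand-in for roughly 18 percent of 667 and assumes the solve rate is measured over the full benchmark, which this summary does not confirm.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Illustrative count only: about 18 percent of 667 puzzles.
low, high = wilson_interval(120, 667)
print(f"solve rate {120 / 667:.3f}, interval [{low:.3f}, {high:.3f}]")
```

For rates near zero, such as the 1-4% final answer accuracies reported for most models, the Wilson interval stays inside [0, 1], whereas the simpler normal approximation can extend below zero.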

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of PuzzleWorld as a challenging testbed for open-ended multimodal reasoning. We address each major comment below and will incorporate the requested documentation into the revised manuscript.

Referee: [Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.

Authors: We agree that the current Human Baselines section lacks sufficient quantitative detail to fully support the reported comparisons. In the revision we will expand this section with explicit information on recruitment methods for the novice and enthusiast cohorts, the number of participants assigned to each puzzle, the precise instructions and time limits given to solvers, and inter-rater reliability statistics for the reasoning-trace annotations. These additions will directly substantiate the claim that the best model reaches approximately 40% stepwise accuracy, comparable to novices yet below enthusiasts. revision: yes

Referee: [Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.

Authors: We acknowledge the need for greater transparency on dataset construction. The revised manuscript will include quantitative metrics on puzzle sourcing (including original sources and selection criteria), diversity statistics across puzzle types, themes, and cognitive-skill labels, and a dedicated validation subsection that explains how the chosen problems function as proxies for open-ended reasoning in scientific discovery and investigative problem-solving. These additions will reduce concerns about selection bias and strengthen the benchmark's claimed relevance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no self-referential derivations

The paper introduces PuzzleWorld as a new benchmark of 667 puzzles with annotations for solutions, reasoning traces, and skill labels. All reported results (model accuracies of 1-4% final answer, 18% puzzle solve rate, 40% stepwise accuracy; fine-tuning boost from 4% to 11%; error analysis on myopic reasoning and sketching limitations) are direct empirical measurements obtained by evaluating models on the released dataset and comparing against separately collected human baselines. No equations, fitted parameters, predictions derived from the same data, or self-citation chains are used to justify the central claims. The derivation chain consists solely of standard benchmark evaluation procedures that remain independent of the reported outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on domain assumptions about what constitutes open-ended reasoning rather than new mathematical axioms, free parameters, or invented entities.

axioms (1)

domain assumption: Puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning.
Invoked in the abstract to contrast with conventional benchmarks and to motivate the benchmark design.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We introduce PUZZLEWORLD, a comprehensive benchmark of 667 puzzlehunt-style problems... annotated with the final solution, detailed reasoning traces, and cognitive skill labels

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp ...

CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1]

URL https://api.semanticscholar.org/ CorpusID:268232499

The claude 3 model family: Opus, sonnet, haiku. URL https://api.semanticscholar.org/ CorpusID:268232499

work page
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Large language models for mathematical reasoning: Progresses and challenges, 2024

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges, 2024. URL https://arxiv.org/abs/2402. 00157

work page 2024
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

A survey on evaluation of large language models

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology , 15(3):1–45, 2024

work page 2024
[6]

Interactive sketchpad: A multimodal tutoring system for collaborative, visual problem-solving

Steven-Shine Chen, Jimin Lee, and Paul Pu Liang. Interactive sketchpad: A multimodal tutoring system for collaborative, visual problem-solving. arXiv preprint arXiv:2503.16434, 2025

work page arXiv 2025
[7]

Modeling: A novel dataset for testing linguistic reasoning in language models

Nathan A Chi, Teodor Malchev, Riley Kong, Ryan A Chi, Lucas Huang, Ethan A Chi, R Thomas McCoy, and Dragomir Radev. Modeling: A novel dataset for testing linguistic reasoning in language models. arXiv preprint arXiv:2406.17038, 2024

work page arXiv 2024
[8]

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns. URL http://arxiv.org/abs/2403.13315

work page arXiv
[9]

On the Measure of Intelligence

François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[10]

Faithful reasoning using large language models

Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

work page arXiv 2022
[11]

Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer

Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. PUZZLES: A Benchmark for Neural Algorithmic Reasoning. Advances in Neural Information Processing Systems , 37: 127059–127098, December 2024

work page 2024
[12]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Are language models puzzle prodigies? algorithmic puzzles unveil serious challenges in multimodal reasoning

Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, and Soujanya Poria. Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning. URL http://arxiv.org/abs/2403.03864

work page arXiv
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

work page 2024
[16]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

work page arXiv 2024
[17]

Identifying and mitigating vulnerabilities in llm-integrated applications

Fengqing Jiang. Identifying and mitigating vulnerabilities in llm-integrated applications. Master’s thesis, University of Washington, 2024

work page 2024
[18]

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation, 2024. URL https://arxiv.org/abs/2406.00515

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

A closer look at logical reasoning with llms: The choice of tool matters, 2024

Long Hei Matthew Lam, Ramya Keerthy Thatikonda, and Ehsan Shareghi. A closer look at logical reasoning with llms: The choice of tool matters, 2024. URL https://arxiv.org/abs/2406.00284. 10

work page arXiv 2024
[20]

Mimeqa: Towards socially-intelligent nonverbal foundation models.arXiv preprint arXiv:2502.16671, 2025

Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models. arXiv preprint arXiv:2502.16671, 2025

work page arXiv 2025
[21]

Hemm: Holistic evaluation of multimodal foundation models

Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Hemm: Holistic evaluation of multimodal foundation models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2024

work page 2024
[22]

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42, 2024

work page 2024
[23]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR) , 2024

work page 2024
[24]

Reasoning on graphs: Faithful and interpretable large language model reasoning

Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. arXiv preprint arXiv:2310.01061, 2023

work page arXiv 2023
[25]

Social genome: Grounded social reasoning abilities of multimodal models.arXiv preprint arXiv:2502.15109, 2025

Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. arXiv preprint arXiv:2502.15109, 2025

work page arXiv 2025
[26]

Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey

Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024

work page arXiv 2024
[27]

Dickerson

Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain. 2023. doi: 10.48550/ARXIV .2305. 07141

work page internal anchor Pith review doi:10.48550/arxiv 2023
[28]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/ 2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf , 2025. Ac- cessed: 2025-05-16

work page 2025
[29]

Puzzled pint

Puzzled Pint. Puzzled pint. https://puzzledpint.org/, 2025. CC BY-NC-SA Intl. 4.0

work page 2025
[30]

Ma- chine translation using deep learning: An overview

Shashi Pal Singh, Ajai Kumar, Hemant Darbari, Lenali Singh, Anshika Rastogi, and Shikha Jain. Ma- chine translation using deep learning: An overview. In 2017 international conference on computer , communications and electronics (comptelix), pages 162–167. IEEE, 2017

work page 2017
[31]

A literature review on question answering techniques, paradigms and systems

Marco Antonio Calijorne Soares and Fernando Silva Parreiras. A literature review on question answering techniques, paradigms and systems. Journal of King Saud University-Computer and Information Sciences , 32(6):635–646, 2020

work page 2020
[32]

Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, and Robin I. M. Dunbar. Llms achieve adult human performance on higher-order theory of mind tasks, 2024. URL https://arxiv.org/abs/2405.18870

work page arXiv 2024
[33]

A benchmark for learning to translate a new language from one grammar book

Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to translate a new language from one grammar book. arXiv preprint arXiv:2309.16575, 2023

work page arXiv 2023
[34]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35] Qwen Team. QVQ: To see the world with wisdom, December 2024. URL https://qwenlm.github.io/blog/qvq-72b-preview/
[36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[37] Clinton J Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, and Dan Hendrycks. EnigmaEval: A benchmark of long multimodal reasoning challenges. arXiv preprint arXiv:2502.08859, 2025.
[38] Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
[39] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In Proceedings of the Forty-First International Conference on Machine Learning, 2024.
[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[41] Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi Noersasongko, Abdul Syukur, Affandy Affandy, and De Rosal Ignatius Moses Setiadi. Review of automatic text summarization techniques & methods. Journal of King Saud University-Computer and Information Sciences, 34(4):1029–1046, 2022.
[42] Tongshuang Wu, Michael Terry, and Carrie Jun Cai. AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–22, 2022.
[43] Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind's eye of LLMs: Visualization-of-thought elicits spatial reasoning in large language models. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
[44] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.
[45] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI, 2024 ...
[46] Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. KiVA: Kid-inspired visual analogies for testing large multimodal models. arXiv preprint arXiv:2407.17773, 2024.
[47] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI ...
[48] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
[50] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lingui...
[51] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

A Limitations and Broader Impact

To ensure consistency and standardization across the dataset, ...

