PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Pith reviewed 2026-05-19 10:38 UTC · model grok-4.3
The pith
The best state-of-the-art model solves only 18 percent of puzzlehunt problems and reaches 40 percent stepwise accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present PuzzleWorld, a benchmark of 667 puzzlehunt-style problems, each supplied with a final solution, detailed reasoning traces, and cognitive skill labels. Most state-of-the-art models achieve only 1-4 percent final answer accuracy. The best model solves 18 percent of the puzzles and attains 40 percent stepwise accuracy, matching novice human solvers but significantly behind enthusiasts. Fine-tuning a small model on the reasoning traces improves stepwise accuracy from 4 percent to 11 percent, with gains that transfer to downstream visual reasoning tasks. Error analysis shows models suffer from myopic reasoning, limits of language-based inference, and insufficient sketching for visual and spatial reasoning.
What carries the argument
PuzzleWorld benchmark of 667 annotated puzzlehunt problems that require discovering underlying problem structure from multimodal evidence without predefined instructions.
If this is right
- Fine-tuning on detailed reasoning traces raises stepwise accuracy from 4 percent to 11 percent and transfers to other visual reasoning tasks.
- Current models are limited by myopic reasoning and by the absence of sketching abilities needed for visual and spatial problems.
- The performance gap between models and puzzle enthusiasts points to the need for systems that can handle open-ended structure discovery.
Where Pith is reading between the lines
- Benchmarks built around iterative clue interpretation could be adapted to measure progress toward AI systems that assist in exploratory data analysis.
- Equipping models with external sketching or diagram tools might directly address one of the reported bottlenecks in visual reasoning.
- The novice-to-enthusiast performance difference suggests that targeted training on creative, multi-step traces could close part of the observed gap.
Load-bearing premise
The selected puzzlehunt problems and their annotations serve as a valid proxy for open-ended reasoning challenges in domains such as scientific discovery and investigative problem-solving, with human baselines accurately reflecting novice and enthusiast performance.
What would settle it
A model that reaches 50 percent solve rate on PuzzleWorld yet shows no corresponding gains on independent tests of scientific discovery or investigative problem-solving would indicate the benchmark does not measure the intended general capability.
Original abstract
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
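The abstract reports two headline metrics, final answer accuracy and stepwise accuracy, but this page does not spell out how either is computed. The Python sketch below shows one plausible reading under explicit assumptions: the `PuzzleResult` record layout, the `normalize` answer-matching rule, and the macro-averaged definition of stepwise accuracy are all hypothetical and are not taken from the paper or the PuzzleWorld repository. The four records are toy data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PuzzleResult:
    """One model attempt at one puzzle (hypothetical record layout)."""
    predicted_answer: str
    gold_answer: str
    steps_correct: int  # annotated reasoning steps the model got right
    steps_total: int    # annotated reasoning steps in the reference trace


def normalize(answer: str) -> str:
    # Assumption: answers match case-insensitively, ignoring spaces and punctuation.
    return "".join(ch for ch in answer.upper() if ch.isalnum())


def final_answer_accuracy(results: List[PuzzleResult]) -> float:
    """Fraction of puzzles whose normalized final answer equals the gold answer."""
    if not results:
        return 0.0
    solved = sum(
        normalize(r.predicted_answer) == normalize(r.gold_answer) for r in results
    )
    return solved / len(results)


def stepwise_accuracy(results: List[PuzzleResult]) -> float:
    """Assumed definition: per-puzzle fraction of correct steps, averaged over puzzles."""
    fractions = [r.steps_correct / r.steps_total for r in results if r.steps_total > 0]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy data, invented for illustration only.
results = [
    PuzzleResult("Open Sesame", "OPENSESAME", 4, 4),
    PuzzleResult("banana", "CIPHER", 1, 4),
    PuzzleResult("", "LANTERN", 0, 5),
    PuzzleResult("maze", "MAZE", 3, 3),
]

print(f"final answer accuracy: {final_answer_accuracy(results):.3f}")
print(f"stepwise accuracy:     {stepwise_accuracy(results):.3f}")
```

Under this reading a model earns partial stepwise credit on puzzles it never solves, which is consistent with stepwise accuracy (40%) sitting well above the solve rate (18%) in the reported results.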
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PuzzleWorld, a benchmark of 667 puzzlehunt-style problems annotated with final solutions, detailed reasoning traces, and cognitive skill labels to evaluate multimodal open-ended reasoning. It reports that most state-of-the-art models achieve only 1-4% final answer accuracy, with the best model solving 18% of puzzles and reaching 40% stepwise accuracy (matching human novices but lagging enthusiasts), and shows that fine-tuning a small model on the reasoning traces boosts stepwise accuracy from 4% to 11% with transfer to other visual reasoning tasks. An error analysis identifies limitations in myopic reasoning, language-based inference, and sketching.
Significance. If the human baselines and puzzle selection criteria are properly documented and validated, this work would be a meaningful contribution by providing a challenging testbed for open-ended multimodal reasoning that mirrors real-world domains like scientific discovery. The public release of the dataset and annotations, the fine-tuning experiment demonstrating the utility of the traces, and the error analysis highlighting specific model weaknesses (e.g., lack of sketching) are clear strengths that support future research on more general reasoning systems.
major comments (2)
- [Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.
- [Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.
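For the inter-rater reliability requested in the first major comment, one common statistic is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below is illustrative only: the paper is not known to use this statistic, and the binary step-level judgments (`annotator_a`, `annotator_b`) are invented toy data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data: 1 = "step judged correct", 0 = "step judged incorrect".
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.3f}")
```

Here the annotators agree on 6 of 8 items (0.75), but chance agreement is 34/64, so kappa comes out at 7/15, or about 0.47. Reporting kappa alongside raw agreement would show how much of the trace-annotation agreement is more than coincidence.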
minor comments (2)
- [Abstract] Abstract: The statement that 'most state-of-the-art models achieve only 1-4% final answer accuracy' would benefit from specifying the exact models evaluated and including error bars or variance measures for all reported metrics.
- [Error Analysis] Error Analysis: Consider adding more quantitative breakdowns or concrete examples to support claims about myopic reasoning and lack of sketching capabilities.
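The request for error bars in the first minor comment can be met cheaply for any reported proportion. The sketch below computes a 95% Wilson score interval for a solve rate. The count of 120 solved puzzles is an assumption, back-computed from the reported 18% of 667 on the further assumption that the best model was run on every puzzle; the paper's actual counts may differ.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Assumed counts: roughly 18% of 667 puzzles solved by the best model.
solved, total = 120, 667
low, high = wilson_interval(solved, total)
print(f"solve rate {solved / total:.1%}, 95% Wilson interval [{low:.1%}, {high:.1%}]")
```

With these assumed counts the interval spans roughly 15 to 21 percent, which gives a sense of how much sampling noise a single-run 18% figure carries before any run-to-run model variance is considered.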
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of PuzzleWorld as a challenging testbed for open-ended multimodal reasoning. We address each major comment below and will incorporate the requested documentation into the revised manuscript.
Point-by-point responses
-
Referee: [Human Baselines] Human Baselines section: Quantitative details on recruitment of novice and enthusiast cohorts, number of participants per puzzle, instructions provided, time limits, and inter-rater reliability for the annotated reasoning traces are absent. This directly undermines the load-bearing claim that the best model matches human puzzle novices at ~40% stepwise accuracy while falling significantly behind enthusiasts.
Authors: We agree that the current Human Baselines section lacks sufficient quantitative detail to fully support the reported comparisons. In the revision we will expand this section with explicit information on recruitment methods for the novice and enthusiast cohorts, the number of participants assigned to each puzzle, the precise instructions and time limits given to solvers, and inter-rater reliability statistics for the reasoning-trace annotations. These additions will directly substantiate the claim that the best model reaches approximately 40% stepwise accuracy, comparable to novices yet below enthusiasts. revision: yes
-
Referee: [Dataset Construction] Dataset Construction section: No metrics on puzzle sourcing, diversity, or explicit validation that the selected problems serve as a valid proxy for open-ended reasoning challenges in scientific discovery and investigative problem-solving. This raises the risk that reported performance gaps are driven by selection bias rather than genuine differences in reasoning ability.
Authors: We acknowledge the need for greater transparency on dataset construction. The revised manuscript will include quantitative metrics on puzzle sourcing (including original sources and selection criteria), diversity statistics across puzzle types, themes, and cognitive-skill labels, and a dedicated validation subsection that explains how the chosen problems function as proxies for open-ended reasoning in scientific discovery and investigative problem-solving. These additions will reduce concerns about selection bias and strengthen the benchmark's claimed relevance. revision: yes
Circularity Check
No circularity: empirical benchmark results with no self-referential derivations
Full rationale
The paper introduces PuzzleWorld as a new benchmark of 667 puzzles with annotations for solutions, reasoning traces, and skill labels. All reported results (model accuracies of 1-4% final answer, 18% puzzle solve rate, 40% stepwise accuracy; fine-tuning boost from 4% to 11%; error analysis on myopic reasoning and sketching limitations) are direct empirical measurements obtained by evaluating models on the released dataset and comparing against separately collected human baselines. No equations, fitted parameters, predictions derived from the same data, or self-citation chains are used to justify the central claims. The derivation chain consists solely of standard benchmark evaluation procedures that remain independent of the reported outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems... annotated with the final solution, detailed reasoning traces, and cognitive skill labels
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.
-
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp ...
-
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.