pith. machine review for the scientific record.

arxiv: 2605.06040 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Recognition: unknown

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords novelty · tree-of-thought · LLM reasoning · search pruning · planning · token efficiency · language models · reasoning benchmarks

The pith

Scoring the novelty of each thought with an LLM prompt lets tree-of-thought search be pruned, cutting overall token costs on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the efficiency of tree-of-thought methods for large language models on reasoning and planning problems. It defines a measurable novelty score that captures how unique a newly generated thought is compared to all thoughts already present in the search tree. This score is obtained by prompting the LLM itself, drawing on its pre-trained knowledge to make the comparison. Branches with low novelty are pruned, which shrinks the overall tree size. The result is lower total token consumption even though novelty assessment adds prompts at each step. Experiments on multiple benchmarks confirm that performance stays competitive while search scope decreases.

Core claim

By estimating the novelty of each new thought through an LLM prompt that compares it to prior nodes in the tree, and then pruning branches with low novelty, the scope of tree-of-thought search can be reduced. This procedure lowers overall token consumption compared with standard tree-of-thought despite the extra prompts per state, and it achieves comparable results on language-based planning and general reasoning benchmarks.

What carries the argument

The novelty metric, which quantifies the uniqueness of a new thought relative to the existing search tree by prompting an LLM and enables pruning of low-novelty branches.
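The mechanism the argument rests on is a generate, score, filter loop. A minimal sketch, with a word-overlap proxy standing in for the paper's LLM novelty prompt (the proxy, the `propose` callable, and the 0.3 threshold are illustrative assumptions, not details from the paper):

```python
def novelty_score(thought: str, seen: list[str]) -> float:
    # Toy stand-in for the paper's LLM novelty prompt: fraction of the
    # candidate's words that no previously seen node has used.
    words = set(thought.lower().split())
    seen_words = {w for s in seen for w in s.lower().split()}
    return len(words - seen_words) / max(len(words), 1)

def expand_with_pruning(frontier, propose, threshold=0.3):
    """One tree-of-thought expansion step: generate children for each
    frontier thought and keep only those whose novelty clears the threshold."""
    seen = list(frontier)
    next_frontier = []
    for thought in frontier:
        for candidate in propose(thought):
            if novelty_score(candidate, seen) >= threshold:
                next_frontier.append(candidate)
            seen.append(candidate)  # assumption: pruned thoughts still count as seen
    return next_frontier

# A duplicate proposal is pruned; a genuinely new one survives.
kept = expand_with_pruning(["solve for x"],
                           lambda t: ["try algebra", "try algebra"])
assert kept == ["try algebra"]
```

Whether pruned candidates should remain in the comparison set is exactly the kind of detail the paper's prompt design has to pin down.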

If this is right

  • Pruning low-novelty branches reduces the number of expanded nodes while preserving solution quality on the tested benchmarks.
  • Overall token usage falls because the smaller tree requires fewer LLM generations despite the added novelty checks.
  • The method applies directly to both planning and general reasoning tasks by directing search toward more original reasoning steps.
  • Within a fixed token budget, novelty pruning permits deeper or wider exploration than unpruned tree-of-thought.
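The budget bullets above reduce to simple accounting. A sketch with purely illustrative numbers (none are taken from the paper's experiments):

```python
def net_token_delta(pruned_subtrees: int, avg_tokens_per_subtree: int,
                    novelty_checks: int, tokens_per_check: int) -> int:
    """Tokens saved by never expanding pruned subtrees, minus the cost of
    the extra novelty-check prompts. Positive means pruning paid off."""
    saved = pruned_subtrees * avg_tokens_per_subtree
    spent = novelty_checks * tokens_per_check
    return saved - spent

# Pruning wins when subtree expansion is expensive relative to a check.
assert net_token_delta(10, 2000, 100, 150) == 5000
# It loses when checks outnumber and outweigh the savings.
assert net_token_delta(2, 500, 100, 150) == -14000
```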

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same novelty-checking step could be added to other LLM search procedures such as beam search to improve efficiency without external heuristics.
  • Over repeated use the approach might reveal whether LLMs can internally judge which reasoning directions are worth pursuing.
  • Scaling the method to longer-horizon tasks would test whether novelty remains a good proxy for progress when solution paths are more complex.

Load-bearing premise

That an LLM prompt can reliably estimate the true uniqueness of a thought using only pre-trained knowledge such that pruning does not discard paths needed for correct solutions.

What would settle it

A benchmark instance where novelty-pruned searches produce wrong answers on tasks that standard tree-of-thought solves correctly, or where total tokens used increase rather than decrease.
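That settling condition is mechanical enough to script. A hypothetical harness shape (field names and records are assumed for illustration, not taken from the paper):

```python
def falsifying_instances(results):
    """Report benchmark instances that would settle the question against
    novelty pruning: the pruned search answers wrong where standard
    tree-of-thought answers right, or it spends at least as many tokens."""
    hits = []
    for r in results:
        accuracy_regression = r["tot_correct"] and not r["pruned_correct"]
        token_regression = r["pruned_tokens"] >= r["tot_tokens"]
        if accuracy_regression or token_regression:
            hits.append(r["task"])
    return hits

# Illustrative records only.
results = [
    {"task": "blocksworld-3", "tot_correct": True, "pruned_correct": True,
     "tot_tokens": 9000, "pruned_tokens": 6200},
    {"task": "game24-17", "tot_correct": True, "pruned_correct": False,
     "tot_tokens": 8000, "pruned_tokens": 5000},
]
assert falsifying_instances(results) == ["game24-17"]
```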

Figures

Figures reproduced from arXiv: 2605.06040 by Leon Hamm, Zlatan Ajanovic.

Figure 1: Distributions of computed widths and percentage
Figure 2: Comparison of performance and average token
read the original abstract

Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible "paths" of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes transferring the concept of novelty from width-based planning to Tree-of-Thought (ToT) search for LLMs. It defines a novelty metric for individual thoughts (nodes) by prompting the LLM to assess uniqueness relative to previously generated nodes in the search tree, using the model's pre-trained knowledge. This metric is used to prune low-novelty branches, aiming to shrink the overall tree size and achieve net reductions in token usage despite the extra prompts required per state. The method is evaluated on benchmarks for language-based planning and general reasoning tasks.

Significance. If the novelty-based pruning reliably preserves solution paths while reducing tree size, the approach could improve the efficiency of ToT-style reasoning without sacrificing accuracy, addressing a key limitation of high token costs in current LLM planning methods. It explicitly connects ideas from classical AI planning (novelty search) to LLM prompting, which is a constructive direction if the empirical results hold.

major comments (2)
  1. [Abstract and experimental evaluation] The central claim of net token-cost reduction (Abstract) is load-bearing on the assumption that LLM-estimated novelty safely prunes non-critical branches. However, the manuscript provides no analysis (e.g., in the experimental results or ablation sections) of whether pruned thoughts would have enabled later solutions on the benchmarks; success rates with vs. without novelty pruning on retained vs. discarded paths are not reported.
  2. [Method description] The novelty metric is introduced as an LLM prompt that leverages pre-trained knowledge (Abstract), but no quantitative validation or correlation study is given showing that this estimate aligns with task-specific usefulness rather than superficial similarity. This leaves the pruning correctness unverified, especially since the method adds prompts per state.
minor comments (2)
  1. [Abstract] The abstract states that the procedure 'is tested and compared' but supplies no specific benchmark names, metrics, or quantitative outcomes; these details should be summarized upfront for clarity.
  2. [Method] Notation for the novelty score and pruning threshold is described only qualitatively; an explicit formula or pseudocode would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major points below and commit to revisions that strengthen the validation of our novelty-based pruning approach.

read point-by-point responses
  1. Referee: [Abstract and experimental evaluation] The central claim of net token-cost reduction (Abstract) is load-bearing on the assumption that LLM-estimated novelty safely prunes non-critical branches. However, the manuscript provides no analysis (e.g., in the experimental results or ablation sections) of whether pruned thoughts would have enabled later solutions on the benchmarks; success rates with vs. without novelty pruning on retained vs. discarded paths are not reported.

    Authors: We agree that a more granular analysis of pruned paths would better support the central claim. Our reported results already show that success rates with novelty pruning remain competitive with standard ToT across the benchmarks while achieving net token reductions. To address the gap, the revised manuscript will add an ablation study that compares success rates with and without pruning and examines a sample of discarded thoughts (via post-hoc simulation) to determine whether any could have led to solutions. revision: yes

  2. Referee: [Method description] The novelty metric is introduced as an LLM prompt that leverages pre-trained knowledge (Abstract), but no quantitative validation or correlation study is given showing that this estimate aligns with task-specific usefulness rather than superficial similarity. This leaves the pruning correctness unverified, especially since the method adds prompts per state.

    Authors: The end-to-end benchmark results provide indirect validation: the method delivers net token savings without meaningful accuracy loss, implying the LLM novelty estimates are sufficiently aligned with task-relevant distinctions rather than mere surface similarity. We acknowledge that a direct correlation analysis would increase confidence. In revision we will add a quantitative validation subsection that reports correlations between novelty scores and (a) human judgments of usefulness on a sampled subset and (b) alternative embedding-based similarity metrics. revision: yes
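The correlation study promised in (b) needs nothing more exotic than a correlation coefficient. A sketch with made-up scores that shows the shape of the analysis, not the paper's data:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairs: LLM-prompted novelty vs. embedding distance to the
# nearest prior thought (higher = more distinct under both signals).
llm_novelty = [0.9, 0.7, 0.4, 0.2]
embedding_distance = [0.8, 0.6, 0.5, 0.1]
assert pearson(llm_novelty, embedding_distance) > 0.9
```

A high coefficient would support the rebuttal's claim that the prompt tracks more than surface similarity; a low one would vindicate the referee.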

Circularity Check

0 steps flagged

No significant circularity in novelty definition or pruning proposal

full rationale

The paper defines novelty explicitly as the uniqueness of a new thought relative to prior nodes in the search tree and estimates it via an external LLM prompt that draws on pre-trained knowledge. This construction does not reduce to a self-referential equation, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. No equations or derivations in the abstract or description equate the output metric to its own inputs by construction; the method is presented as a heuristic extension of tree-of-thought that relies on the LLM's independent capabilities rather than internal consistency loops. The central efficiency claim (pruning reduces tree size despite extra prompts) remains an empirical proposal open to external validation rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLM pre-training encodes sufficient general knowledge to judge thought novelty for pruning decisions.

axioms (1)
  • domain assumption: LLM can estimate novelty using pre-trained knowledge
    The paper states novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training.
invented entities (1)
  • Novelty metric for thoughts (no independent evidence)
    purpose: To quantify uniqueness of new thoughts relative to the search tree for pruning decisions
    Introduced in the paper as a measurable concept without external validation or independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5495 in / 1309 out tokens · 63051 ms · 2026-05-08T10:38:00.081346+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 22 canonical work pages · 10 internal anchors

  1. [1]

    Reinforcement Learning: An Introduction

    Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction

  2. [2]

    Width and Serialization of Classical Planning Problems

    Lipovetzky, Nir and Geffner, Hector. Width and Serialization of Classical Planning Problems. Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012)

  3. [3]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report

  4. [4]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Benjamin and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Dey, Sreemanti and Shubh-Agrawal and Sandha, Sandeep Singh and Naidu, Siddartha Venkat and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

  5. [5]

    Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

    Liu, Hanmeng and Ning, Ruoxi and Teng, Zhiyang and Liu, Jian and Zhou, Qiji and Zhang, Yue. Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4. arXiv:2304.03439

  6. [6]

    GLoRE: Evaluating Logical Reasoning of Large Language Models

    Liu, Hanmeng and Teng, Zhiyang and Ning, Ruoxi and Liu, Jian and Zhou, Qiji and Zhang, Yue. GLoRE: Evaluating Logical Reasoning of Large Language Models. arXiv:2310.09107

  7. [7]

    On the Planning Abilities of Large Language Models - A Critical Investigation

    Valmeekam, Karthik and Marquez, Matthew and Sreedharan, Sarath and Kambhampati, Subbarao. On the Planning Abilities of Large Language Models - A Critical Investigation. Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

  8. [8]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

  9. [9]

    Thought of Search: Planning with Language Models Through The Lens of Efficiency

    Thought of Search: Planning with Language Models Through The Lens of Efficiency. The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

  10. [10]

    Planning for Novelty: Width-Based Algorithms for Common Problems in Control, Planning and Reinforcement Learning

    Lipovetzky, Nir. Planning for Novelty: Width-Based Algorithms for Common Problems in Control, Planning and Reinforcement Learning

  11. [11]

    Proceedings of the 37th International Conference on Neural Information Processing Systems

    Valmeekam, Karthik and Marquez, Matthew and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao. Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS 2023)

  12. [12]

    Practical Planning: Extending the Classical AI Planning Paradigm

    Wilkins, David E. Practical Planning: Extending the Classical AI Planning Paradigm

  13. [13]

    STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving

    Fikes, Richard E. and Nilsson, Nils J. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 1971. doi:10.1016/0004-3702(71)90010-5

  14. [14]

    PDDL | The Planning Domain Definition Language

    McDermott, Drew and Ghallab, Malik and Howe, Adele and Knoblock, Craig and Ram, Ashwin and Veloso, Manuela and Weld, Daniel and Wilkins, David. PDDL | The Planning Domain Definition Language

  15. [15]

    The FF Planning System: Fast Plan Generation Through Heuristic Search

    Hoffmann, Jörg and Nebel, Bernhard. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research

  16. [16]

    The Fast Downward Planning System

    Helmert, Malte. The Fast Downward Planning System. Journal of Artificial Intelligence Research

  17. [17]

    A Survey on Evaluation of Large Language Models

    Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing. A Survey on Evaluation of Large Language Models. ACM Computing Surveys

  18. [18]

    A Survey of Large Language Models

    Zhao, Wayne Xin and Zhou, Kun and Li, Junyi and Tang, Tianyi and Wang, Xiaolei and Hou, Yupeng and Min, Yingqian and Zhang, Beichen and Zhang, Junjie and Dong, Zican and Du, Yifan and Yang, Chen and Chen, Yushuo and Chen, Zhipeng and Jiang, Jinhao and Ren, Ruiyang and Li, Yifan and Tang, Xinyu and Liu, Zikang and Liu, Peiyu and Nie, Jian-Yun and Wen, ...

  19. [19]

    Emergent Abilities of Large Language Models

    Wei, Jason and Tay, Yi and Bommasani, Rishi and Raffel, Colin and Zoph, Barret and Borgeaud, Sebastian and Yogatama, Dani and Bosma, Maarten and Zhou, Denny and Metzler, Donald and Chi, Ed H. and Hashimoto, Tatsunori and Vinyals, Oriol and Liang, Percy and Dean, Jeff and Fedus, William. Emergent Abilities of Large Language Models. arXiv:2206.07682

  20. [20]

    Natural Language Processing

    Chowdhary, K. R. Natural Language Processing. Fundamentals of Artificial Intelligence

  21. [21]

    Attention Is All You Need

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia. Attention Is All You Need. arXiv:1706.03762

  22. [22]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc V. and Zhou, Denny. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (NeurIPS 2022)

  23. [23]

    Language Models are Few-Shot Learners

    Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others. Language Models are Few-Shot Learners. arXiv:2005.14165

  24. [24]

    On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models

    Verma, Mudit and Bhambri, Siddhant and Kambhampati, Subbarao. On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models. arXiv:2405.13966

  25. [25]

    Graph of Thoughts: Solving Elaborate Problems with Large Language Models

    Besta, Maciej and Blach, Nils and Kubicek, Ales and Gerstenberger, Robert and Podstawski, Michal and Gianinazzi, Lukas and Gajda, Joanna and Lehmann, Tomasz and Niewiadomski, Hubert and Nyczyk, Piotr and others. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence

  26. [26]

    LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks

    Kambhampati, Subbarao and Valmeekam, Karthik and Guan, Lin and Verma, Mudit and Stechly, Kaya and Bhambri, Siddhant and Saldyt, Lucas and Murthy, Anil. LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. arXiv:2402.01817

  27. [27]

    Understanding the planning of LLM agents: A survey

    Huang, Xu and Liu, Weiwen and Chen, Xiaolong and Wang, Xingmei and Wang, Hao and Lian, Defu and Wang, Yasheng and Tang, Ruiming and Chen, Enhong. Understanding the Planning of LLM Agents: A Survey. arXiv:2402.02716

  28. [28]

    Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

    Huang, Wenlong and Abbeel, Pieter and Pathak, Deepak and Mordatch, Igor. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Proceedings of the 39th International Conference on Machine Learning (ICML 2022)

  29. [29]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Ahn, Michael and Brohan, Anthony and Brown, Noah and Chebotar, Yevgen and Cortes, Omar and David, Byron and Finn, Chelsea and Fu, Chuyuan and Gopalakrishnan, Keerthana and Hausman, Karol and others. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Proceedings of Machine Learning Research

  30. [30]

    Translating Natural Language to Planning Goals with Large-Language Models

    Xie, Yaqi and Yu, Chen and Zhu, Tongyao and Bai, Jinbin and Gong, Ze and Soh, Harold. Translating Natural Language to Planning Goals with Large-Language Models. arXiv:2302.05128

  31. [31]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    Liu, Bo and Jiang, Yuqian and Zhang, Xiaohan and Liu, Qiang and Zhang, Shiqi and Biswas, Joydeep and Stone, Peter. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv:2304.11477

  32. [32]

    LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning

    Sharan, S. P. and Pittaluga, Francesco and Kumar B. G., Vijay and Chandraker, Manmohan. LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning. arXiv:2401.00125

  33. [33]

    LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

    Song, Chan Hee and Wu, Jiaman and Washington, Clayton and Sadler, Brian M. and Chao, Wei-Lun and Su, Yu. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  34. [34]

    Dynamic Planning with a LLM

    Dagan, Gautier and Keller, Frank and Lascarides, Alex. Dynamic Planning with a LLM. arXiv:2308.06391

  35. [35]

    Planning With Large Language Models Via Corrective Re-Prompting

    Raman, Shreyas Sundara and Cohen, Vanya and Rosen, Eric and Idrees, Ifrah and Paulius, David and Tellex, Stefanie. Planning With Large Language Models Via Corrective Re-Prompting. NeurIPS 2022 Foundation Models for Decision Making Workshop

  36. [36]

    Chain of Thoughtlessness? An Analysis of CoT in Planning

    Stechly, Kaya and Valmeekam, Karthik and Kambhampati, Subbarao. Chain of Thoughtlessness? An Analysis of CoT in Planning. arXiv:2405.04776

  37. [37]

    Causal Parrots: Large Language Models May Talk Causality But Are Not Causal

    Zečević, Matej and Willig, Moritz and Dhami, Devendra Singh and Kersting, Kristian. Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. arXiv:2308.13067

  38. [38]

    Automatic Prompt Generation and Optimization by Leveraging Large Language Models to Enhance Few-Shot Learning in Biomedical Tasks

    Shi, Yiwen and Hu, Xiaohua. Automatic Prompt Generation and Optimization by Leveraging Large Language Models to Enhance Few-Shot Learning in Biomedical Tasks. 2024 IEEE International Conference on Big Data (BigData)

  39. [39]

    OpenAI o1 System Card

    Jaech, Aaron and Kalai, Adam and Lerer, Adam and Richardson, Adam and others. OpenAI o1 System Card. arXiv:2412.16720

  40. [40]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948

  41. [41]

    Understanding the Effects of RLHF on LLM Generalisation and Diversity

    Kirk, Robert and Mediratta, Ishita and Nalmpantis, Christoforos and Luketina, Jelena and Hambro, Eric and Grefenstette, Edward and Raileanu, Roberta. Understanding the Effects of RLHF on LLM Generalisation and Diversity. arXiv:2310.06452

  42. [42]

    The 1998 AI Planning Systems Competition

    McDermott, Drew M. The 1998 AI Planning Systems Competition. AI Magazine

  43. [43]

    Qwen3 Technical Report

    Qwen Team. Qwen3 Technical Report. arXiv:2505.09388

  44. [44]

    When More is Less: Understanding Chain-of-Thought Length in LLMs

    Wu, Yuyang and Wang, Yifei and Ye, Ziyu and Du, Tianqi and Jegelka, Stefanie and Wang, Yisen. When More is Less: Understanding Chain-of-Thought Length in LLMs. arXiv:2502.07266

  45. [45]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874