Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Pith reviewed 2026-05-10 14:09 UTC · model grok-4.3
The pith
Optimally allocating test-time compute adaptively lets smaller LLMs outperform 14x larger models when base success rates are non-trivial.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a difficulty-aware adaptive strategy for scaling test-time computation, using either verifier-guided search or distribution updates, delivers higher performance per unit of compute than fixed strategies. In FLOPs-matched comparisons, it also lets a smaller model surpass a much larger one on tasks the smaller model can already solve with non-trivial probability.
What carries the argument
A compute-optimal scaling strategy that selects and allocates test-time compute per prompt according to its difficulty, switching between verifier-guided search and adaptive distribution updates to maximize output quality for the given inference budget.
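One way to picture the selection step is the minimal Python sketch below. It assumes prompts have already been assigned to discrete difficulty bins and that per-strategy correctness was recorded on a validation set at one fixed compute budget. The names `fit_strategy_table` and `choose_strategy`, the strategy labels, and the fallback default are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical validation record: (difficulty_bin, strategy_name, was_correct).
Record = Tuple[int, str, bool]


def fit_strategy_table(records: List[Record]) -> Dict[int, str]:
    """For each difficulty bin, keep the strategy with the highest validation accuracy."""
    # (bin, strategy) -> [number correct, number attempted]
    counts = defaultdict(lambda: [0, 0])
    for bin_id, strategy, correct in records:
        counts[(bin_id, strategy)][0] += int(correct)
        counts[(bin_id, strategy)][1] += 1

    best: Dict[int, Tuple[str, float]] = {}
    # Sorted iteration makes tie-breaking deterministic (first name alphabetically wins).
    for (bin_id, strategy), (n_correct, n_total) in sorted(counts.items()):
        accuracy = n_correct / n_total
        if bin_id not in best or accuracy > best[bin_id][1]:
            best[bin_id] = (strategy, accuracy)
    return {bin_id: name for bin_id, (name, _) in best.items()}


def choose_strategy(table: Dict[int, str], difficulty_bin: int,
                    default: str = "best_of_n") -> str:
    """Look up the strategy for a prompt's bin; fall back to a fixed baseline."""
    return table.get(difficulty_bin, default)


# Toy records, invented purely for illustration.
toy_records = [
    (0, "revisions", True), (0, "revisions", True),
    (0, "search", True), (0, "search", False),
    (1, "revisions", False), (1, "search", True),
]
toy_table = fit_strategy_table(toy_records)
```

In this sketch the table would be refit for every compute budget of interest, since the best strategy for a bin may change as the budget grows.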
If this is right
- Test-time compute can be traded against pre-training compute to achieve higher performance at lower total resource cost.
- Adaptive allocation per prompt is required to obtain the reported efficiency gains over non-adaptive baselines.
- On tasks where a base model already succeeds with some probability, extra inference compute can substitute for increases in model size.
- The tradeoff between inference-time and pre-training compute shifts in favor of the former when the right adaptive method is used.
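The exchange rate behind these points can be made concrete with a small calculation. The sketch below assumes the common approximations of about 6·N·D FLOPs for pretraining and about 2·N·T FLOPs for generating T tokens with an N-parameter model, and it assumes both models see the same number of pretraining tokens. The function name and arguments are hypothetical; the summary above does not spell out the paper's own accounting.

```python
def matched_inference_multiplier(size_ratio: float,
                                 inference_to_pretrain_ratio: float) -> float:
    """Factor by which a small model may scale its inference tokens to match
    the total FLOPs (pretraining plus inference) of a model `size_ratio` times larger.

    Assumptions (labelled, not from the source text):
      pretraining FLOPs ~ 6 * N * D_pre, inference FLOPs ~ 2 * N * D_inf,
      and both models use the same D_pre.
    Setting 6*N*D_pre + 2*N*D_inf*k = M * (6*N*D_pre + 2*N*D_inf) and solving for k
    gives k = M + 3 * (M - 1) * D_pre / D_inf.
    """
    if size_ratio < 1.0:
        raise ValueError("size_ratio must be at least 1")
    if inference_to_pretrain_ratio <= 0.0:
        raise ValueError("inference_to_pretrain_ratio must be positive")
    return size_ratio + 3.0 * (size_ratio - 1.0) / inference_to_pretrain_ratio


# Example: a 14x size gap with inference tokens equal to pretraining tokens.
multiplier = matched_inference_multiplier(14.0, 1.0)
```

Under these assumptions the multiplier grows as inference tokens become a smaller share of lifetime compute, which is why the comparison depends on how heavily a deployed model is expected to be queried.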
Where Pith is reading between the lines
- This result suggests that model training objectives could be redesigned to better support subsequent test-time search and adaptation.
- Resource allocation in large-scale AI systems may move toward lighter pretrained models paired with strong inference-time engines.
- Extending the adaptive allocation idea to longer-horizon or multi-step tasks could support iterative self-improvement loops without further pretraining.
Load-bearing premise
The effectiveness of different test-time scaling methods varies predictably with prompt difficulty in a manner that permits reliable adaptive allocation without introducing new errors or overhead.
What would settle it
Direct measurement on a held-out set of prompts showing that the adaptive per-prompt allocation fails to deliver any efficiency gain over a fixed best-of-N strategy or fails to let the smaller model exceed the 14x larger model in FLOPs-matched runs.
Original abstract
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
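For readers unfamiliar with the best-of-N baseline mentioned in the abstract, the following is a minimal sketch: draw N candidate answers and keep the one the verifier scores highest. The `sample` and `score` callables stand in for a language model and a verifier; both, along with the toy data, are hypothetical placeholders rather than the paper's implementation.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              sample: Callable[[str, random.Random], str],
              score: Callable[[str, str], float],
              n: int,
              seed: int = 0) -> str:
    """Draw n candidates independently and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates: List[str] = [sample(prompt, rng) for _ in range(n)]
    # max() returns the first candidate among ties.
    return max(candidates, key=lambda candidate: score(prompt, candidate))


# Toy stand-ins: a "model" that replays canned answers and a "verifier"
# that prefers answers numerically close to 7.
_canned = iter(["3", "9", "7", "5"])


def toy_sample(prompt: str, rng: random.Random) -> str:
    return next(_canned)


def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 7)


chosen = best_of_n("What is 3 + 4?", toy_sample, toy_score, n=4)
```

The cost of this baseline is fixed at N samples per prompt regardless of difficulty, which is the property the adaptive strategy is meant to improve on.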
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies scaling of test-time computation in LLMs via two mechanisms: searching with process-based verifier reward models and adaptive updates to the response distribution. It finds that effectiveness varies with prompt difficulty, motivating a compute-optimal adaptive allocation strategy. This strategy is claimed to improve efficiency by more than 4x over best-of-N and enable a smaller model to outperform a 14x larger model in FLOPs-matched settings on suitable prompts.
Significance. Should the results prove robust, the work is significant in demonstrating that test-time compute scaling can be more effective than parameter scaling for LLMs. It offers insights into optimal compute allocation and has implications for building self-improving AI agents and rethinking pretraining vs inference tradeoffs. The empirical demonstration of difficulty-dependent performance is a key contribution.
major comments (2)
- [Compute-optimal strategy section (likely §4.3)] The paper motivates the adaptive allocation from observed variation in method effectiveness with difficulty but does not account for the compute cost or error introduced by the difficulty estimator. This is load-bearing for the 4x efficiency claim and the 14x model outperformance, as misallocations or added FLOPs could invalidate the FLOPs-matched comparisons.
- [Experimental results (likely §5)] The results lack sufficient details on experimental setup, including specific benchmarks, baselines, statistical significance, error bars, and exact FLOPs calculation methodology for the adaptive methods. This hinders verification of the central empirical claims.
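The first major comment can be stated as simple arithmetic. The sketch below assumes the efficiency gain is the ratio of baseline FLOPs to adaptive-strategy FLOPs at equal accuracy, and that the difficulty estimator adds a cost expressed as a fraction of the adaptive strategy's FLOPs. The function name and the example fractions are hypothetical; no measured overhead is given in this summary.

```python
def net_efficiency_gain(reported_gain: float, estimator_overhead_fraction: float) -> float:
    """Efficiency gain after charging the difficulty estimator to the adaptive method.

    reported_gain: baseline_flops / adaptive_flops, with estimator cost excluded.
    estimator_overhead_fraction: estimator_flops / adaptive_flops.
    Net gain = baseline_flops / (adaptive_flops + estimator_flops)
             = reported_gain / (1 + estimator_overhead_fraction).
    """
    if reported_gain <= 0.0:
        raise ValueError("reported_gain must be positive")
    if estimator_overhead_fraction < 0.0:
        raise ValueError("estimator_overhead_fraction must be non-negative")
    return reported_gain / (1.0 + estimator_overhead_fraction)


# Illustrative values only: a 4x reported gain with no overhead, and with
# an estimator that costs as much as the adaptive method itself.
gain_without_overhead = net_efficiency_gain(4.0, 0.0)
gain_with_equal_overhead = net_efficiency_gain(4.0, 1.0)
```

This accounts only for added FLOPs; accuracy lost to misallocated prompts would need to be measured separately.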
minor comments (2)
- [Abstract] The abstract could specify the base models and datasets used to provide context for the 14x larger model comparison.
- [Methods] Clarify the distinction between process-based and outcome-based verifiers in the methods section to avoid potential confusion.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments help clarify the presentation of our adaptive compute-optimal strategy and improve the experimental details. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Compute-optimal strategy section (likely §4.3)] The paper motivates the adaptive allocation from observed variation in method effectiveness with difficulty but does not account for the compute cost or error introduced by the difficulty estimator. This is load-bearing for the 4x efficiency claim and the 14x model outperformance, as misallocations or added FLOPs could invalidate the FLOPs-matched comparisons.
  Authors: We agree that a thorough accounting of the difficulty estimator is necessary to support the efficiency claims. In the revised manuscript, we will add a dedicated analysis in the compute-optimal strategy section. This will include the computational overhead of the estimator (which is a small fraction of the total FLOPs), its prediction accuracy, and sensitivity analysis showing that the reported 4x efficiency improvement and the outperformance results remain valid even when including estimator costs and accounting for potential errors in difficulty assessment.
  Revision: yes
- Referee: [Experimental results (likely §5)] The results lack sufficient details on experimental setup, including specific benchmarks, baselines, statistical significance, error bars, and exact FLOPs calculation methodology for the adaptive methods. This hinders verification of the central empirical claims.
  Authors: We acknowledge the need for greater experimental transparency. The updated manuscript will provide comprehensive details on the experimental setup in §5, including the specific benchmarks employed, all baselines considered, results with error bars from multiple independent runs to establish statistical significance, and a clear, reproducible methodology for calculating FLOPs for both fixed and adaptive test-time compute strategies.
  Revision: yes
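As an illustration of the error bars promised in the second response, the sketch below computes a mean accuracy and its standard error across independent runs. It is a generic recipe, not the authors' stated procedure, and the run accuracies are made-up placeholder values.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_stderr(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) over independent runs.

    Uses the sample standard deviation (n - 1 in the denominator), so at
    least two runs are required.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(accuracies), stdev(accuracies) / math.sqrt(len(accuracies))


# Placeholder accuracies from two hypothetical runs.
run_mean, run_stderr = mean_and_stderr([0.0, 1.0])
```

A standard error alone does not establish significance; comparing two strategies on the same prompts would typically call for a paired test or a bootstrap over prompts.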
Circularity Check
No circularity: claims rest on direct experimental comparisons
The paper presents empirical results on test-time compute scaling for LLMs, comparing methods like search against verifiers and adaptive distribution updates. The central finding—that a compute-optimal adaptive strategy yields >4x efficiency gains and allows a smaller model to outperform a 14x larger one in FLOPs-matched settings—is supported by reported experiments on prompt difficulty variation, not by any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations reduce to tautologies, and the adaptive allocation is described as motivated by observations then validated experimentally rather than derived by construction from prior author work.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
Test-Time Training with KV Binding Is Secretly Linear Attention
Test-time training with KV binding reduces to learned linear attention.
-
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
-
Do generative video models understand physical principles?
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false posi...
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Autonomous AI agents outperform humans in supply chain simulations but exhibit an inherent agent bullwhip effect of amplified decision unreliability, mitigated by GRPO reinforcement learning post-training.
-
Learning How to Cube
A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.
-
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verif...
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
Active Testing of Large Language Models via Approximate Neyman Allocation
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings vers...
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and...
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation
BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and com...
-
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
-
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall per...
-
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
-
Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis
In a controlled arithmetic-grammar program synthesis environment, diverse sampling across semantic and syntactic spaces yields robust density generalization while support generalization for novel syntax remains poor, ...
-
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
Self-Correction as Feedback Control: Error Dynamics, Stability Thresholds, and Prompt Interventions in LLMs
Self-correction in LLMs is stable and non-degrading only when ECR/EIR exceeds initial accuracy over (1-accuracy), with EIR below 0.5% cleanly separating helpful from harmful cases across models.
-
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computat...
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
PAC-MCTS: Bias-Aware Pruning for Robust LLM-Guided Search and Planning
PAC-MCTS supplies bias-aware confidence bounds for pruning in LLM-guided MCTS, with O((Δ-4L)^{-2}) upper and Ω((Δ-2L)^{-2}) lower sample-complexity guarantees and up to 78% fewer API calls on Blocksworld and ALFWorld.
-
Towards Unconstrained Human-Object Interaction
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
AI Achieves a Perfect LSAT Score
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
-
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
-
Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
-
From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
-
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than b...
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
-
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
-
ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.
Reference graph
Works this paper leans on
-
[1]
Training revision models with synthetic data. Coming soon, 2024.
-
[2]
C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. 2003
-
[3]
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M...
-
[4]
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...
-
[5]
C. Blakeney, M. Paul, B. W. Larsen, S. Owen, and J. Frankle. Does your data spark joy? performance gains from domain upsampling at the end of training, 2024. URL https://arxiv.org/abs/2406.03476
-
[6]
G. Chen, M. Liao, C. Li, and K. Fan. Alphamath almost zero: process supervision without process, 2024
- [7]
-
[8]
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023
-
[9]
J. S. B. T. Evans. Heuristic and analytic processes in reasoning. British Journal of Psychology, 75(4): 451–468, 1984
-
[10]
X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024
-
[11]
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. Pal: Program-aided language models, 2023. URL https://arxiv.org/abs/2211.10435
-
[12]
S. Goyal, Z. Ji, A. S. Rawat, A. K. Menon, S. Kumar, and V. Nagarajan. Think before you speak: Training language models with pause tokens, 2024. URL https://arxiv.org/abs/2310.02226.
-
[13]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
-
[14]
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models, 2022
- [15]
-
[16]
A. L. Jones. Scaling scaling laws with board games, 2021. URL https://arxiv.org/abs/2104.03113
- [17]
-
[18]
D. Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, New York, first paperback edition, 2013
-
[19]
L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006
-
[20]
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models, 2022
-
[21]
Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. Making large language models better reasoners with step-aware verifier, 2023
-
[22]
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step, 2023
- [23]
-
[24]
N. McAleese, R. Pokorny, J. F. Cerón Uribe, E. Nitishinskaya, M. Trębacz, and J. Leike. Llm critics help catch llm bugs. OpenAI, 2024
- [25]
-
[26]
Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. URL https://arxiv.org/abs/2307.16789
- [27]
-
[28]
Y. Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching foundation models how to self-improve. 2024.
-
[29]
N. Sardana and J. Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws, 2023
-
[30]
W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike. Self-critiquing models for assisting human evaluators, 2022
- [31]
-
[32]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
- [33]
- [34]
-
[35]
A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R...
- [36]
-
[37]
K. Stechly, M. Marquez, and S. Kambhampati. Gpt-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning problems, 2023
-
[38]
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. Second edition, 2018
-
[39]
G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024
-
[40]
Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, H. Mi, and D. Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024
-
[41]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Ko...
- [42]
-
[43]
K. Valmeekam, M. Marquez, and S. Kambhampati. Can large language models really improve by self-critiquing their own plans?, 2023
-
[44]
P. Villalobos and D. Atkinson. Trading off compute in training and inference, 2023. URL https://epochai.org/blog/trading-off-compute-in-training-and-inference. Accessed: 2024-07-03
-
[45]
P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2023
- [46]
-
[47]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023
-
[48]
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023
-
[49]
Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023
-
[50]
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022
-
[51]
E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629.
Appendices
A. Related Work
Language model reasoning. Language model performance on challenging mathematical reasoning tasks has rapidly improved in recent years [...
-
[52]
improving the LLM proposal distribution by either applying targeted optimization on specific reasoning tasks by finetuning with RL [32, 35, 49, 50] or enabling models to critique and revise their answers iteratively [4, 8, 23, 30]; 3) enabling LLMs to benefit from additional test-time computation by finetuning verifiers [6, 7, 10, 22, 40, 42, 45, 48]. Our wo...