pith. sign in

super hub Canonical reference

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Canonical reference. 85% of citing Pith papers cite this work as background.

293 Pith papers citing it
Background 85% of classified citations
abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

hub tools

citation-role summary

background 48 dataset 3 method 3

citation-polarity summary

claims ledger

  • abstract Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one

authors

co-cited works

clear filters

representative citing papers

Entropy-Gated Latent Recursion

cs.LG · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Test-Time Training with KV Binding Is Secretly Linear Attention cs.LG · 2026-02-24 · conditional · none · ref 16 · internal anchor

    Test-time training with KV binding reduces to learned linear attention.

  • Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement cs.LG · 2025-07-11 · conditional · none · ref 47 · internal anchor

    PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.

  • When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG · 2026-05-01 · conditional · none · ref 35 · internal anchor

    A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

  • CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 19 · internal anchor

    CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

  • What Does Flow Matching Bring To TD Learning? cs.LG · 2026-03-04 · conditional · none · ref 56 · internal anchor

    Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.

  • Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs cs.LG · 2025-08-27 · conditional · none · ref 30 · internal anchor

    GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

  • Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 58

    Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.