super hub Canonical reference

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Aviral Kumar, Charlie Snell, Jaehoon Lee, Kelvin Xu · 2024 · cs.LG · arXiv 2408.03314

Canonical reference. 85% of citing Pith papers cite this work as background.

238 Pith papers citing it

Background 85% of classified citations

open full Pith review browse 238 citing papers more from Aviral Kumar arXiv PDF

abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 48 dataset 3 method 3

citation-polarity summary

background 46 use method 3 unclear 2 use dataset 2 support 1

claims ledger

abstract Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one

authors

Aviral Kumar Charlie Snell Jaehoon Lee Kelvin Xu

co-cited works

representative citing papers

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Do generative video models understand physical principles?

cs.CV · 2025-01-14 · unverdicted · novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

Unlocking the Working Memory of Large Language Models for Latent Reasoning

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

The paper identifies unfaithful capitulation, a failure mode where chain-of-thought remains correct but the emitted answer flips wrong under sustained adversarial pressure in multi-turn dialogue.

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.

Learning How to Cube

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

Active Testing of Large Language Models via Approximate Neyman Allocation

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.

RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

cs.LG · 2026-05-10 · unverdicted · novelty 7.0 · 3 refs

RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

citing papers explorer

Showing 16 of 16 citing papers after filters.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 28 · 2 links · internal anchor
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees cs.LG · 2026-04-22 · unverdicted · none · ref 15
Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computation while improving performance on math, coding, and reasoning benchmarks.
What should post-training optimize? A test-time scaling law perspective cs.LG · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration cs.LG · 2026-05-11 · unverdicted · none · ref 50 · 2 links · internal anchor
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport cs.LG · 2026-05-07 · unverdicted · none · ref 23 · 2 links · internal anchor
Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 11 · internal anchor
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 76 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
FASTER: Value-Guided Sampling for Fast RL cs.LG · 2026-04-21 · unverdicted · none · ref 2
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 58
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Reasoning emerges from constrained inference manifolds in large language models cs.LG · 2026-05-02 · unverdicted · none · ref 26 · internal anchor
Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
Physical Foundation Models: Fixed hardware implementations of large-scale neural networks cs.LG · 2026-04-30 · unverdicted · none · ref 119 · internal anchor
Physical Foundation Models are fixed physical hardware realizations of foundation-scale neural networks that compute via inherent material dynamics, potentially delivering orders-of-magnitude gains in energy efficiency, speed, and density over digital systems.
Adaptive Negative Reinforcement for LLM Reasoning:Dynamically Balancing Correction and Diversity in RLVR cs.LG · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws cs.LG · 2026-04-27 · unreviewed · ref 58 · internal anchor
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unreviewed · ref 2

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer