Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Aviral Kumar, Charlie Snell, Jaehoon Lee, Kelvin Xu · 2024 · cs.LG · arXiv 2408.03314

Canonical reference. 85% of citing Pith papers cite this work as background.

258 Pith papers citing it

Background: 85% of classified citations

abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
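
The abstract's two ideas, a best-of-N baseline and a per-prompt "compute-optimal" choice of strategy, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `sample` and `score` callables, the integer difficulty bins, and the strategy labels `verifier_search` and `sequential_revision` are all hypothetical placeholders chosen for illustration. The sketch assumes that difficulty has already been bucketed into bins and that held-out results are available for each strategy.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int) -> str:
    """Best-of-N baseline: draw n candidates, keep the highest-scoring one.

    `sample` stands in for an LLM call and `score` for a verifier;
    both are hypothetical callables supplied by the caller.
    """
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def fit_compute_optimal_policy(
    val_results: Iterable[Tuple[int, str, bool]],
) -> Dict[int, str]:
    """Pick, per difficulty bin, the strategy with the best held-out accuracy.

    Each record is (difficulty_bin, strategy_name, was_correct), assumed to
    be collected at one fixed compute budget.
    """
    totals = defaultdict(lambda: [0, 0])  # (bin, strategy) -> [hits, count]
    for bin_id, strategy, correct in val_results:
        entry = totals[(bin_id, strategy)]
        entry[0] += int(correct)
        entry[1] += 1

    best: Dict[int, Tuple[str, float]] = {}
    for (bin_id, strategy), (hits, count) in totals.items():
        accuracy = hits / count
        if bin_id not in best or accuracy > best[bin_id][1]:
            best[bin_id] = (strategy, accuracy)
    return {bin_id: strategy for bin_id, (strategy, _) in best.items()}


if __name__ == "__main__":
    # Toy held-out records; the labels are illustrative, not from the paper.
    records = [
        (0, "sequential_revision", True), (0, "verifier_search", False),
        (1, "sequential_revision", False), (1, "verifier_search", True),
    ]
    print(fit_compute_optimal_policy(records))
```

At test time, a prompt would be assigned to a difficulty bin and routed to the strategy the fitted policy maps that bin to. How difficulty is estimated and which strategies are compared are left open here.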

citation-role summary

background 48 · dataset 3 · method 3

citation-polarity summary

background 46 · use method 3 · unclear 2 · use dataset 2 · support 1
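
The 85% background figure quoted near the top of this page is consistent with the polarity counts above, assuming the share is computed as background citations divided by all classified citations. That formula is an assumption, since the page does not state how the percentage is derived. A short Python check of the arithmetic:

```python
# Counts copied from the citation-polarity summary on this page.
polarity_counts = {
    "background": 46,
    "use method": 3,
    "unclear": 2,
    "use dataset": 2,
    "support": 1,
}


def share(counts: dict, label: str) -> float:
    """Fraction of all classified citations that carry the given label."""
    return counts[label] / sum(counts.values())


total = sum(polarity_counts.values())          # 54 classified citations
background_share = share(polarity_counts, "background")
print(f"{total} classified, background share {background_share:.0%}")
```

The role summary gives 48 background citations out of the same 54, which rounds to 89%, so the headline figure seems to track the polarity labels rather than the role labels.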

authors

Aviral Kumar, Charlie Snell, Jaehoon Lee, Kelvin Xu

representative citing papers

Entropy-Gated Latent Recursion

cs.LG · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Do generative video models understand physical principles?

cs.CV · 2025-01-14 · unverdicted · novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

Unlocking the Working Memory of Large Language Models for Latent Reasoning

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

The paper identifies unfaithful capitulation, a failure mode where chain-of-thought remains correct but the emitted answer flips wrong under sustained adversarial pressure in multi-turn dialogue.

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.

Learning How to Cube

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

citing papers explorer

Showing 40 of 40 citing papers after filters.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL · 2026-05-08 · conditional · none · ref 7 · 2 links · internal anchor
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
cs.CL · 2026-04-19 · unverdicted · none · ref 28
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
cs.CL · 2026-06-24 · unverdicted · none · ref 24 · internal anchor
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
cs.CL · 2026-06-09 · unverdicted · none · ref 14 · internal anchor
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

Unlocking the Working Memory of Large Language Models for Latent Reasoning
cs.CL · 2026-05-28 · unverdicted · none · ref 15 · internal anchor
RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.

Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL · 2026-05-13 · conditional · none · ref 28 · 2 links · internal anchor
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
cs.CL · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL · 2026-05-07 · unverdicted · none · ref 60 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL · 2026-05-07 · unverdicted · none · ref 5 · 2 links · internal anchor
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL · 2026-04-27 · unverdicted · none · ref 52 · 2 links · internal anchor
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
cs.CL · 2026-01-06 · unverdicted · none · ref 9 · internal anchor
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.

ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
cs.CL · 2026-01-05 · unverdicted · none · ref 9 · internal anchor
ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG
cs.CL · 2026-06-21 · unverdicted · none · ref 85 · internal anchor
GDP-RAG targets only information deltas in multi-hop RAG through preliminary grounding, gap-conditioned prompts, and skeletal trajectories, reaching 60.63% accuracy at 0.51 cost-of-pass on HotpotQA, 2WikiMultiHopQA, and MuSiQue.

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
cs.CL · 2026-06-16 · unverdicted · none · ref 36 · internal anchor
Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.

Boosting Self-Consistency with Ranking
cs.CL · 2026-06-03 · unverdicted · none · ref 50 · internal anchor
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.

Inference Time Optimization with Confidence Dynamics
cs.CL · 2026-05-24 · unverdicted · none · ref 11 · internal anchor
Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
cs.CL · 2026-05-22 · unverdicted · none · ref 13 · 2 links · internal anchor
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.

Unified Data Selection for LLM Reasoning
cs.CL · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

Process Rewards with Learned Reliability
cs.CL · 2026-05-15 · unverdicted · none · ref 53 · internal anchor
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

Reasoning Models Don't Just Think Longer, They Move Differently
cs.CL · 2026-05-14 · unverdicted · none · ref 5 · 2 links · internal anchor
After length correction, reasoning-trained language models exhibit distinct hidden-state trajectory geometries on harder problems compared to instruction-tuned baselines, with the strongest effect in code domains.

Hint Tuning: Less Data Makes Better Reasoners
cs.CL · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor
Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.

AIPO: Learning to Reason from Active Interaction
cs.CL · 2026-05-08 · unverdicted · none · ref 58 · 2 links · internal anchor
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
cs.CL · 2026-04-29 · unverdicted · none · ref 20 · internal anchor
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
cs.CL · 2026-04-24 · unverdicted · none · ref 24 · internal anchor
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
cs.CL · 2026-04-08 · conditional · none · ref 7 · internal anchor
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL · 2026-04-06 · unverdicted · none · ref 52 · internal anchor
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

Procedural Knowledge at Scale Improves Reasoning
cs.CL · 2026-04-01 · unverdicted · none · ref 35 · internal anchor
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.

Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
cs.CL · 2026-03-14 · unverdicted · none · ref 9 · internal anchor
CAP-TTA triggers context-aware preconditioned LoRA updates on high bias-risk OOD prompts to reduce toxicity in LLM narrative generation while preserving fluency and avoiding catastrophic forgetting.

Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
cs.CL · 2026-01-29 · unverdicted · none · ref 7 · internal anchor
RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
cs.CL · 2026-04-17 · unverdicted · none · ref 19
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within 100 steps.

Reinforcement Learning from Denoising Feedback
cs.CL · 2026-05-25 · unverdicted · none · ref 22 · internal anchor
RLDF is a new RL paradigm for diffusion language models that optimizes toward clipped clean states with weighted timestep sampling and reports substantial gains on reasoning benchmarks for LLaDA and Dream.

RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation
cs.CL · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
RAS conditions each new Cypher query attempt on prior execution errors through ICL and reduces execution error rate by 41-50% at n=5 versus 32-38% for independent scaling across three Neo4j datasets and five models.

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
cs.CL · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols
cs.CL · 2026-05-10 · unverdicted · none · ref 7 · internal anchor
Oracle per-example routing among decoding, voting, and debate yields +13-14 pp gains over the best fixed protocol, but vote-entropy thresholds and learned routers recover only 1-2 pp with non-significant results.

Language models fail at extended rule following
cs.CL · 2026-05-03 · unverdicted · none · ref 13 · 2 links · internal anchor
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.

Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
cs.CL · 2026-04-07 · unverdicted · none · ref 23 · internal anchor
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
cs.CL · 2026-04-03 · unverdicted · none · ref 2 · internal anchor
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience
cs.CL · 2026-05-24 · unverdicted · none · ref 32 · internal anchor
A textbook-derived neuroscience knowledge graph supplies synthetic multi-hop QA supervision and RL rewards to fine-tune a small LM claimed to exceed larger general models on expert reasoning.

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
cs.CL · 2026-03-29 · unverdicted · none · ref 2 · internal anchor
Model capability dominates over all tested inference-time prompt optimizations in LLM math reasoning on IMO-level problems.

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL · 2026-04-19 · unreviewed · ref 20
