CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on quadratic TM overhead.
super hub Canonical reference
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Canonical reference. 85% of citing Pith papers cite this work as background.
abstract
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one
authors
co-cited works
representative citing papers
EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.
UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
Test-time training with KV binding reduces to learned linear attention.
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
SPIRAL is a reinforcement learning framework that jointly optimizes sequential reasoning, parallel trace generation, and aggregation in language models for improved test-time performance.
SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.
SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.
MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.
QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.
TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
Consequence-aware scheduler using an issue-text predictor routes more compute to high-cost failures and cuts cost-weighted loss by 22-33% versus difficulty-based allocation on SWE-bench tasks.
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
citing papers explorer
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
-
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
-
KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.
-
Unlocking the Working Memory of Large Language Models for Latent Reasoning
RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
-
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
-
ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.
-
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
-
Stay Focused: Problem Drift in Multi-Agent Debate
The paper defines and measures 'problem drift' in multi-agent LLM debates across tasks and proposes DRIFTJudge and DRIFTPolicy as baselines to detect and reduce it.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Training Large Language Models to Reason in a Continuous Latent Space
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency trade-offs.
-
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.
-
MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark
MSQA benchmark shows LLMs exhibit cultural degradation where competence tracks pre-training data exposure more than reasoning ability, and inference fixes like sampling or retrieval do not close the gap.
-
Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG
GDP-RAG targets only information deltas in multi-hop RAG through preliminary grounding, gap-conditioned prompts, and skeletal trajectories, reaching 60.63% accuracy at 0.51 cost-of-pass on HotpotQA, 2WikiMultiHopQA, and MuSiQue.
-
Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.
-
From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory
MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.
-
Boosting Self-Consistency with Ranking
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
-
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.
-
Geometric Latent Reasoning Induces Shorter Generations in LLMs
GLR formulates latent reasoning as geometric path approximation in pretrained embedding space and reports shorter LLM generations on math tasks without an explicit length penalty.
-
Inference Time Optimization with Confidence Dynamics
Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.
-
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.
-
Unified Data Selection for LLM Reasoning
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
-
Process Rewards with Learned Reliability
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
-
Reasoning Models Don't Just Think Longer, They Move Differently
After length correction, reasoning-trained language models exhibit distinct hidden-state trajectory geometries on harder problems compared to instruction-tuned baselines, with the strongest effect in code domains.
-
Hint Tuning: Less Data Makes Better Reasoners
Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
-
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Procedural Knowledge at Scale Improves Reasoning
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
-
Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
CAP-TTA triggers context-aware preconditioned LoRA updates on high bias-risk OOD prompts to reduce toxicity in LLM narrative generation while preserving fluency and avoiding catastrophic forgetting.
-
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.
-
Rectifying LLM Thought from Lens of Optimization
RePro defines a surrogate objective with intensity and stability scores to generate process-level rewards that enhance LLM reasoning efficiency and accuracy within RLVR pipelines.
-
Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing
HiTGNN and ReVeAL enable accurate near-term prediction of Type 2 Diabetes from longitudinal clinical notes with temporal modeling and privacy preservation.
-
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
-
GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.
-
Dream 7B: Diffusion Large Language Models
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.
-
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
-
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
-
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Language models that naturally exhibit verification, backtracking, subgoal setting, and backward chaining improve substantially during RL on verifiable tasks, and these behaviors can be instilled via priming with reasoning-focused examples or filtered pretraining to enable self-improvement.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
-
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within 100 steps.
-
Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics
A ridge predictor using prompt-level agreement spread, label-assisted first-correct position, completion-length variance, and entropy reaches Spearman ρ=0.90 with observed best-of-N gains across three model families and six post-training methods.