Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Aviral Kumar, Charlie Snell, Jaehoon Lee, Kelvin Xu · 2024 · cs.LG · arXiv 2408.03314

Canonical reference. 85% of citing Pith papers cite this work as background.

285 Pith papers citing it

Background 85% of classified citations

abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
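The abstract's best-of-N baseline and its difficulty-dependent "compute-optimal" allocation can be illustrated with a small Python sketch. This is not the paper's implementation: the function names, the three difficulty bins, the strategy labels, and every number below are hypothetical placeholders chosen for illustration. It assumes that a verifier score is available for each sampled answer and that per-strategy accuracy has already been measured on held-out prompts at a fixed compute budget.

```python
from typing import Callable, Dict, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the (hypothetical) verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def pick_strategy_per_bin(accuracy: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each difficulty bin, pick the strategy with the highest measured accuracy.

    accuracy[bin_name][strategy] is assumed to be measured on held-out prompts
    at one fixed test-time compute budget.
    """
    return {
        bin_name: max(per_strategy, key=per_strategy.get)
        for bin_name, per_strategy in accuracy.items()
    }


# Toy verifier scores for three sampled answers (made-up values).
toy_scores = {"x = 3": 0.2, "x = 4": 0.9, "x = 5": 0.4}
chosen = best_of_n(list(toy_scores), lambda answer: toy_scores[answer])

# Toy held-out accuracies per difficulty bin and strategy (made-up values).
toy_accuracy = {
    "easy": {"best_of_n": 0.90, "beam_search": 0.86, "revisions": 0.93},
    "medium": {"best_of_n": 0.55, "beam_search": 0.63, "revisions": 0.58},
    "hard": {"best_of_n": 0.10, "beam_search": 0.14, "revisions": 0.12},
}
policy = pick_strategy_per_bin(toy_accuracy)

print(chosen)
print(policy)
```

At inference time, a prompt would first be assigned to a difficulty bin and then routed to the strategy stored in `policy` for that bin. How difficulty is estimated is left open here.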

citation-role summary

background: 48 · dataset: 3 · method: 3

citation-polarity summary

background: 46 · use method: 3 · unclear: 2 · use dataset: 2 · support: 1

authors

Aviral Kumar, Charlie Snell, Jaehoon Lee, Kelvin Xu

representative citing papers

Entropy-Gated Latent Recursion

cs.LG · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

cs.AI · 2026-06-06 · unverdicted · novelty 8.0

UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Do generative video models understand physical principles?

cs.CV · 2025-01-14 · unverdicted · novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on hard cases.

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

cs.CY · 2026-06-03 · unverdicted · novelty 7.0

LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Consequence-aware scheduler using an issue-text predictor routes more compute to high-cost failures and cuts cost-weighted loss by 22-33% versus difficulty-based allocation on SWE-bench tasks.

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

Unlocking the Working Memory of Large Language Models for Latent Reasoning

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

The paper identifies unfaithful capitulation, a failure mode where chain-of-thought remains correct but the emitted answer flips wrong under sustained adversarial pressure in multi-turn dialogue.

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.

citing papers explorer

Showing 50 of 63 citing papers after filters.

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL cs.AI · 2026-06-06 · unverdicted · none · ref 220 · internal anchor
UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.
Agentic generation of verifiable rules for deterministic, self-expanding reaction classification cs.AI · 2026-07-01 · unverdicted · none · ref 29 · internal anchor
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation cs.AI · 2026-06-03 · unverdicted · none · ref 12 · internal anchor
Consequence-aware scheduler using an issue-text predictor routes more compute to high-cost failures and cuts cost-weighted loss by 22-33% versus difficulty-based allocation on SWE-bench tasks.
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure cs.AI · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
The paper identifies unfaithful capitulation, a failure mode where chain-of-thought remains correct but the emitted answer flips wrong under sustained adversarial pressure in multi-turn dialogue.
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation cs.AI · 2026-05-26 · unverdicted · none · ref 29 · internal anchor
LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents cs.AI · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning cs.AI · 2026-05-15 · unverdicted · none · ref 36 · internal anchor
CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 65 · 2 links · internal anchor
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces cs.AI · 2026-05-14 · unverdicted · none · ref 28 · internal anchor
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents cs.AI · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.
Active Testing of Large Language Models via Approximate Neyman Allocation cs.AI · 2026-05-11 · unverdicted · none · ref 12 · 2 links · internal anchor
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.
Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization cs.AI · 2026-05-07 · unverdicted · none · ref 36 · internal anchor
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference cs.AI · 2026-05-01 · unverdicted · none · ref 19 · internal anchor
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correct answer, dollars per correct answer, and endpoint fidelity.
AI Achieves a Perfect LSAT Score cs.AI · 2026-04-11 · unverdicted · none · ref 6 · internal anchor
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 29 · internal anchor
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling cs.AI · 2025-10-16 · unverdicted · none · ref 35 · internal anchor
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
Bayesian Social Deduction with Graph-Informed Language Models cs.AI · 2025-06-21 · unverdicted · none · ref 53 · internal anchor
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers cs.AI · 2026-06-16 · unverdicted · none · ref 37 · internal anchor
FPRM is a Transformer-based model using fixed-point convergence for adaptive halting in looped architectures, claimed effective on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.
Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization cs.AI · 2026-06-07 · unverdicted · none · ref 30 · internal anchor
ISPO densifies GRPO rewards with sequence-level informativeness and token-level directional signals from policy probabilities to reduce zero-advantage collapse and hallucinated certainty on math benchmarks.
Scaffold Effects on GAIA: A Controlled Comparison cs.AI · 2026-06-07 · unverdicted · none · ref 9 · internal anchor
A controlled comparison shows scaffold choice alters GAIA Level 1-2 accuracy by up to 28 points, with effects varying by model family rather than capability tier alone.
GITCO: Gated Inference-Time Context Optimization in TSFMs cs.AI · 2026-06-03 · unverdicted · none · ref 11 · internal anchor
GITCO delivers +1.95% average MASE reduction on TimesFM 2.5 across 53 datasets by gated inference-time suppression of anomalous patches, capturing 89.9% of the improvement upper bound.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 31 · internal anchor
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
ReasonOps: Operator Segmentation for LLM Reasoning Traces cs.AI · 2026-05-28 · unverdicted · none · ref 32 · internal anchor
Unsupervised clustering on sentence-initial 3-token pivots extracts 7 universal reasoning operators from 44k traces across 12 LLMs that enable model fingerprinting and answer-correctness prediction.
In LLM Reasoning, there is Irrationality on top of Value Misalignment cs.AI · 2026-05-26 · unverdicted · none · ref 133 · internal anchor
LLMs display widespread rational value risk in reasoning that value alignment reduces but does not remove, with risk sensitive to inference strategy and showing diminishing returns from longer reasoning.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Engagement Process: Rethinking the Temporal Interface of Action and Observation cs.AI · 2026-05-12 · unverdicted · none · ref 37 · 2 links · internal anchor
Engagement Process (EP) decouples actions and observations as independent event streams over time within a POMDP structure to explicitly model temporal dynamics in agent interactions.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding cs.AI · 2026-05-04 · unverdicted · none · ref 15 · internal anchor
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
Reasoning Structure Matters for Safety Alignment of Reasoning Models cs.AI · 2026-04-21 · unverdicted · none · ref 43 · internal anchor
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling cs.AI · 2026-04-12 · unverdicted · none · ref 2 · internal anchor
Extended reasoning in LLMs exhibits overthinking and diminishing returns, with optimal thinking length varying by problem difficulty, allowing significant compute savings by stopping at moderate budgets.
Conformal Thinking: Risk Control for Reasoning on a Compute Budget cs.AI · 2026-02-03 · unverdicted · none · ref 7 · internal anchor
Conformal risk control with upper and lower thresholds lets LLMs adaptively stop reasoning while guaranteeing a maximum error rate and minimizing token use.
Glia: A Human-Inspired AI for Automated Systems Design and Optimization cs.AI · 2025-10-31 · unverdicted · none · ref 71 · internal anchor
Glia deploys a multi-agent LLM workflow with reasoning, experimentation, and analysis agents to generate interpretable algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters, reaching human-expert performance levels.
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search cs.AI · 2025-09-29 · unverdicted · none · ref 17 · internal anchor
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts cs.AI · 2025-09-26 · unverdicted · none · ref 21 · internal anchor
Retrieval-of-Thought organizes prior reasoning into a thought graph for retrieval and reward-guided recombination, reducing output tokens by up to 40% and latency by 82% while preserving accuracy on reasoning benchmarks.
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning cs.AI · 2025-03-25 · unverdicted · none · ref 28 · internal anchor
ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 248 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training cs.AI · 2025-01-28 · unverdicted · none · ref 6 · internal anchor
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 57 · internal anchor
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning cs.AI · 2026-06-09 · unverdicted · none · ref 36 · internal anchor
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference cs.AI · 2026-06-06 · unverdicted · none · ref 54 · internal anchor
PPV delegation using letter entropy and per-question embedding cosine beats majority voting by 1.5 pp overall on MMLU-Pro in an unsupervised setting.
BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization cs.AI · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
BuddyBench introduces a multi-task benchmark linking drill trajectories, clinical scores, self-reports, and RCT endpoints across 275 children in two cohorts for knowledge tracing, recommendation, prediction, and causal inference while preserving privacy.
A Sober Look at Agentic Misalignment in Automated Workflows cs.AI · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
Agentic misalignment in multi-agent systems arises from generic utilities causing posterior collapse; Agentic Evidence Attribution using self-reflection or weak-to-strong generalization provides context-specific evidence to align agent posteriors.
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems cs.AI · 2026-05-21 · unverdicted · none · ref 133 · internal anchor
Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.
ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling cs.AI · 2026-05-21 · unverdicted · none · ref 1 · internal anchor
ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management cs.AI · 2026-05-16 · unverdicted · none · ref 32 · 2 links · internal anchor
AI agents in supply chain simulations outperform humans but exhibit decision instability that GRPO post-training reduces.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 16 · 2 links · internal anchor
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation cs.AI · 2026-05-09 · unverdicted · none · ref 37 · internal anchor
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models cs.AI · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling cs.AI · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while self-consistency saturates.
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents cs.AI · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes against extra search.
