TTRL: Test-Time Reinforcement Learning
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is estimating rewards during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the maj@n metric, it consistently surpasses the upper limit of the initial model's maj@n and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight its potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
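The abstract's central idea, that majority voting over sampled answers can stand in for a ground-truth reward, can be illustrated with a minimal sketch. This is not the code from the TTRL repository: the function name `majority_vote_rewards`, the binary 1.0/0.0 reward values, and the tie-breaking rule are assumptions made for illustration, and answer extraction from model outputs is assumed to have happened already.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score sampled answers for one prompt against their own majority vote.

    Hypothetical helper for illustration only. `answers` holds the final
    answers already extracted from N sampled responses to the same prompt.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most frequent answer serves as the pseudo-label. Counter orders
    # equal counts by first occurrence, so ties go to the earliest answer.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    # Assumed binary reward: 1.0 for agreeing with the pseudo-label, else 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    samples = ["42", "41", "42", "7", "42"]
    label, rewards = majority_vote_rewards(samples)
    print(label, rewards)
```

In a training loop, rewards of this kind would be passed to whatever RL algorithm is in use in place of label-based rewards. Because the pseudo-label can be wrong, the reward is only as reliable as the model's majority answer.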
representative citing papers
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.
TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
ZeroSiam is an asymmetric architecture using a learnable predictor and stop-gradient that prevents collapse in test-time entropy minimization while also regularizing biased signals for improved performance.
GC-TTT adapts goal-conditioned policies at test time by fine-tuning on self-supervised selected goal-related offline data, yielding performance gains in loco-navigation and manipulation tasks.
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
A model trained only by proposing and solving its own verifiable code tasks achieves state-of-the-art results on math and coding benchmarks without external data.
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
OASIF improves open-source LLMs on obfuscated assembly comprehension by 5-17 percentage points on commercial VM obfuscators via a three-phase self-evolving training pipeline.
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.
SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.
MOA applies multi-objective RL with fine-grained rubrics and thought-augmented rollouts to role-playing agents, enabling an 8B model to match closed-source performance on PersonaGym and RoleMRC benchmarks.
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
citing papers explorer
- A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
  Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
  SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
  DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
- Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
  Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
- Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
  Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
- MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
  MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
- GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
  GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
- SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
  SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.