TTRL: Test-Time Reinforcement Learning
Recognition: 2 theorem links
Pith reviewed 2026-05-15 00:42 UTC · model grok-4.3
The pith
TTRL lets LLMs improve reasoning on unlabeled test data by treating majority voting as an RL reward.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTRL trains LLMs via reinforcement learning on unlabeled test inputs by using majority voting outcomes as the reward, enabling consistent performance gains that surpass the initial model's maj@n upper limit and approach the results of training with ground-truth labels.
What carries the argument
Majority voting as a reward estimator inside an RL loop applied directly to unlabeled test-time inputs.
Load-bearing premise
Majority voting produces a reliable enough reward signal to guide useful RL updates without any ground-truth labels.
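In concrete terms, the reward construction can be sketched as follows. This is a minimal illustration assuming binary match-with-majority rewards over extracted final answers; the function and variable names are illustrative, not the paper's code:

```python
from collections import Counter

def majority_vote_reward(answers):
    """Binary reward per rollout: 1 if its extracted final answer matches the
    majority answer across the n rollouts for this prompt, else 0. Ties are
    broken arbitrarily; real implementations also canonicalize answers."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Example: 5 rollouts for one unlabeled problem.
rewards = majority_vote_reward(["42", "17", "42", "42", "9"])
# -> [1.0, 0.0, 1.0, 1.0, 0.0]; no ground-truth label is used anywhere.
```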
What would settle it
Applying TTRL to a new reasoning benchmark and measuring no improvement or a drop in pass@1 accuracy compared with the baseline model would refute the central claim.
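For reference, the pass@k family used in this criterion is standard: with n samples per problem of which c are correct, the usual unbiased estimator (common practice in the literature, not a definition taken from this paper) is

$$\widehat{\text{pass@}k} \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right], \qquad \widehat{\text{pass@}1} \;=\; \mathbb{E}_{\text{problems}}\!\left[\frac{c}{n}\right].$$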
Original abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
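As a minor comment below notes, maj@n is never defined in the abstract. Under the conventional reading (an assumption here, consistent with common usage in the test-time-scaling literature), maj@n is the accuracy of the majority answer over n samples:

$$\text{maj@}n \;=\; \frac{1}{|\mathcal{D}|}\sum_{x \in \mathcal{D}} \mathbf{1}\!\left[\operatorname*{arg\,max}_{a}\ \sum_{i=1}^{n} \mathbf{1}\big[\hat{a}_i(x) = a\big] \;=\; a^{\star}(x)\right],$$

where $\hat{a}_i(x)$ is the final answer extracted from the $i$-th sample for problem $x$ and $a^{\star}(x)$ is the ground-truth answer.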
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Test-Time Reinforcement Learning (TTRL), a method for applying RL to LLMs on unlabeled data for reasoning tasks by deriving rewards from majority voting (maj@n). It claims that this enables self-evolution, yielding a ~211% pass@1 gain on AIME 2024 for Qwen-2.5-Math-7B and allowing the model to surpass the initial model's maj@n upper bound while approaching supervised performance.
Significance. If the central claims hold after verification, the work would be significant for demonstrating label-free RL self-improvement at test time in reasoning domains, extending test-time scaling techniques into a training loop and reducing dependence on ground-truth labels.
major comments (3)
- [Abstract] The 211% pass@1 improvement on AIME 2024 is reported without any description of the RL algorithm (e.g., PPO, GRPO), the learning rate, the number of gradient steps, or the value of n in maj@n, preventing verification or reproduction of the result.
- [Experiments] No ablation replaces the maj@n reward with a random or constant baseline reward; without this control, the reported gains cannot be distinguished from artifacts of repeated sampling on the fixed unlabeled test set.
- [Method] The manuscript provides no correlation analysis between maj@n-derived rewards and ground-truth correctness on the AIME 2024 split; given the low base pass@1 of Qwen-2.5-Math-7B, this leaves open the possibility that the reward reinforces dominant errors rather than correct solutions (see the diagnostic sketch after this list).
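The controls requested in the second and third comments could be prototyped with a few lines of this shape; this is a hedged sketch with hypothetical data structures, not the authors' released code:

```python
import random
from collections import Counter

def majority_answer(answers):
    # Most frequent final answer among the n rollouts (ties broken arbitrarily).
    return Counter(answers).most_common(1)[0][0]

def majority_label_agreement(rollouts, labels):
    """rollouts maps problem id -> list of n sampled answers; labels maps
    problem id -> ground-truth answer. Returns the fraction of problems whose
    majority answer is actually correct, i.e. how often the maj@n reward proxy
    points at the right target (the analysis requested in comment 3)."""
    hits = [majority_answer(rollouts[p]) == labels[p] for p in rollouts]
    return sum(hits) / len(hits)

def random_reward(answers, p=0.5):
    # Control for comment 2: a reward carrying no information about the
    # answers. Any training gain under this reward would indicate artifacts
    # of repeated sampling rather than a useful learning signal.
    return [1.0 if random.random() < p else 0.0 for _ in answers]
```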
minor comments (2)
- [Abstract] The notation 'maj@n' is used without an explicit definition or a reference to the number of samples n employed in the experiments.
- [Abstract] A GitHub link is provided, but the manuscript does not indicate whether the released code includes the exact hyperparameters and random seeds used for the AIME 2024 results.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Abstract] The 211% pass@1 improvement on AIME 2024 is reported without any description of the RL algorithm (e.g., PPO, GRPO), the learning rate, the number of gradient steps, or the value of n in maj@n, preventing verification or reproduction of the result.
  Authors: We agree that the abstract should include these implementation details to support verification. The full manuscript describes the use of GRPO as the RL algorithm, along with the associated hyperparameters, in Section 3. In the revised version we will explicitly state the RL algorithm, learning rate, number of gradient steps, and the value of n in maj@n directly in the abstract. Revision: yes.
- Referee: [Experiments] No ablation replaces the maj@n reward with a random or constant baseline reward; without this control, the reported gains cannot be distinguished from artifacts of repeated sampling on the fixed unlabeled test set.
  Authors: This is a fair observation. While the reported gains include surpassing the initial model's maj@n upper bound (which would not occur from sampling alone), we acknowledge that an explicit random-reward control would further isolate the contribution of the learned policy. We will add this ablation to the revised Experiments section. Revision: yes.
- Referee: [Method] The manuscript provides no correlation analysis between maj@n-derived rewards and ground-truth correctness on the AIME 2024 split; given the low base pass@1 of Qwen-2.5-Math-7B, this leaves open the possibility that the reward reinforces dominant errors rather than correct solutions.
  Authors: We recognize the importance of this analysis. The observation that TTRL exceeds the initial maj@n performance already suggests the model is not merely reinforcing majority errors; nevertheless, we will strengthen the manuscript by adding an explicit correlation study between maj@n rewards and ground-truth correctness on the AIME 2024 split to the revised Method or Experiments section. Revision: yes.
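Given that the rebuttal identifies GRPO as the underlying algorithm, a majority-vote reward would enter a group-normalized advantage roughly as follows. This is a sketch of the standard GRPO-style formulation (mean/std normalization within a rollout group); the paper's exact variant may differ:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage as in GRPO-style methods: normalize each
    rollout's reward against the mean and std of its rollout group. With
    majority-vote rewards, majority-consistent rollouts receive a positive
    advantage and minority rollouts a negative one."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Majority-vote rewards for 5 rollouts of one unlabeled problem:
advs = grpo_advantages([1.0, 0.0, 1.0, 1.0, 0.0])
# -> roughly [+0.82, -1.22, +0.82, +0.82, -1.22]
```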
Circularity Check
No significant circularity: the reward signal is an external majority vote, not a self-referential fit.
Full rationale
The TTRL derivation uses majority voting over n samples drawn from the current policy as an external reward signal for RL updates on unlabeled data. This reward is computed from the sampled outputs themselves, independently of the policy-gradient steps, and is not defined in terms of the final performance metric; the reported gains (including surpassing the initial maj@n) are presented as empirical outcomes of optimization rather than algebraic identities. No load-bearing self-citations, fitted parameters renamed as predictions, or ansatzes smuggled in via prior work appear in the core chain. The method remains falsifiable against external benchmarks such as ground-truth pass@1 on held-out sets.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: majority voting among model outputs provides an effective reward signal for RL training on unlabeled reasoning data.
Lean theorems connected to this paper
- Foundation.LawOfExistence.defect_zero_iff_one (tag: unclear)
  unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Linked passage: "TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
  SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- Bounded Ratio Reinforcement Learning
  BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
- MemDLM: Memory-Enhanced DLM Training
  MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
  A model trained only by proposing and solving its own verifiable code tasks achieves state-of-the-art results on math and coding benchmarks without external data.
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
  MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- Gradient Extrapolation-Based Policy Optimization
  GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
- Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
  A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
  DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
- Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
  ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
- TEMPO: Scaling Test-time Training for Large Reasoning Models
  TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
  A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
  MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
- ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
  ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
- Can LLMs Learn to Reason Robustly under Noisy Supervision?
  Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
- GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
  GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
- Specificity-aware reinforcement learning for fine-grained open-world classification
  SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.
- PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
  PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Triviality Corrected Endogenous Reward
  TCER corrects triviality bias in endogenous rewards for text generation by rewarding relative information gain modulated by probability correction, yielding consistent unsupervised improvements on writing benchmarks a...
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.