ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
Pith reviewed 2026-05-13 21:51 UTC · model grok-4.3
The pith
ThinkTwice alternates reasoning optimization with self-refinement using the same binary reward to improve both skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThinkTwice is a two-phase framework built on Group Relative Policy Optimization that jointly trains models by first solving reasoning problems and then refining their own solutions to those problems, using the identical binary correctness reward in each phase. Across five mathematical reasoning benchmarks and two model families, the method raises both pre-refinement and post-refinement accuracy over GRPO baselines, with specific gains of 5 points before refinement and 11.5 points after one refinement step on AIME for the Qwen3-4B model. Training dynamics reveal an implicit rectify-then-fortify curriculum in which refinement first corrects mistakes and later shifts to preserving already-recto
What carries the argument
The alternating two-phase optimization loop in which a model is trained first to produce correct solutions and then to refine those solutions using the shared binary correctness reward.
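The alternating loop can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `sample_answer`, `think_twice_step_pair`, the refinement prompt wording, and the choice to refine the first sampled solution are all hypothetical, and the policy-gradient update itself is omitted. Only the shared binary reward and the group-relative standardization of rewards follow the description above.

```python
import statistics
from typing import Callable, List, Tuple


def binary_reward(final_answer: str, reference: str) -> float:
    """Shared reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def think_twice_step_pair(
    problem: str,
    reference: str,
    sample_answer: Callable[[str], str],  # hypothetical: prompt -> final answer
    group_size: int = 4,
) -> Tuple[List[float], List[float]]:
    """Compute advantages for one reasoning step and one refinement step."""
    # Phase 1: solve the problem from scratch.
    first_pass = [sample_answer(problem) for _ in range(group_size)]
    reasoning_adv = group_relative_advantages(
        [binary_reward(a, reference) for a in first_pass]
    )

    # Phase 2: refine one of the model's own solutions to the same problem.
    # Assumption: we refine the first sample; the prompt carries no correctness signal.
    refine_prompt = f"{problem}\nPrevious solution: {first_pass[0]}\nReview and refine."
    second_pass = [sample_answer(refine_prompt) for _ in range(group_size)]
    refinement_adv = group_relative_advantages(
        [binary_reward(a, reference) for a in second_pass]
    )
    return reasoning_adv, refinement_adv
```

Note that a group whose rewards are all equal yields zero advantages for every sample, so such a group contributes no learning signal in either phase.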
If this is right
- Higher pass rates on math problems both before and after one self-refinement step.
- An automatic curriculum emerges that first rectifies errors and later preserves correct answers.
- The gains appear across multiple benchmarks and two different model families without extra annotations.
- Joint training of reasoning and refinement is presented as a direct methodology for reinforcement learning with verifiable rewards.
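One way to make the rectify-then-fortify pattern measurable is to classify each refinement by its correctness before and after. The sketch below is an assumption about how such accounting could be done; the function name and the four labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (correct before refinement, correct after refinement) -> label
_LABELS = {
    (False, True): "rectified",    # an error was corrected
    (True, True): "preserved",     # a correct solution stayed correct
    (True, False): "broken",       # a correct solution was damaged
    (False, False): "still_wrong", # the error survived refinement
}


def refinement_transitions(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the fraction of refinements that fall into each transition type."""
    counts = Counter(_LABELS[(bool(before), bool(after))] for before, after in pairs)
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in _LABELS.values()}
```

Tracked over training checkpoints, a curriculum of the kind described would show the "rectified" share dominating early and the "preserved" share growing later.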
Where Pith is reading between the lines
- The same alternating structure could be tested on sequential tasks outside mathematics where an initial output is later revised.
- Self-refinement might become a built-in part of the generation policy rather than a separate post-processing step.
- Applying different rewards to the two phases could be compared directly to isolate whether the shared reward is essential to the observed curriculum.
Load-bearing premise
The same binary correctness reward can be applied to both the reasoning phase and the refinement phase without causing instability or reward hacking during joint optimization.
What would settle it
A training run that exhibits instability, reward hacking, or no accuracy gain over GRPO when the two phases share the identical binary reward.
Original abstract
We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
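The headline numbers are reported as pass@4. For readers unfamiliar with the metric, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. Whether the paper uses this estimator or simply draws exactly four samples per problem is not stated here, so treat the choice as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that are correct
    k: sample budget being evaluated (k <= n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 4 the estimator reduces to a 0/1 indicator of whether any of the four samples is correct; the benchmark score is then the average over problems.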
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ThinkTwice, a two-phase framework that jointly optimizes LLMs for reasoning and self-refinement via Group Relative Policy Optimization (GRPO). In alternating training steps, the model is first optimized on solving problems and then on refining its own outputs, using the identical binary correctness reward (1 if final answer correct) in both phases without critique annotations or external signals. The authors report substantial gains over GRPO baselines across five math benchmarks and two model families (Qwen3-4B, Olmo3-7B), including +5pp before refinement and +11.5pp after one refinement step on AIME for Qwen3-4B (pass@4). They further identify an emergent 'rectify-then-fortify' curriculum in the training dynamics.
Significance. If the reported gains and curriculum hold under rigorous controls, the work provides a simple, annotation-free method for improving both initial reasoning and self-correction in LLMs through joint RLVR. The observation that refinement naturally shifts from error correction to preservation of correct solutions is a useful empirical finding that could guide future alternating-phase designs. However, the current experimental reporting leaves the magnitude and stability of these gains unverified.
major comments (2)
- [Abstract and Results] The central performance claims (gains of 5pp before refinement and 11.5pp after refinement on AIME for Qwen3-4B, pass@4) are stated without the number of independent runs, standard deviations, baseline implementation details, or data splits, so the statistical reliability of the deltas cannot be assessed.
- [Training Procedure] The claim that an identical binary correctness reward across the reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing, yet it is unsupported by ablations on the phase alternation schedule or GRPO group size, or by controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).
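The trivial-copying control mentioned in the second major comment could be approximated with a simple string-similarity check. The sketch below is one possible control and is not drawn from the paper: the function name, the 0.95 threshold, and the use of `difflib.SequenceMatcher` are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def copy_rate(
    records: Iterable[Tuple[str, str, bool]],  # (initial, refined, initial_was_correct)
    threshold: float = 0.95,                   # assumed cutoff for "near-identical"
) -> float:
    """Fraction of already-correct solutions whose refinement is a near copy."""
    correct = [(initial, refined) for initial, refined, ok in records if ok]
    if not correct:
        return 0.0
    near_copies = sum(
        SequenceMatcher(None, initial, refined).ratio() >= threshold
        for initial, refined in correct
    )
    return near_copies / len(correct)
```

A high copy rate on already-correct inputs would suggest that the "fortify" stage is largely verbatim repetition and not re-verification, which is the distinction the referee asks the authors to establish.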
minor comments (1)
- [Abstract] The abstract refers to 'five mathematical reasoning benchmarks' without naming them; an explicit list would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights opportunities to strengthen the statistical reporting and empirical support in our work. We address each major comment below.
- Referee: [Abstract and Results] The central performance claims (gains of 5pp before refinement and 11.5pp after refinement on AIME for Qwen3-4B, pass@4) are stated without the number of independent runs, standard deviations, baseline implementation details, or data splits, so the statistical reliability of the deltas cannot be assessed.
  Authors: We agree that additional details are needed to assess reliability. In the revised manuscript, we will report results averaged over 3 independent runs with standard deviations for AIME and the other benchmarks, include more explicit baseline implementation details (e.g., exact hyperparameters and training steps), and clarify the data splits used. These updates will appear in the Results section and a dedicated Experimental Details appendix.
  Revision: yes
- Referee: [Training Procedure] The claim that an identical binary correctness reward across the reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing, yet it is unsupported by ablations on the phase alternation schedule or GRPO group size, or by controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).
  Authors: The training dynamics analysis in Section 4 already demonstrates that the rectify-then-fortify curriculum emerges from joint optimization with identical binary rewards, without requiring external critique signals. We will expand the discussion in the Training Procedure section to link the identical reward to stable joint optimization more explicitly. However, we cannot add the requested ablations due to computational resource limits.
  Revision: no
- Ablations on phase alternation schedule, GRPO group size, and controls for detecting trivial copying, which would require substantial additional compute not available for this revision.
Circularity Check
No significant circularity: empirical RLVR with external reward and held-out evaluation
The paper applies standard GRPO in alternating phases using an externally defined binary correctness reward (1 if final answer matches ground truth). Performance claims are measured on separate held-out benchmarks (AIME, etc.) via pass@4, not defined in terms of fitted internal quantities or training loss. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the observed curriculum is a post-hoc observation of training dynamics rather than an input assumption. The method is self-contained against external benchmarks.
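For reference, the two quantities this rationale relies on can be written out. The reward follows the passage quoted from the paper on this page; the advantage is the standard group-relative form used in GRPO. The symbol $G$ for the group size and the reading of $E$ as an answer-extraction function are assumptions, and the paper's exact variant (for example, whether it divides by the standard deviation) is not shown here.

```latex
r_i = \mathbb{1}\!\left[ E(y_i) = a^{*} \right],
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                 {\operatorname{std}(r_1, \dots, r_G)}
```

Because $r_i$ depends only on the externally supplied ground truth $a^{*}$, neither quantity is defined in terms of the benchmark scores later used for evaluation, which is the basis of the no-circularity verdict.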
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a single binary correctness reward is sufficient to train both the reasoning and self-refinement phases effectively.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ThinkTwice alternates Phase 1 reasoning GRPO and Phase 2 refinement GRPO using the identical binary correctness reward $r_i = \mathbb{1}[E(y_i) = a^{*}]$, without critique or process signals.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: implicit rectify-then-fortify curriculum from joint optimization.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.