ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Ashton Anderson; Blair Yang; Difan Jiao; Qianfeng Wen; Zhenwei Tang

arxiv: 2604.01591 · v2 · submitted 2026-04-02 · 💻 cs.AI

Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson

Pith reviewed 2026-05-13 21:51 UTC · model grok-4.3

classification 💻 cs.AI

keywords ThinkTwice · self-refinement · joint optimization · reasoning models · GRPO · binary reward · curriculum emergence · mathematical reasoning

The pith

ThinkTwice alternates reasoning optimization with self-refinement using the same binary reward to improve both skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ThinkTwice as a two-phase training process that first optimizes a language model to solve reasoning problems and then optimizes it to refine its own outputs on the same problems. Both phases rely on the identical binary correctness signal without any added critiques or labels. A reader would care because the approach produces stronger initial answers and better self-correction on math benchmarks than standard policy optimization, while also generating an automatic training progression that fixes errors early and later protects correct solutions.

Core claim

ThinkTwice is a two-phase framework built on Group Relative Policy Optimization that jointly trains models by first solving reasoning problems and then refining their own solutions to those problems, using the identical binary correctness reward in each phase. Across five mathematical reasoning benchmarks and two model families, the method raises both pre-refinement and post-refinement accuracy over GRPO baselines, with specific gains (pass@4) of 5 percentage points before refinement and 11.5 points after one refinement step on AIME for the Qwen3-4B model. Training dynamics reveal an implicit rectify-then-fortify curriculum in which refinement first corrects mistakes and later shifts to preserving already-correct solutions.

What carries the argument

The alternating two-phase optimization loop in which a model is trained first to produce correct solutions and then to refine those solutions using the shared binary correctness reward.
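The alternation described above can be sketched roughly as follows. This is an illustrative skeleton only, not the authors' implementation: `solve`, `refine`, and `update` are hypothetical callbacks standing in for sampling and for a GRPO policy update, exact string matching stands in for the answer checker, and refining the first sample of each group is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Problem = Tuple[str, str]  # (question, reference answer)


def binary_reward(final_answer: str, reference: str) -> float:
    """Shared reward for both phases: 1.0 on an exact match, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def think_twice_step(
    problems: Sequence[Problem],
    solve: Callable[[str], str],
    refine: Callable[[str, str], str],
    update: Callable[[str, List[List[float]]], None],
    group_size: int = 4,
) -> None:
    """One pair of training steps: a reasoning update, then a refinement update."""
    drafts: List[str] = []
    reason_rewards: List[List[float]] = []
    # Phase 1: sample a group of answers per problem and score each one.
    for question, reference in problems:
        group = [solve(question) for _ in range(group_size)]
        reason_rewards.append([binary_reward(y, reference) for y in group])
        drafts.append(group[0])  # assumption: the first sample is the one refined
    update("reason", reason_rewards)

    # Phase 2: refine the model's own drafts on the same problems, same reward.
    refine_rewards: List[List[float]] = []
    for (question, reference), draft in zip(problems, drafts):
        group = [refine(question, draft) for _ in range(group_size)]
        refine_rewards.append([binary_reward(y, reference) for y in group])
    update("refine", refine_rewards)
```

Note that `refine` receives only the question and the draft, never the reference answer, which mirrors the claim that no correctness signal is shown to the model during refinement.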

If this is right

Higher pass rates on math problems both before and after one self-refinement step.
An automatic curriculum emerges that first rectifies errors and later preserves correct answers.
The gains appear across multiple benchmarks and two different model families without extra annotations.
Joint training of reasoning and refinement is presented as a direct methodology for reinforcement learning with verifiable rewards.
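One way to make the rectify-then-fortify claim measurable is to tally, per checkpoint, how refinement changes correctness. The sketch below is an assumption about how such a tally might look; the four label names are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Sequence

# Hypothetical labels for (correct before refinement, correct after refinement).
_LABELS = {
    (False, True): "rectified",   # refinement fixed an error
    (True, True): "preserved",    # refinement kept a correct answer
    (True, False): "broken",      # refinement damaged a correct answer
    (False, False): "unfixed",    # refinement failed to fix an error
}


def refinement_transitions(before: Sequence[bool], after: Sequence[bool]) -> Counter:
    """Count correctness transitions between drafts and their refinements."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return Counter(_LABELS[(bool(b), bool(a))] for b, a in zip(before, after))
```

Under the curriculum described above, one would expect the "rectified" share to dominate at early checkpoints and the "preserved" share to grow later; this function only counts and does not establish that pattern.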

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alternating structure could be tested on sequential tasks outside mathematics where an initial output is later revised.
Self-refinement might become a built-in part of the generation policy rather than a separate post-processing step.
Applying different rewards to the two phases could be compared directly to isolate whether the shared reward is essential to the observed curriculum.

Load-bearing premise

The same binary correctness reward can be applied to both the reasoning phase and the refinement phase without causing instability or reward hacking during joint optimization.
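To see what a binary reward gives the optimizer to work with, here is a minimal sketch of the group-relative normalization commonly associated with GRPO. The use of the population standard deviation and the `eps` constant are assumptions of this sketch; the source does not specify these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With 0/1 rewards, a group in which every sample is correct (or every sample is wrong) yields all-zero advantages, so only mixed groups contribute a gradient under this normalization.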

What would settle it

A training run that exhibits instability, reward hacking, or no accuracy gain over GRPO when the two phases share the identical binary reward.

Figures

Figures reproduced from arXiv: 2604.01591 by Ashton Anderson, Blair Yang, Difan Jiao, Qianfeng Wen, Zhenwei Tang.

**Figure 2.** ThinkTwice at a glance.

**Figure 3.** Cross-model refinement evaluation (average pass@4, …)

**Figure 4.** Training dynamics of refinement across checkpoints. The vertical dashed lines …

**Figure 5.** Training-time cost and dynamics of ThinkTwice compared with GRPO. * denoted …

**Figure 6.** Reasoning pass@k curves across five mathematical reasoning benchmarks for Qwen3-4B (top) and OLMo3-7B (bottom). Panels: AIME, AMC, MATH500, Minerva, OlympiadBench; x-axis: k in {1, 2, 4, 8, 16, 32}; y-axis: Pass@k (%); series: Base, GRPO, DrGRPO, DAPO, ThinkTwice.

**Figure 7.** Self-refinement pass@k curves across five mathematical reasoning benchmarks for Qwen3-4B (top) and OLMo3-7B (bottom). …

Original abstract

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
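The abstract reports results as pass@4 but does not say how the metric is computed. A minimal sketch, assuming the widely used unbiased pass@k estimator (n samples per problem, c of them correct), would be:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n drawn samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When n equals k, the estimate reduces to an indicator of whether any sample was correct; whether the paper samples more than four completions per problem is not stated in the source.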

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

ThinkTwice adds a clean two-phase GRPO alternation for joint reasoning and self-refinement on the same problems, delivering reported gains over plain GRPO, but the experimental details are too thin to trust the size of those gains yet.

The main point is straightforward: train the model first on solving math problems with GRPO, then immediately on refining its own outputs to those same problems, using the identical binary correctness reward in both phases. On Qwen3-4B this produces a 5-point lift on AIME before refinement and an 11.5-point lift after one refinement step, with similar patterns on other benchmarks and on the Olmo3-7B model. The training dynamics section shows the refinement phase naturally moving from fixing errors early to mostly preserving correct answers later, a useful observation that emerges without extra machinery or annotations. That simplicity is the real strength here: the method reuses existing signals and produces an implicit curriculum that looks practical for RLVR setups.

The soft spots sit in the evaluation. The abstract gives specific deltas but supplies no run counts, standard deviations, baseline implementation details, or data-split information, so the central performance claims rest on single reported numbers. The stress-test concern about trivial edits or reward hacking in the refinement phase also lands: without controls on edit distance or phase scheduling, it is unclear whether the joint training genuinely improves refinement or simply amplifies the base policy when answers are already correct.

This work is aimed at groups already running online policy optimization on reasoning models. Readers who want a low-overhead recipe to try on math benchmarks will get immediate value from the method description and the curriculum analysis. It is worth sending to peer review because the core procedure is simple and the reported effect, if it holds under tighter controls, would be useful; referees will need to press for variance numbers and ablations on the refinement phase before the gains can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The paper introduces ThinkTwice, a two-phase framework that jointly optimizes LLMs for reasoning and self-refinement via Group Relative Policy Optimization (GRPO). In alternating training steps, the model is first optimized on solving problems and then on refining its own outputs, using the identical binary correctness reward (1 if final answer correct) in both phases without critique annotations or external signals. The authors report substantial gains over GRPO baselines across five math benchmarks and two model families (Qwen3-4B, Olmo3-7B), including +5pp before refinement and +11.5pp after one refinement step on AIME for Qwen3-4B (pass@4). They further identify an emergent 'rectify-then-fortify' curriculum in the training dynamics.

Significance. If the reported gains and curriculum hold under rigorous controls, the work provides a simple, annotation-free method for improving both initial reasoning and self-correction in LLMs through joint RLVR. The observation that refinement naturally shifts from error correction to preservation of correct solutions is a useful empirical finding that could guide future alternating-phase designs. However, the current experimental reporting leaves the magnitude and stability of these gains unverified.

major comments (2)

[Abstract and Results] Abstract and Results section: the central performance claims (5pp pre-refinement and 11.5pp post-refinement gains on AIME for Qwen3-4B, pass@4) are stated without any information on the number of independent runs, standard deviations, baseline implementation details, or data splits, rendering the deltas impossible to assess for statistical reliability.
[Training Procedure] Training Procedure section: the claim that identical binary correctness reward across reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing yet unsupported by ablations on phase alternation schedule, GRPO group size, or controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).
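The trivial-copying control requested in the second major comment could be approximated with a simple similarity check between each draft and its refinement. The sketch below is one possible control, not something the paper reports; the 0.95 threshold is an arbitrary assumption, and character-level similarity is a crude proxy for "near-identical outputs".

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def copy_rate(pairs: Sequence[Tuple[str, str]], threshold: float = 0.95) -> float:
    """Fraction of (draft, refinement) pairs whose similarity ratio meets the threshold."""
    if not pairs:
        return 0.0
    near_copies = sum(
        SequenceMatcher(None, draft, refined).ratio() >= threshold
        for draft, refined in pairs
    )
    return near_copies / len(pairs)
```

Reporting this rate separately for drafts that were already correct and for drafts that were wrong would show whether refinement rewards are being earned mostly by restating correct solutions.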

minor comments (1)

[Abstract] The abstract refers to 'five mathematical reasoning benchmarks' without naming them; an explicit list would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the statistical reporting and empirical support in our work. We address each major comment below.

Referee: [Abstract and Results] Abstract and Results section: the central performance claims (5pp pre-refinement and 11.5pp post-refinement gains on AIME for Qwen3-4B, pass@4) are stated without any information on the number of independent runs, standard deviations, baseline implementation details, or data splits, rendering the deltas impossible to assess for statistical reliability.

Authors: We agree that additional details are needed for assessing reliability. In the revised manuscript, we will report results averaged over 3 independent runs with standard deviations for the AIME and other benchmark results, include more explicit baseline implementation details (e.g., exact hyperparameters and training steps), and clarify the data splits used. These updates will appear in the Results section and a dedicated Experimental Details appendix.
revision: yes
Referee: [Training Procedure] Training Procedure section: the claim that identical binary correctness reward across reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing yet unsupported by ablations on phase alternation schedule, GRPO group size, or controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).

Authors: The training dynamics analysis in Section 4 already demonstrates the emergence of the rectify-then-fortify curriculum from the joint optimization with identical binary rewards, without requiring external critique signals. We will expand the discussion in the Training Procedure section to more explicitly link the identical reward to stable joint optimization. However, we cannot add the requested ablations due to computational resource limits.
revision: no
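The run-level reporting promised in the first response (averages over independent runs with standard deviations) amounts to a small calculation. A minimal sketch follows; the numbers in it are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical placeholder values for three runs, not results from the paper.
placeholder_pass_at_4 = [50.0, 52.0, 54.0]
avg, sd = summarize_runs(placeholder_pass_at_4)
print(f"{avg:.1f} ± {sd:.1f}")
```

With only three runs the standard deviation is itself a noisy estimate, so it indicates spread rather than settling statistical significance.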

standing simulated objections not resolved

Ablations on phase alternation schedule, GRPO group size, and controls for detecting trivial copying, which would require substantial additional compute not available for this revision.

Circularity Check

0 steps flagged

No significant circularity: empirical RLVR with external reward and held-out evaluation

The paper applies standard GRPO in alternating phases using an externally defined binary correctness reward (1 if final answer matches ground truth). Performance claims are measured on separate held-out benchmarks (AIME, etc.) via pass@4, not defined in terms of fitted internal quantities or training loss. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the observed curriculum is a post-hoc observation of training dynamics rather than an input assumption. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a single binary correctness signal suffices to drive both reasoning acquisition and subsequent self-refinement without additional supervision or architectural changes.

axioms (1)

domain assumption: A single binary correctness reward is sufficient to train both the reasoning and self-refinement phases effectively
The framework description states that the same reward is used in both phases without correctness signals or critique annotations.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: ThinkTwice alternates Phase 1 reasoning GRPO and Phase 2 refinement GRPO using identical binary correctness reward $r_i = \mathbb{1}[E(y_i) = a^*]$ without critique or process signals

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: implicit rectify-then-fortify curriculum from joint optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] Check if there are any errors in calculations, logic, or problem understanding

Go through each calculation step-by-step. Check if there are any errors in calculations, logic, or problem understanding

[2] If you find any mistakes, explicitly point out what was wrong and explain the correct approach

[3] If the solution is already correct, verify each step and explain it more clearly

[4] The refinement instruction is task-agnostic and contains no correctness signals, ensuring the model learns self-refinement without external supervision

Finally, after finishing the review, provide your refined solution and answer. The refinement instruction is task-agnostic and contains no correctness signals, ensuring the model learns self-refinement without external supervision. B.1.3 Hyperparameter Configuration Table 4 summarizes the key hyperparameters for ThinkTwice training. B.2 Implementation of ...

[5] But maybe we can look for a telescoping pattern? Let’s compute the expression for small values of n and see if we can spot a pattern

and REFLEXION (Shinn et al., 2023). All baselines are evaluated in non-thinking mode. To isolate the effect of the inference procedure from prompt engineering, we keep the refinement instruction fixed across all refinement-based baselines. Prompt formatting. Each evaluation example is rendered with the model’s chat template and an added generation prompt. For...
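The extracted fragments above appear to come from the paper's refinement instruction. As a rough illustration of how such an instruction might be assembled into a prompt, the sketch below reuses only the quoted sentences; the ordering, the numbering, and the "Problem:" and "Previous solution:" labels are assumptions of this sketch, not the paper's actual template.

```python
# Sentences quoted from the extracted fragments; their arrangement is assumed.
REVIEW_STEPS = [
    "Go through each calculation step-by-step. Check if there are any errors "
    "in calculations, logic, or problem understanding",
    "If you find any mistakes, explicitly point out what was wrong and explain "
    "the correct approach",
    "If the solution is already correct, verify each step and explain it more clearly",
]
CLOSING = "Finally, after finishing the review, provide your refined solution and answer."


def build_refinement_prompt(question: str, draft: str) -> str:
    """Assemble a task-agnostic refinement prompt from a question and a draft."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(REVIEW_STEPS, start=1))
    return (
        f"Problem:\n{question}\n\n"            # hypothetical section label
        f"Previous solution:\n{draft}\n\n"     # hypothetical section label
        f"{steps}\n{CLOSING}"
    )
```

The function takes no reference answer, so the prompt cannot leak whether the draft was correct, which is the property the fragments emphasize.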
