
arxiv: 2604.01591 · v2 · submitted 2026-04-02 · 💻 cs.AI

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Pith reviewed 2026-05-13 21:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords ThinkTwice · self-refinement · joint optimization · reasoning models · GRPO · binary reward · curriculum emergence · mathematical reasoning

The pith

ThinkTwice alternates between optimizing a model to reason and optimizing it to refine its own solutions, using the same binary correctness reward in both phases to improve both skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ThinkTwice as a two-phase training process: in each pair of training steps, it first optimizes a language model to solve reasoning problems and then optimizes it to refine its own solutions to the same problems. Both phases rely on the identical binary correctness signal, with no added critiques or labels. A reader would care because the approach is reported to produce stronger initial answers and better self-correction on math benchmarks than GRPO and other online policy optimization baselines, while also giving rise to an implicit training progression in which refinement mostly fixes errors early on and later shifts toward protecting already-correct solutions.

Core claim

ThinkTwice is a two-phase framework built on Group Relative Policy Optimization that jointly trains models by first solving reasoning problems and then refining their own solutions to those problems, using the identical binary correctness reward in each phase. Across five mathematical reasoning benchmarks and two model families, the method raises both pre-refinement and post-refinement accuracy over GRPO baselines, with specific gains of 5 points before refinement and 11.5 points after one refinement step on AIME for the Qwen3-4B model. Training dynamics reveal an implicit rectify-then-fortify curriculum in which refinement first corrects mistakes and later shifts to preserving already-recto

What carries the argument

The alternating two-phase optimization loop in which a model is trained first to produce correct solutions and then to refine those solutions using the shared binary correctness reward.
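The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `sample` and `update` are hypothetical callbacks standing in for the policy's generation and gradient step, the refinement prompt wording is invented, picking the draft at random is an assumption (the page does not say how drafts are chosen), and the exact-string reward is a toy stand-in for a real answer checker.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def binary_reward(answer: str, truth: str) -> float:
    """Toy binary correctness reward: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == truth.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def think_twice_step(
    problem: str,
    truth: str,
    sample: Callable[[str], str],                            # hypothetical: prompt -> model output
    update: Callable[[str, List[str], List[float]], None],   # hypothetical: policy update
    group_size: int = 4,
) -> Tuple[List[float], List[float]]:
    """One pair of training steps: reasoning phase, then refinement phase."""
    # Phase 1: sample a group of solutions and update on the reasoning task.
    solutions = [sample(problem) for _ in range(group_size)]
    reasoning_rewards = [binary_reward(s, truth) for s in solutions]
    update(problem, solutions, group_advantages(reasoning_rewards))

    # Phase 2: refine one of the model's own solutions to the same problem.
    # Assumption: the draft is picked at random; the prompt text is illustrative.
    draft = random.choice(solutions)
    refine_prompt = f"{problem}\n\nPrevious solution:\n{draft}\n\nReview and refine."
    refinements = [sample(refine_prompt) for _ in range(group_size)]
    # The same binary reward is reused, with no hint about whether the draft was correct.
    refinement_rewards = [binary_reward(r, truth) for r in refinements]
    update(refine_prompt, refinements, group_advantages(refinement_rewards))

    return reasoning_rewards, refinement_rewards
```

The point of the sketch is structural: both phases call the same reward function and the same update routine, and only the prompt differs.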

If this is right

  • Higher pass rates on math problems both before and after one self-refinement step.
  • An automatic curriculum emerges that first rectifies errors and later preserves correct answers.
  • The gains appear across multiple benchmarks and two different model families without extra annotations.
  • Joint training of reasoning and self-refinement is presented as a principled and effective methodology for reinforcement learning with verifiable rewards (RLVR).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alternating structure could be tested on sequential tasks outside mathematics where an initial output is later revised.
  • Self-refinement might become a built-in part of the generation policy rather than a separate post-processing step.
  • Applying different rewards to the two phases could be compared directly to isolate whether the shared reward is essential to the observed curriculum.

Load-bearing premise

The same binary correctness reward can be applied to both the reasoning phase and the refinement phase without causing instability or reward hacking during joint optimization.
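A short worked derivation shows what a binary reward implies for the learning signal. It assumes the standard GRPO group-normalized advantage with population standard deviation; the page does not state the exact normalization ThinkTwice uses, so the symbols $G$ (group size), $p$ (fraction of correct samples in the group), $\mu$ and $\sigma$ are introduced here for illustration only.

```latex
% Group of G sampled outputs with binary rewards r_i \in \{0, 1\},
% of which a fraction p are correct.
\[
  \mu = \frac{1}{G}\sum_{j=1}^{G} r_j = p,
  \qquad
  \sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu)^2} = \sqrt{p(1-p)},
  \qquad
  A_i = \frac{r_i - \mu}{\sigma}.
\]
% Substituting r_i = 1 and r_i = 0:
\[
  A_i =
  \begin{cases}
    \dfrac{1-p}{\sqrt{p(1-p)}} = \sqrt{\dfrac{1-p}{p}}, & r_i = 1,\\[2ex]
    \dfrac{-p}{\sqrt{p(1-p)}} = -\sqrt{\dfrac{p}{1-p}}, & r_i = 0.
  \end{cases}
\]
```

Under this assumption, a group in which every sample is correct or every sample is wrong ($p \in \{0, 1\}$) has $\sigma = 0$ and zero centered rewards, so it contributes no gradient. The same holds in either phase, which is one reason the stability of sharing the reward is worth checking empirically.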

What would settle it

A training run that exhibits instability, reward hacking, or no accuracy gain over GRPO when the two phases share the identical binary reward.

Figures

Figures reproduced from arXiv: 2604.01591 by Ashton Anderson, Blair Yang, Difan Jiao, Qianfeng Wen, Zhenwei Tang.

Figure 1: (A) Prompt-only reflection can reduce top frontier LLM’s performance on AIME24, …
Figure 2: ThinkTwice at a glance.
Figure 3: Cross-model refinement evaluation (average pass@4, …
Figure 4: Training dynamics of refinement across checkpoints. The vertical dashed lines …
Figure 5: Training-time cost and dynamics of ThinkTwice compared with GRPO. * denoted …
Figure 6: Reasoning pass@k curves across five mathematical reasoning benchmarks for Qwen3-4B (top) and OLMo3-7B (bottom). Panels: AIME, AMC, MATH500, Minerva, OlympiadBench; x-axis: k ∈ {1, 2, 4, 8, 16, 32}; y-axis: Pass@k (%); series: Base, GRPO, DrGRPO, DAPO, ThinkTwice.
Figure 7: Self-refinement pass@k curves across five mathematical reasoning benchmarks for Qwen3-4B (top) and OLMo3-7B (bottom). For reasoning ( …
Original abstract

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
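The abstract reports results as pass@4 but this page does not say how that metric is estimated. A common choice is the unbiased pass@k estimator, which computes from n samples with c correct the probability that at least one of k samples drawn without replacement is correct. The sketch below assumes that estimator; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for the problem
    c: number of those samples that are correct
    k: evaluation budget (for example 4 for pass@4)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `pass_at_k` over problems. When n equals k, the estimate reduces to 1 if any sample is correct and 0 otherwise.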

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ThinkTwice, a two-phase framework that jointly optimizes LLMs for reasoning and self-refinement via Group Relative Policy Optimization (GRPO). In alternating training steps, the model is first optimized on solving problems and then on refining its own outputs, using the identical binary correctness reward (1 if final answer correct) in both phases without critique annotations or external signals. The authors report substantial gains over GRPO baselines across five math benchmarks and two model families (Qwen3-4B, Olmo3-7B), including +5pp before refinement and +11.5pp after one refinement step on AIME for Qwen3-4B (pass@4). They further identify an emergent 'rectify-then-fortify' curriculum in the training dynamics.
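To make "identical binary correctness reward" concrete, here is a minimal sketch of such a reward. It assumes final answers are written inside `\boxed{...}` and compared by exact string match; both conventions are assumptions for illustration, and a real math verifier would normalize expressions before comparing.

```python
import re
from typing import Optional

# Matches \boxed{...} without nested braces (a simplifying assumption).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, if any."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def binary_correctness_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer equals the ground truth, else 0.0.

    The same function would score both a first-pass solution and a refined
    one, since it only looks at the final answer.
    """
    answer = extract_final_answer(output)
    if answer is not None and answer == ground_truth.strip():
        return 1.0
    return 0.0
```

Because the function ignores everything except the final answer, it cannot distinguish a genuine correction from an unchanged copy of a correct draft, which is the concern the referee raises in the second major comment.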

Significance. If the reported gains and curriculum hold under rigorous controls, the work provides a simple, annotation-free method for improving both initial reasoning and self-correction in LLMs through joint RLVR. The observation that refinement naturally shifts from error correction to preservation of correct solutions is a useful empirical finding that could guide future alternating-phase designs. However, the current experimental reporting leaves the magnitude and stability of these gains unverified.
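One way to quantify the shift from error correction to preservation is to tally, per checkpoint, how refinement changes correctness. The sketch below is an assumed analysis, not the paper's: the four labels are hypothetical names, with "rectified" standing for wrong-to-right and "preserved" for right-to-right transitions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LABELS = ("rectified", "preserved", "broken", "unfixed")


def classify_transition(before_correct: bool, after_correct: bool) -> str:
    """Label one refinement by correctness before and after."""
    if not before_correct and after_correct:
        return "rectified"   # wrong -> right
    if before_correct and after_correct:
        return "preserved"   # right -> right
    if before_correct and not after_correct:
        return "broken"      # right -> wrong
    return "unfixed"         # wrong -> wrong


def transition_shares(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Fraction of refinements in each category; empty dict if no pairs."""
    counts = Counter(classify_transition(b, a) for b, a in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS}
```

Computed at successive checkpoints, a rectify-then-fortify pattern would show the "rectified" share dominating among successful refinements early and the "preserved" share growing later.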

major comments (2)
  1. [Abstract and Results] Abstract and Results section: the central performance claims (5pp pre-refinement and 11.5pp post-refinement gains on AIME for Qwen3-4B, pass@4) are stated without any information on the number of independent runs, standard deviations, baseline implementation details, or data splits, rendering the deltas impossible to assess for statistical reliability.
  2. [Training Procedure] Training Procedure section: the claim that identical binary correctness reward across reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing yet unsupported by ablations on phase alternation schedule, GRPO group size, or controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).
minor comments (1)
  1. [Abstract] The abstract refers to 'five mathematical reasoning benchmarks' without naming them; an explicit list would improve clarity.
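The trivial-copying control requested in major comment 2 could be approximated with a simple similarity check between each initial solution and its refinement. This is a sketch under assumptions: character-level similarity from Python's `difflib` is a crude proxy, and the 0.95 threshold is an arbitrary illustrative value, not one taken from the paper.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def copy_rate(pairs: Iterable[Tuple[str, str]], threshold: float = 0.95) -> float:
    """Fraction of (initial, refined) pairs whose similarity meets the threshold.

    The threshold is a hypothetical cutoff for "near-identical"; it should be
    tuned on inspected examples before drawing conclusions.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    copies = sum(
        1 for initial, refined in pairs if similarity(initial, refined) >= threshold
    )
    return copies / len(pairs)
```

Reporting this rate separately for drafts that were already correct and drafts that were wrong would show whether the preserved solutions are re-derived or simply repeated.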

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the statistical reporting and empirical support in our work. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results section: the central performance claims (5pp pre-refinement and 11.5pp post-refinement gains on AIME for Qwen3-4B, pass@4) are stated without any information on the number of independent runs, standard deviations, baseline implementation details, or data splits, rendering the deltas impossible to assess for statistical reliability.

    Authors: We agree that additional details are needed for assessing reliability. In the revised manuscript, we will report results averaged over 3 independent runs with standard deviations for the AIME and other benchmark results, include more explicit baseline implementation details (e.g., exact hyperparameters and training steps), and clarify the data splits used. These updates will appear in the Results section and a dedicated Experimental Details appendix. revision: yes

  2. Referee: [Training Procedure] Training Procedure section: the claim that identical binary correctness reward across reasoning and refinement phases produces stable joint optimization and the observed rectify-then-fortify curriculum is load-bearing yet unsupported by ablations on phase alternation schedule, GRPO group size, or controls that would detect trivial copying (e.g., near-identical outputs when the initial solution is already correct).

    Authors: The training dynamics analysis in Section 4 already demonstrates the emergence of the rectify-then-fortify curriculum from the joint optimization with identical binary rewards, without requiring external critique signals. We will expand the discussion in the Training Procedure section to more explicitly link the identical reward to stable joint optimization. However, we cannot add the requested ablations due to computational resource limits. revision: no

standing simulated objections not resolved
  • Ablations on phase alternation schedule, GRPO group size, and controls for detecting trivial copying, which would require substantial additional compute not available for this revision.
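The multi-run reporting promised in the first response amounts to a mean and sample standard deviation per benchmark. A minimal sketch, with hypothetical function names, might look like this:

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


def format_summary(scores: Sequence[float], digits: int = 1) -> str:
    """Render scores as 'mean ± std (n=runs)' for a results table."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f} (n={len(scores)})"
```

For example, with placeholder values that are not from the paper, `format_summary([50.0, 52.0, 54.0])` returns `52.0 ± 2.0 (n=3)`. With only three runs the standard deviation is itself a noisy estimate, so a gap between methods is more convincing when it is several times larger than the reported spread.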

Circularity Check

0 steps flagged

No significant circularity: empirical RLVR with external reward and held-out evaluation

full rationale

The paper applies standard GRPO in alternating phases using an externally defined binary correctness reward (1 if final answer matches ground truth). Performance claims are measured on separate held-out benchmarks (AIME, etc.) via pass@4, not defined in terms of fitted internal quantities or training loss. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the observed curriculum is a post-hoc observation of training dynamics rather than an input assumption. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a single binary correctness signal suffices to drive both reasoning acquisition and subsequent self-refinement without additional supervision or architectural changes.

axioms (1)
  • domain assumption A single binary correctness reward is sufficient to train both the reasoning and self-refinement phases effectively
    The framework description states that the same reward is used in both phases without correctness signals or critique annotations.

pith-pipeline@v0.9.0 · 5518 in / 1297 out tokens · 53982 ms · 2026-05-13T21:51:55.429295+00:00 · methodology


