ThinkTwice jointly optimizes LLMs for reasoning and self-refinement via a two-phase GRPO process, yielding gains of 5 points before and 11.5 points after refinement on AIME for Qwen3-4B.
The refinement instruction is task-agnostic and contains no correctness signals, ensuring the model learns self-refinement without external supervision.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement