Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models
Pith reviewed 2026-05-21 14:06 UTC · model grok-4.3
The pith
Augmenting each training question with multiple equivalent rephrasings lets GRPO avoid zero gradients and diversity collapse in LLM reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TA-GRPO generates multiple problem-equivalent rephrasings for each training question by altering wording, format, and information order while preserving meaning. These rephrasings shift the model's perceived difficulty, so that sampling responses across the set yields mixed rewards rather than uniform ones. Advantages are computed jointly over the expanded response pool and all importance ratios are aligned to the original question, allowing the policy to learn from a richer collection of solution attempts without changing the underlying reward model.
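The pooling step described above can be illustrated with a small numeric sketch. It is not the paper's implementation: the function names, the list layout (index 0 holds the original question) and the `eps` stabiliser are assumptions made for illustration, and only the standard group-relative normalisation (reward minus group mean, divided by group standard deviation) is used.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pooled_advantages(rewards_by_variant, eps=1e-6):
    """Normalise rewards jointly over the original question and its rephrasings.

    rewards_by_variant is a list of reward lists; index 0 is the original
    question (a hypothetical layout, not taken from the paper).
    """
    flat = [r for group in rewards_by_variant for r in group]
    adv = group_advantages(flat, eps)
    # Split the flat advantage list back into one list per variant.
    out, start = [], 0
    for group in rewards_by_variant:
        out.append(adv[start:start + len(group)])
        start += len(group)
    return out

# Original question: every sampled response fails, so rewards are uniform.
original = [0.0, 0.0, 0.0, 0.0]
# Two rephrasings on which the model sometimes succeeds (made-up rewards).
rephrasings = [[0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

per_group = group_advantages(original)                 # all zeros: no signal
pooled = pooled_advantages([original] + rephrasings)   # mixed signs: usable signal
```

With uniform rewards the per-group advantages are all zero, whereas in the pooled group the failed responses to the original question receive a negative advantage and the successful responses to the rephrasings a positive one.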
What carries the argument
Transformation-Augmented GRPO augments each training question with automatically generated, semantically equivalent rephrasings, which produce mixed rewards and diverse reasoning paths for advantage computation.
If this is right
- Improves average pass@32 scores by 4.97 points on Qwen3-1.7B and 4.34 points on Qwen3-4B across competition math and science benchmarks.
- Achieves exploration quality comparable to baselines trained on up to 2.5 times more data.
- Delivers consistent gains on both in-distribution benchmarks like AIME and out-of-distribution ones like GPQA-Diamond.
- Applies successfully to multiple model families including Qwen3 and Llama-3.2 series.
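For readers unfamiliar with the metric behind these numbers, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n responses are sampled per problem and c of them are correct. This summary does not say which estimator the paper uses, so the sketch below is an assumption about the evaluation, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level score: average the per-problem estimates (made-up counts).
correct_counts = [0, 1, 4]
scores = [pass_at_k(n=4, c=c, k=2) for c in correct_counts]
average = sum(scores) / len(scores)
```

A problem with no correct sample scores 0, and a problem where every k-subset must contain a correct sample scores 1, so the average rewards coverage across problems rather than per-sample accuracy.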
Where Pith is reading between the lines
- The rephrasing method could reduce reliance on curating large and varied training datasets by creating synthetic variety on the fly.
- Similar transformations might enhance other reward-based optimization techniques that suffer from sparse or uniform signals in language model alignment.
- Extending the approach to non-verifiable reward settings or to long-horizon planning tasks would test its broader utility beyond math reasoning.
Load-bearing premise
The generated rephrasings preserve exact semantic equivalence and meaningfully shift the model's perceived difficulty without introducing new ambiguities or solution shortcuts that would invalidate the reward signal.
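One crude way to probe this premise is to compare the final answers a solver produces for the original question with those it produces for each rephrasing. The sketch below is purely illustrative: the helper names, the Jaccard overlap measure and the 0.5 threshold are assumptions, not anything reported by the paper, and a low overlap only flags a rephrasing for inspection rather than proving non-equivalence.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of final answers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty answer sets are treated as identical
    return len(a & b) / len(a | b)

def flag_suspect_rephrasings(original_answers, rephrasing_answers, threshold=0.5):
    """Return indices of rephrasings whose answer set overlaps poorly with the original.

    threshold is a hypothetical cut-off chosen for illustration only.
    """
    return [
        i for i, answers in enumerate(rephrasing_answers)
        if jaccard(original_answers, answers) < threshold
    ]

# Made-up final answers sampled for one question and two rephrasings.
original_answers = ["42", "42", "41"]
rephrasing_answers = [["42", "41"], ["7", "8"]]
suspects = flag_suspect_rephrasings(original_answers, rephrasing_answers)
```

Here the second rephrasing shares no answers with the original and is the only one flagged.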
What would settle it
If applying TA-GRPO to the same training questions as standard GRPO produces no gains in pass@k or no measurable increase in reasoning-path diversity, the claimed benefit of the augmentation would be falsified.
Original abstract
Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses across the original and its rephrasings yields mixed rewards and more diverse reasoning paths. TA-GRPO jointly computes advantages over this expanded response set and aligns all importance ratios to the original question, enabling the model to learn from a richer set of solution attempts. Experiments on four LLMs (Qwen3-1.7B, Qwen3-4B, Llama-3.2-1B, Llama-3.2-3B) show that TA-GRPO consistently improves pass@k on competition-level benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution benchmarks (Minerva, GPQA-Diamond). Notably, it improves the average pass@32 of Qwen3-1.7B and Qwen3-4B by 4.97 and 4.34 points, respectively, and matches the exploration quality of baselines trained on up to 2.5× more data.
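The abstract's two mechanisms can be written out as follows. The first line is the standard GRPO group-relative advantage; the pooled advantage and the importance ratio conditioned on the original question are one plausible reading of "jointly computes advantages" and "aligns all importance ratios to the original question", with hypothetical notation (q for the original question, q^{(m)} for its rephrasings with q^{(0)} = q, and o_{m,i} for the i-th response sampled for q^{(m)}). The paper's exact formulas may differ.

```latex
% Standard GRPO advantage for G responses to one question.
% If r_1 = \dots = r_G, every numerator is zero and the group contributes no gradient.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Assumed pooled form: normalise over all responses to the original and its M rephrasings.
\hat{A}_{m,i} = \frac{r_{m,i} - \operatorname{mean}_{m',j}\, r_{m',j}}{\operatorname{std}_{m',j}\, r_{m',j}},
\qquad m = 0,\dots,M,\quad i = 1,\dots,G

% Assumed alignment: the token-level ratio is evaluated on the original question q,
% even when o_{m,i} was sampled for a rephrasing q^{(m)}.
\rho_{m,i,t}(\theta) =
  \frac{\pi_{\theta}\!\left(o_{m,i,t} \mid q,\, o_{m,i,<t}\right)}
       {\pi_{\theta_{\text{old}}}\!\left(o_{m,i,t} \mid q,\, o_{m,i,<t}\right)}
```

Under this reading, a question whose own responses all receive the same reward can still yield non-zero advantages as long as at least one rephrasing produces a different reward.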
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Transformation-Augmented GRPO (TA-GRPO), an extension of Group Relative Policy Optimization that generates multiple automatically produced rephrasings of each training question. These rephrasings are pooled with the originals to produce mixed rewards and more diverse reasoning paths while aligning importance ratios to the original question. The method is evaluated on four LLMs across competition-level math benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution sets (Minerva, GPQA-Diamond), reporting average pass@32 gains of 4.97 and 4.34 points for Qwen3-1.7B and Qwen3-4B respectively, together with exploration quality matching baselines trained on up to 2.5× more data.
Significance. If the rephrasings preserve semantic equivalence and the advantage computation remains valid, TA-GRPO provides a practical, data-efficient route to improved exploration and reasoning performance in verifiable-reward RL for LLMs. The reported gains on both in-distribution and OOD benchmarks, achieved without extra data, would be a useful contribution to the GRPO literature.
major comments (2)
- [Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.
- [Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.
minor comments (1)
- [Abstract] The abstract claims 'problem-equivalent rephrasings' but does not define the precise criteria used by the automatic generator; a short description or pseudocode in the methods section would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and clarifications that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.
  Authors: We agree that explicit verification of semantic equivalence is important to substantiate the central claims. In the revised manuscript we have added a dedicated subsection in the Methods that reports solution-set overlap statistics computed over a held-out sample of training questions, along with an ablation that removes rephrasings whose generated solutions deviate from the original problem's solution set. These additions confirm that the observed gains arise from increased response diversity rather than reward noise or shortcuts.
  Revision: yes
- Referee: [Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.
  Authors: We appreciate the request for greater implementation detail. The revised Methods section now includes the exact advantage formula used when responses from both the original question and its rephrasings are pooled, together with the precise alignment of importance ratios to the original question's policy. We explicitly state that no post-hoc filtering of rephrasings was performed; all automatically generated rephrasings were retained. We have also inserted pseudocode for the full TA-GRPO procedure to aid reproducibility.
  Revision: yes
Circularity Check
No significant circularity; algorithmic method evaluated on external benchmarks
full rationale
The paper presents TA-GRPO as a direct algorithmic extension to GRPO that pools responses from automatically generated rephrasings and aligns importance ratios to the original question. All reported gains (e.g., +4.97 pass@32 on Qwen3-1.7B) are obtained from evaluation on fixed external public benchmarks (AMC, AIME24/25, Minerva, GPQA-Diamond) rather than from any internal fit or self-referential quantity. No equations reduce the advantages, gradients, or performance metrics to quantities defined inside the same training run, and no load-bearing step relies on a self-citation chain or uniqueness theorem. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Automatically generated rephrasings preserve the underlying problem's meaning and validity.