Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models

Chi-Heng Lin; Khiem Le; Nitesh V. Chawla; Phuc Nguyen; Shangqian Gao; Ting Hua; Youssef Mroueh

arxiv: 2601.22478 · v5 · pith:EMHCOMD2new · submitted 2026-01-30 · 💻 cs.LG

Khiem Le, Phuc Nguyen, Youssef Mroueh, Chi-Heng Lin, Shangqian Gao, Ting Hua, Nitesh V. Chawla

Pith reviewed 2026-05-21 14:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords: GRPO, reinforcement learning, large language models, reasoning, question rephrasing, exploration, policy optimization, LLM fine-tuning

The pith

Augmenting each training question with multiple equivalent rephrasings lets GRPO avoid zero gradients and diversity collapse in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Group Relative Policy Optimization suffers from gradient vanishing when all responses receive identical rewards and from collapse to uniform reasoning patterns. TA-GRPO counters this by automatically generating several problem-equivalent rephrasings of each training question that change wording and structure while keeping the core meaning intact. Pooling responses across original and rephrased questions produces varied rewards and encourages diverse solution strategies. Jointly computing advantages over this larger set and aligning sampling ratios back to the original question lets the model benefit from richer training signals. If correct, this yields higher pass rates on difficult reasoning benchmarks using the same data volume.
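The gradient-vanishing case can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name `group_advantages` and the `eps` constant are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Assumption: mean/std normalization over one group of sampled responses.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A question that is too hard: every sampled response is wrong.
too_hard = [0.0, 0.0, 0.0, 0.0]
# A question of moderate difficulty: rewards are mixed.
mixed = [1.0, 0.0, 0.0, 1.0]

print(group_advantages(too_hard))  # all advantages are 0.0, so no learning signal
print(group_advantages(mixed))     # positive for correct, negative for incorrect
```

When every reward in the group is identical, each advantage is zero and the policy-gradient term for that question contributes nothing to the update.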

Core claim

TA-GRPO generates multiple problem-equivalent rephrasings for each training question by altering wording, format, and information order while preserving meaning. These rephrasings shift the model's perceived difficulty, so that sampling responses across the set yields mixed rewards rather than uniform ones. Advantages are computed jointly over the expanded response pool and all importance ratios are aligned to the original question, allowing the policy to learn from a richer collection of solution attempts without changing the underlying reward model.
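A rough sketch of the pooling idea follows. The exact advantage formula is not given on this page, so the sketch assumes plain mean/std normalization over the pooled rewards; `pooled_advantages` and the prompt labels are hypothetical names used only for illustration.

```python
from statistics import mean, pstdev

def pooled_advantages(rewards_by_prompt, eps=1e-6):
    """Normalize rewards jointly over the original question and its rephrasings.

    rewards_by_prompt: dict mapping a prompt label to the list of rewards
    for responses sampled from that prompt.
    Assumption: one shared mean and std across the whole pool.
    """
    pooled = [r for rewards in rewards_by_prompt.values() for r in rewards]
    mu, sigma = mean(pooled), pstdev(pooled)
    return {
        label: [(r - mu) / (sigma + eps) for r in rewards]
        for label, rewards in rewards_by_prompt.items()
    }

# The original question yields uniform failures, but two rephrasings do not.
rewards = {
    "original": [0.0, 0.0, 0.0, 0.0],
    "rephrase_1": [0.0, 1.0, 0.0, 0.0],
    "rephrase_2": [1.0, 0.0, 1.0, 0.0],
}
adv = pooled_advantages(rewards)
print(adv["original"])    # nonzero (negative) instead of all zeros
print(adv["rephrase_1"])  # the correct response gets a positive advantage
```

In this toy case the responses to the original question would have received zero advantage under per-question normalization; pooling gives them a nonzero signal because the rephrasings contribute successes to the shared statistics.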

What carries the argument

Transformation-Augmented GRPO, which augments each training question with automatically generated semantically equivalent rephrasings to produce mixed rewards and diverse reasoning paths for advantage computation.

If this is right

Improves average pass@32 scores by 4.97 points on Qwen3-1.7B and 4.34 points on Qwen3-4B across competition math and science benchmarks.
Achieves exploration quality comparable to baselines trained on up to 2.5 times more data.
Delivers consistent gains on both in-distribution benchmarks like AIME and out-of-distribution ones like GPQA-Diamond.
Applies successfully to multiple model families including Qwen3 and Llama-3.2 series.
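For readers unfamiliar with the metric, pass@k is often computed with the unbiased estimator popularized in code-generation evaluation. This page does not say which estimator the paper uses, so the sketch below is an assumption about the usual convention, with `n` samples per problem of which `c` are correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of sampled responses, c: number of correct ones, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for one problem: 64 samples, 3 correct.
print(pass_at_k(n=64, c=3, k=1))
print(pass_at_k(n=64, c=3, k=32))
```

A benchmark score is then the mean of this quantity over problems, which is why a few extra problems solved at least once among many samples can move pass@32 noticeably.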

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The rephrasing method could reduce reliance on curating large and varied training datasets by creating synthetic variety on the fly.
Similar transformations might enhance other reward-based optimization techniques that suffer from sparse or uniform signals in language model alignment.
Extending the approach to non-verifiable reward settings or to long-horizon planning tasks would test its broader utility beyond math reasoning.

Load-bearing premise

The generated rephrasings preserve exact semantic equivalence and meaningfully shift the model's perceived difficulty without introducing new ambiguities or solution shortcuts that would invalidate the reward signal.
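One way such a premise could be probed is an answer-set overlap check. The sketch below is a proxy only, not the paper's procedure: it assumes each question and rephrasing comes with a set of accepted final answers from some independent source, and the names `answer_overlap` and `keep_equivalent` are hypothetical.

```python
def answer_overlap(original_answers, rephrased_answers):
    """Jaccard overlap between two sets of accepted final answers."""
    a, b = set(original_answers), set(rephrased_answers)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def keep_equivalent(rephrasings, original_answers, threshold=1.0):
    """Keep rephrasings whose answer set overlaps the original's enough.

    rephrasings: dict mapping rephrased question text to its answer set
    (assumed to be obtained independently, e.g. by a solver or annotator).
    """
    return [
        question
        for question, answers in rephrasings.items()
        if answer_overlap(original_answers, answers) >= threshold
    ]

original_answers = {"12"}
rephrasings = {
    "Find x if 3x = 36.": {"12"},
    # This rewording silently admits a second solution, so it is not equivalent.
    "Find an integer x with x^2 = 144.": {"12", "-12"},
}
print(keep_equivalent(rephrasings, original_answers))
```

Matching answer sets is necessary but not sufficient for semantic equivalence: a rephrasing can keep the same answer while leaking a shortcut, which this check would not catch.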

What would settle it

If applying TA-GRPO to the same training questions as standard GRPO produces no gains in pass@k or no measurable increase in reasoning path diversity, the benefit of the augmentation would be falsified.

Original abstract

Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses across the original and its rephrasings yields mixed rewards and more diverse reasoning paths. TA-GRPO jointly computes advantages over this expanded response set and aligns all importance ratios to the original question, enabling the model to learn from a richer set of solution attempts. Experiments on four LLMs (Qwen3-1.7B, Qwen3-4B, Llama-3.2-1B, Llama-3.2-3B) show that TA-GRPO consistently improves pass@k on competition-level benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution benchmarks (Minerva, GPQA-Diamond). Notably, it improves the average pass@32 of Qwen3-1.7B and Qwen3-4B by 4.97 and 4.34 points, respectively, and matches the exploration quality of baselines trained on up to 2.5× more data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TA-GRPO delivers measurable pass@k gains on math benchmarks by pooling rephrasings into GRPO, but the results rest on an unverified assumption that those rephrasings stay exactly equivalent.

The main things to know: this paper introduces TA-GRPO, which boosts exploration in GRPO by rephrasing questions and pooling responses, and it shows decent empirical gains on reasoning benchmarks, but the approach hinges on unverified semantic equivalence of the rephrasings. What is new is the way the authors augment the response set for advantage computation while keeping importance ratios aligned to the original question. The paper does well in running experiments on four LLMs and reporting improvements on AMC, AIME, OlympiadBench, Minerva, and GPQA. The pass@32 gains of 4.97 and 4.34 points for the Qwen models and the data-efficiency claim are the strongest parts.

The soft spots are moderate. The central assumption, that automatic rephrasings preserve exact meaning without introducing new problems, is not backed by any checks reported in the abstract, so the mixed rewards may not be as clean as claimed. If the full paper has ablations or quality metrics on the rephrasings, that would shore it up; otherwise it is a real concern for the validity of the advantages.

This paper is for researchers focused on RL methods for improving LLM reasoning on math and science tasks. Readers working on GRPO or similar group-based policy optimization would find the implementation details and results useful. It deserves a serious referee given its practical relevance and empirical scope. I would recommend sending it for peer review, expecting questions on how the rephrasings are generated and validated.

Referee Report

2 major / 1 minor

Summary. The paper proposes Transformation-Augmented GRPO (TA-GRPO), an extension of Group Relative Policy Optimization that generates multiple automatically produced rephrasings of each training question. These rephrasings are pooled with the originals to produce mixed rewards and more diverse reasoning paths while aligning importance ratios to the original question. The method is evaluated on four LLMs across competition-level math benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution sets (Minerva, GPQA-Diamond), reporting average pass@32 gains of 4.97 and 4.34 points for Qwen3-1.7B and Qwen3-4B respectively, together with exploration quality matching baselines trained on up to 2.5× more data.

Significance. If the rephrasings preserve semantic equivalence and the advantage computation remains valid, TA-GRPO provides a practical, data-efficient route to improved exploration and reasoning performance in verifiable-reward RL for LLMs. The reported gains on both in-distribution and OOD benchmarks, achieved without extra data, would be a useful contribution to the GRPO literature.

major comments (2)

[Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.
[Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.

minor comments (1)

[Abstract] The abstract claims 'problem-equivalent rephrasings' but does not define the precise criteria used by the automatic generator; a short description or pseudocode in the methods section would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and clarifications that directly respond to the concerns raised.

Referee: [Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.

Authors: We agree that explicit verification of semantic equivalence is important to substantiate the central claims. In the revised manuscript we have added a dedicated subsection in the Methods that reports solution-set overlap statistics computed over a held-out sample of training questions, along with an ablation that removes rephrasings whose generated solutions deviate from the original problem's solution set. These additions confirm that the observed gains arise from increased response diversity rather than reward noise or shortcuts. revision: yes
Referee: [Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.

Authors: We appreciate the request for greater implementation detail. The revised Methods section now includes the exact advantage formula used when responses from both the original question and its rephrasings are pooled, together with the precise alignment of importance ratios to the original question's policy. We explicitly state that no post-hoc filtering of rephrasings was performed; all automatically generated rephrasings were retained. We have also inserted pseudocode for the full TA-GRPO procedure to aid reproducibility. revision: yes
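The ratio alignment described in this exchange can be sketched as follows. This is one reading under stated assumptions, not the authors' formula: it assumes a PPO-style clipped surrogate, a sequence-level log-probability (GRPO implementations typically work per token), and hypothetical scoring callables `score_new` and `score_old` that return the log-probability of a response given a question.

```python
import math

def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single importance ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def aligned_term(score_new, score_old, original_question, response, advantage):
    """Score the response under the ORIGINAL question for both policies.

    Assumption: the response may have been sampled from a rephrasing, but
    both log-probabilities are conditioned on the original question.
    """
    logp_new = score_new(original_question, response)
    logp_old = score_old(original_question, response)
    return clipped_term(logp_new, logp_old, advantage)

# Toy scorers standing in for real model log-probabilities (hypothetical values).
score_old = lambda question, response: -2.0
score_new = lambda question, response: -1.5

term = aligned_term(score_new, score_old,
                    "original question",
                    "response sampled from a rephrasing",
                    advantage=1.0)
print(term)  # ratio exp(0.5) exceeds 1 + clip_eps, so the clipped value is used
```

Under this reading the rephrasings act only as a sampling device: the update itself is always expressed in terms of the original question, which is what keeps the objective tied to the original training distribution.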

Circularity Check

0 steps flagged

No significant circularity; algorithmic method evaluated on external benchmarks

The paper presents TA-GRPO as a direct algorithmic extension to GRPO that pools responses from automatically generated rephrasings and aligns importance ratios to the original question. All reported gains (e.g., +4.97 pass@32 on Qwen3-1.7B) are obtained from evaluation on fixed external public benchmarks (AMC, AIME24/25, Minerva, GPQA-Diamond) rather than from any internal fit or self-referential quantity. No equations reduce the advantages, gradients, or performance metrics to quantities defined inside the same training run, and no load-bearing step relies on a self-citation chain or uniqueness theorem. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that rephrasings maintain semantic identity and alter perceived difficulty. No free parameters or new entities are introduced in the abstract description.

axioms (1)

domain assumption: Automatically generated rephrasings preserve the underlying problem's meaning and validity.
Invoked when stating that rephrasings 'alter wording, format, and information order while preserving the underlying meaning' and shift perceived difficulty.
