
arxiv: 2601.22478 · v5 · pith:EMHCOMD2new · submitted 2026-01-30 · 💻 cs.LG

Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models

Pith reviewed 2026-05-21 14:06 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: GRPO, reinforcement learning, large language models, reasoning, question rephrasing, exploration, policy optimization, LLM fine-tuning

The pith

Augmenting each training question with multiple equivalent rephrasings lets GRPO avoid zero gradients and diversity collapse in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Group Relative Policy Optimization suffers from gradient vanishing when all responses receive identical rewards and from collapse to uniform reasoning patterns. TA-GRPO counters this by automatically generating several problem-equivalent rephrasings of each training question that change wording and structure while keeping the core meaning intact. Pooling responses across original and rephrased questions produces varied rewards and encourages diverse solution strategies. Jointly computing advantages over this larger set and aligning all importance ratios to the original question lets the model benefit from richer training signals. If correct, this yields higher pass rates on difficult reasoning benchmarks using the same data volume.

Core claim

TA-GRPO generates multiple problem-equivalent rephrasings for each training question by altering wording, format, and information order while preserving meaning. These rephrasings shift the model's perceived difficulty, so that sampling responses across the set yields mixed rewards rather than uniform ones. Advantages are computed jointly over the expanded response pool and all importance ratios are aligned to the original question, allowing the policy to learn from a richer collection of solution attempts without changing the underlying reward model.
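The zero-gradient case and the effect of pooling can be illustrated with a minimal sketch. It assumes binary verifiable rewards and the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation). The function name `group_advantages`, the `eps` constant, and the example rewards are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards: the original question is too hard, so every sample fails.
original = [0.0, 0.0, 0.0, 0.0]
# Hypothetical rewards for responses sampled from rephrasings of the same question.
rephrased = [1.0, 0.0, 0.0, 1.0]

# Identical rewards give all-zero advantages, hence no gradient signal.
adv_original = group_advantages(original)
# Pooling mixes successes and failures, so advantages become non-zero.
adv_pooled = group_advantages(original + rephrased)
```

In this toy case the pooled group gives positive advantages to the two successful responses and negative advantages to the rest, which is the mixed-reward effect the summary describes.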

What carries the argument

Transformation-Augmented GRPO, which augments each training question with automatically generated semantically equivalent rephrasings to produce mixed rewards and diverse reasoning paths for advantage computation.

If this is right

  • Improves average pass@32 scores by 4.97 points on Qwen3-1.7B and 4.34 points on Qwen3-4B across competition math and science benchmarks.
  • Achieves exploration quality comparable to baselines trained on up to 2.5 times more data.
  • Delivers consistent gains on both in-distribution benchmarks like AIME and out-of-distribution ones like GPQA-Diamond.
  • Applies successfully to multiple model families including Qwen3 and Llama-3.2 series.
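The source does not say how pass@k was estimated. As a point of reference, a widely used choice is the unbiased estimator computed from n samples per problem, of which c are correct; the sketch below assumes that estimator, and the name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: number of correct samples, k: budget.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of `pass_at_k` over all problems, so a gain of a few points at k = 32 reflects more problems being solved by at least one of 32 attempts.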

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rephrasing method could reduce reliance on curating large and varied training datasets by creating synthetic variety on the fly.
  • Similar transformations might enhance other reward-based optimization techniques that suffer from sparse or uniform signals in language model alignment.
  • Extending the approach to non-verifiable reward settings or to long-horizon planning tasks would test its broader utility beyond math reasoning.

Load-bearing premise

The generated rephrasings preserve exact semantic equivalence and meaningfully shift the model's perceived difficulty without introducing new ambiguities or solution shortcuts that would invalidate the reward signal.

What would settle it

If applying TA-GRPO to the same training questions as standard GRPO produces no gain in pass@k and no measurable increase in reasoning path diversity, the claimed benefit of the augmentation would be refuted.

Original abstract

Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses across the original and its rephrasings yields mixed rewards and more diverse reasoning paths. TA-GRPO jointly computes advantages over this expanded response set and aligns all importance ratios to the original question, enabling the model to learn from a richer set of solution attempts. Experiments on four LLMs (Qwen3-1.7B, Qwen3-4B, Llama-3.2-1B, Llama-3.2-3B) show that TA-GRPO consistently improves pass@k on competition-level benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution benchmarks (Minerva, GPQA-Diamond). Notably, it improves the average pass@32 of Qwen3-1.7B and Qwen3-4B by 4.97 and 4.34 points, respectively, and matches the exploration quality of baselines trained on up to 2.5× more data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Transformation-Augmented GRPO (TA-GRPO), an extension of Group Relative Policy Optimization that generates multiple automatically produced rephrasings of each training question. These rephrasings are pooled with the originals to produce mixed rewards and more diverse reasoning paths while aligning importance ratios to the original question. The method is evaluated on four LLMs across competition-level math benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution sets (Minerva, GPQA-Diamond), reporting average pass@32 gains of 4.97 and 4.34 points for Qwen3-1.7B and Qwen3-4B respectively, together with exploration quality matching baselines trained on up to 2.5× more data.

Significance. If the rephrasings preserve semantic equivalence and the advantage computation remains valid, TA-GRPO provides a practical, data-efficient route to improved exploration and reasoning performance in verifiable-reward RL for LLMs. The reported gains on both in-distribution and OOD benchmarks, achieved without extra data, would be a useful contribution to the GRPO literature.

major comments (2)
  1. [Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.
  2. [Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.
minor comments (1)
  1. [Abstract] The abstract claims 'problem-equivalent rephrasings' but does not define the precise criteria used by the automatic generator; a short description or pseudocode in the methods section would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and clarifications that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The validity of the reported advantages and non-vanishing gradients rests on every rephrasing being exactly semantically equivalent to the original (identical solution set, no new ambiguities or shortcuts). The abstract states that rephrasings are generated automatically to alter wording, format, and information order, yet supplies no quantitative verification of equivalence (e.g., solution-set overlap statistics, expert ratings, or an ablation that removes low-quality rephrasings). This check is load-bearing for the central claim that gains arise from genuine diversity rather than noisy or shortcut-laden rewards.

    Authors: We agree that explicit verification of semantic equivalence is important to substantiate the central claims. In the revised manuscript we have added a dedicated subsection in the Methods that reports solution-set overlap statistics computed over a held-out sample of training questions, along with an ablation that removes rephrasings whose generated solutions deviate from the original problem's solution set. These additions confirm that the observed gains arise from increased response diversity rather than reward noise or shortcuts.
    revision: yes

  2. Referee: [Abstract] The manuscript provides limited detail on the exact advantage computation when responses are pooled across originals and rephrasings, and on whether any post-hoc filtering of rephrasings was applied. Without these specifics it is difficult to reproduce the mixed-reward mechanism or to assess whether the importance-ratio alignment fully preserves the GRPO objective.

    Authors: We appreciate the request for greater implementation detail. The revised Methods section now includes the exact advantage formula used when responses from both the original question and its rephrasings are pooled, together with the precise alignment of importance ratios to the original question's policy. We explicitly state that no post-hoc filtering of rephrasings was performed; all automatically generated rephrasings were retained. We have also inserted pseudocode for the full TA-GRPO procedure to aid reproducibility.
    revision: yes
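One possible reading of "aligning importance ratios to the original question" can be sketched as follows. This is not the authors' pseudocode: it assumes that every response, including those sampled from a rephrasing, is scored by the current and the sampling-time policy conditioned on the original question, and it simplifies to a sequence-level ratio with a PPO-style clip. The name `clipped_term`, its parameters, and the clip value 0.2 are hypothetical.

```python
import math

def clipped_term(logp_new, logp_old, advantage, clip=0.2):
    """Clipped surrogate term for one response.

    logp_new / logp_old: log-probability of the response under the current /
    sampling-time policy, both conditioned on the ORIGINAL question
    (an assumption of this sketch, even if the response was sampled
    from a rephrasing).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

Under this reading, the rephrasings only affect which responses are sampled and how advantages are normalized, while the policy update itself is always expressed in terms of the original prompt.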

Circularity Check

0 steps flagged

No significant circularity; algorithmic method evaluated on external benchmarks

Full rationale

The paper presents TA-GRPO as a direct algorithmic extension to GRPO that pools responses from automatically generated rephrasings and aligns importance ratios to the original question. All reported gains (e.g., +4.97 pass@32 on Qwen3-1.7B) are obtained from evaluation on fixed external public benchmarks (AMC, AIME24/25, Minerva, GPQA-Diamond) rather than from any internal fit or self-referential quantity. No equations reduce the advantages, gradients, or performance metrics to quantities defined inside the same training run, and no load-bearing step relies on a self-citation chain or uniqueness theorem. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that rephrasings maintain semantic identity and alter perceived difficulty. No free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Automatically generated rephrasings preserve underlying problem meaning and validity.
    Invoked when stating that rephrasings 'alter wording, format, and information order while preserving the underlying meaning' and shift perceived difficulty.

