ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

Bryan Kian Hsiang Low; Daniel Dahlmeier; Junxiang Jia; See-kiong Ng; Wenyang Hu; Zhen Shu

arxiv: 2606.24994 · v1 · pith:PP2VIP7Anew · submitted 2026-06-23 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-26 00:21 UTC · model grok-4.3


keywords: ExTra · Exploratory Trajectory Optimization · GRPO · reinforcement learning · language models · exploration · mathematical reasoning · verifiable rewards


The pith

ExTra adds embedding novelty rewards and entropy-guided prefix regeneration to GRPO, lifting math reasoning scores for Qwen3-1.7B by about five points on pass@1 and seven points on pass@16.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ExTra to address two failure modes in reinforcement learning with verifiable rewards for language models. Easy prompts produce uniform correct rollouts that supply little gradient signal, while hard prompts produce uniform incorrect rollouts that supply no positive reward. ExTra extracts two exploration signals directly from the model's own rollouts: an embedding-based novelty bonus that rewards diverse correct answers after GRPO normalization, and entropy-guided regeneration that restarts from promising intermediate prefixes. These additions produce measurable gains on standard benchmarks. A reader would care because the method improves both single-answer accuracy and the variety of solutions found at inference time without requiring external data or changes to the base optimizer.

Core claim

ExTra is a GRPO-compatible framework that extracts exploration signals from the model's own rollouts. It combines a novelty reward that adds embedding-based diversity bonuses after GRPO normalization to reward diverse correct solutions, and entropy-guided prefix regeneration that scores partial trajectories by entropy and continues exploration from promising intermediate steps. Across six mathematical reasoning benchmarks this produces roughly five-point gains on pass@1 and seven-point gains on pass@16 for Qwen3-1.7B.
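
As a rough illustration of the first mechanism, the sketch below standardizes rewards within one rollout group and then adds a diversity bonus to correct rollouts only after that normalization. The cosine-based novelty score, the weight `lam`, and all function names are assumptions made for this example; the summary above does not specify the paper's embedding model or bonus formula.

```python
import math
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (group-relative advantage)."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def novelty_scores(embeddings):
    """Hypothetical novelty: 1 minus mean cosine similarity to the other rollouts."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(1.0 - fmean(sims))
    return scores

def advantages_with_novelty(rewards, embeddings, lam=0.1):
    """Add the bonus after normalization, and only to correct rollouts (assumption)."""
    adv = grpo_advantages(rewards)
    nov = novelty_scores(embeddings)
    return [a + lam * n if r > 0 else a for a, n, r in zip(adv, nov, rewards)]

# Toy all-correct group: two rollouts share an embedding, the third differs.
rewards = [1.0, 1.0, 1.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
plain = grpo_advantages(rewards)
adv = advantages_with_novelty(rewards, embeddings)
```

In this toy group the plain group-normalized advantages are all zero, which is the "little gradient signal" case for easy prompts. With the bonus, the rollout with the distinct embedding receives the largest advantage (0.1 versus 0.05 for the two identical ones).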

What carries the argument

ExTra's novelty reward, which adds embedding-based diversity bonuses after GRPO normalization, and its entropy-guided prefix regeneration, which restarts from the lowest-entropy partial trajectories.
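
The second mechanism can be pictured with a small sketch: score each candidate prefix by the mean entropy of its next-token distributions and keep the one with the lowest score, as the Figure 1 caption describes. How prefixes are cut from rollouts and re-queued is not stated in this summary, so the dictionary layout, the function names, and the toy distributions below are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_token_entropy(step_distributions):
    """Average entropy over the token positions of one prefix."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def pick_prefix(prefixes):
    """Return the key of the prefix with the lowest mean token entropy."""
    return min(prefixes, key=lambda name: mean_token_entropy(prefixes[name]))

# Made-up next-token distributions over a 3-token vocabulary, for illustration only.
prefixes = {
    "confident": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    "uncertain": [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]],
}
best = pick_prefix(prefixes)
```

Here `best` is `"confident"`, because its distributions are more peaked and therefore have lower entropy at every position.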

If this is right

Single-sample accuracy on mathematical reasoning tasks increases.
Inference-time coverage of distinct correct solutions increases.
The framework remains compatible with existing GRPO training pipelines.
Exploration bonuses are derived entirely from the model's rollouts without external data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rollout-derived signals could be tested in reinforcement learning setups outside mathematical reasoning.
If the signals prove stable, they might reduce the performance gap between smaller and larger models on verifiable-reward tasks.
Measuring whether the novelty and entropy bonuses remain effective when the base model changes would test robustness.

Load-bearing premise

The embedding-based novelty signal and entropy scores from the model's own rollouts supply useful, non-spurious exploration bonuses that improve learning rather than merely increasing variance.

What would settle it

An ablation that removes the novelty reward and the entropy-guided regeneration and then finds pass@1 and pass@16 unchanged or higher on the six benchmarks would falsify the claim that these signals drive the observed gains.

Figures

Figures reproduced from arXiv: 2606.24994 by Bryan Kian Hsiang Low, Daniel Dahlmeier, Junxiang Jia, See-kiong Ng, Wenyang Hu, Zhen Shu.

**Figure 1.** Overview of ExTra. Top: On easy problems, a novelty reward is added to GRPO’s normalized advantage, steering the policy toward diverse correct reasoning strategies. Bottom: On hard problems, all rollout prefixes are scored by Mean Token Entropy; the lowest-entropy prefix is re-queued as a guided prompt for the next batch, encouraging exploration from a promising intermediate reasoning step. and it uses a s…

**Figure 2.** Training dynamics of policy entropy and nov…

**Figure 3.** Average pass@16 against the prompt budget at step 250. ExTra reaches 73.2% average pass@16 using 136k generated prompt instances. DAPO uses 392k prompt instances but reaches only 63.6%, while GRPO reaches 66.4% with 128k prompt instances. Thus, ExTra improves over GRPO at a similar rollout cost and outperforms DAPO on both accuracy and sample efficiency. (x-axis: Cumulative generated pr…)
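
The numbers quoted in the Figure 3 text can be compared directly. The short calculation below uses only those three reported pairs; the `runs` dictionary and the `compare` helper are names invented for this example.

```python
# Values quoted in the Figure 3 text: (average pass@16 in %, prompt instances in thousands).
runs = {"ExTra": (73.2, 136), "GRPO": (66.4, 128), "DAPO": (63.6, 392)}

def compare(a, b):
    """Return (accuracy gap in points, prompt-budget ratio) of run a relative to run b."""
    (acc_a, n_a), (acc_b, n_b) = runs[a], runs[b]
    return round(acc_a - acc_b, 1), round(n_a / n_b, 2)

vs_grpo = compare("ExTra", "GRPO")  # (6.8, 1.06)
vs_dapo = compare("ExTra", "DAPO")  # (9.6, 0.35)
```

So ExTra's lead over GRPO is 6.8 points at roughly 6% more prompt instances, and its lead over DAPO is 9.6 points at about a third of DAPO's prompt budget.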

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal, while hard prompts can produce all-incorrect groups with no positive reward. We introduce ExTra (Exploratory Trajectory Optimization), a GRPO-compatible framework that extracts exploration signals from the model's own rollouts. ExTra combines two mechanisms: (i) a novelty reward that adds embedding-based diversity bonuses after GRPO normalization, rewarding diverse correct solutions; and (ii) entropy-guided prefix regeneration, which scores partial trajectories using entropy signals and continues exploration from promising intermediate steps. Across six mathematical reasoning benchmarks, ExTra improves Qwen3-1.7B over GRPO by about +5 points on pass@1 and +7 points on pass@16, showing that trajectory-level exploration signals can improve both single-sample accuracy and inference-time coverage.
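
For readers unfamiliar with the metric, pass@k is commonly estimated from n samples per problem, c of which are correct, as 1 − C(n−c, k)/C(n, k). The sketch below implements that widely used estimator; whether the paper computes pass@1 and pass@16 this way is an assumption, since the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

single = pass_at_k(16, 4, 1)     # 0.25, the plain fraction of correct samples
coverage = pass_at_k(16, 4, 16)  # 1.0, at least one of the 16 samples is correct
```

The example shows why the two metrics can move separately: pass@1 tracks the fraction of correct samples, while pass@16 only asks whether any sample in the group succeeds.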

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ExTra layers embedding novelty bonuses and entropy-guided prefix regeneration onto GRPO for better exploration in LLM math RL, but the gains rest on thin evidence without ablations or controls.


ExTra adds two mechanisms on top of GRPO: an embedding-based novelty reward applied after normalization to encourage diverse correct solutions, and entropy-guided regeneration that restarts from promising low-entropy prefixes. These target the sparsity problem where easy prompts yield uniformly correct rollouts and hard ones yield no correct rollouts at all.

The paper does a clean job of spelling out the practical fixes and reports concrete gains of about +5 pass@1 and +7 pass@16 on six math benchmarks with Qwen3-1.7B. The entropy regeneration idea in particular looks like a straightforward way to keep exploring from intermediate states instead of discarding partial trajectories.

The soft spots are the missing controls. The abstract supplies no run counts, variance numbers, or ablations that turn the novelty bonus or the regeneration off separately. The stress-test concern lands: if the embeddings come from the policy model itself, the bonus may simply track already-likely trajectories rather than add independent diversity, and nothing in the description rules out that a random or variance-boosting term would produce similar numbers. Without those checks, attribution stays shaky.

This is for groups already running GRPO on reasoning tasks who want ready-to-try exploration tweaks. A reader looking for implementable ideas will find value, but anyone needing solid causal evidence will want the full experiments expanded. It deserves peer review because the problem is real and the mechanisms are clearly defined, even though the current results need more rigor to hold up.

Referee Report

3 major / 2 minor

Summary. The paper introduces ExTra, a GRPO-compatible framework for RLVR in language models that addresses low-diversity rollouts on easy prompts and zero-reward groups on hard prompts. It adds two mechanisms: (i) a novelty reward that applies embedding-based diversity bonuses after GRPO normalization to reward diverse correct solutions, and (ii) entropy-guided prefix regeneration that scores partial trajectories and restarts exploration from promising prefixes. On six mathematical reasoning benchmarks, ExTra improves Qwen3-1.7B over GRPO by roughly +5 points pass@1 and +7 points pass@16.

Significance. If the gains prove robust and causally attributable to the two signals, the work supplies a lightweight, rollout-internal method for improving both training signal and inference coverage in reasoning RL. The approach of extracting novelty and entropy signals directly from the policy's own trajectories is a practical strength that avoids external models or additional parameters.

major comments (3)

[Experimental results / §4] Experimental section (results tables and §4): the central empirical claim of +5 pass@1 / +7 pass@16 gains is presented without reported run counts, standard deviations, number of random seeds, or statistical tests. This prevents verification that the improvements exceed baseline variance and is load-bearing for the attribution to ExTra.
[§3.1] §3.1 (novelty reward definition): the embedding-based bonus is added after GRPO normalization and described as rewarding diverse correct solutions, yet no ablation or control (e.g., random bonus of matched magnitude, or lexical-only diversity) is reported. Without this, it is impossible to rule out that any variance-increasing additive term would produce comparable gains, undermining the claim that the embedding signal is the causal mechanism.
[§3.2] §3.2 (entropy-guided prefix regeneration): the method scores partial trajectories with entropy and regenerates from promising prefixes, but the paper supplies no comparison isolating this component from the novelty reward alone, nor any analysis of how often regeneration is triggered or its effect on gradient variance.
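
One way to build the matched-magnitude random control requested in the §3.1 comment is to permute the novelty bonuses within each rollout group. This keeps exactly the same bonus values while breaking their link to the content of each rollout. This is a sketch of one possible control, not something the paper reports; `shuffled_bonus` and the sample values are hypothetical.

```python
import random

def shuffled_bonus(bonuses, rng):
    """Permute bonuses within a group: same magnitudes, no link to rollout content."""
    out = list(bonuses)
    rng.shuffle(out)
    return out

rng = random.Random(0)               # fixed seed so the control is reproducible
novelty = [0.05, 0.05, 0.10, 0.0]    # made-up per-rollout bonuses for one group
control = shuffled_bonus(novelty, rng)
```

If training with `control` in place of `novelty` yielded similar gains, that would suggest the improvement comes from the added variance rather than from the embedding signal. With small groups or repeated values a permutation can leave some bonuses in place, so the comparison is only meaningful when aggregated over many groups.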

minor comments (2)

[§3] Notation for the combined reward (Eq. in §3) should explicitly state whether the novelty term is normalized per group or globally, and whether it is applied only to correct trajectories.
[Results] The abstract mentions six benchmarks without naming them; the main text should include a table with per-benchmark pass@1 and pass@16 numbers plus GRPO baselines for direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical validation and providing necessary ablations. We address each major comment below and will revise the manuscript accordingly.


Referee: [Experimental results / §4] Experimental section (results tables and §4): the central empirical claim of +5 pass@1 / +7 pass@16 gains is presented without reported run counts, standard deviations, number of random seeds, or statistical tests. This prevents verification that the improvements exceed baseline variance and is load-bearing for the attribution to ExTra.

Authors: We agree that the absence of run statistics limits verification of the gains. In the revised manuscript we will report results over multiple random seeds (with the exact count specified), include standard deviations, and add statistical significance tests to confirm the improvements exceed baseline variance. revision: yes

Referee: [§3.1] §3.1 (novelty reward definition): the embedding-based bonus is added after GRPO normalization and described as rewarding diverse correct solutions, yet no ablation or control (e.g., random bonus of matched magnitude, or lexical-only diversity) is reported. Without this, it is impossible to rule out that any variance-increasing additive term would produce comparable gains, undermining the claim that the embedding signal is the causal mechanism.

Authors: We acknowledge that controls are required to establish causality for the embedding signal. We will add ablations in the revision comparing the embedding novelty reward against a random bonus of matched magnitude and a lexical-only diversity baseline. revision: yes

Referee: [§3.2] §3.2 (entropy-guided prefix regeneration): the method scores partial trajectories with entropy and regenerates from promising prefixes, but the paper supplies no comparison isolating this component from the novelty reward alone, nor any analysis of how often regeneration is triggered or its effect on gradient variance.

Authors: We will add an ablation isolating the entropy-guided prefix regeneration from the novelty reward alone. We will also report regeneration trigger frequency and analyze its impact on gradient variance. revision: yes
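
The seed-level reporting promised in the responses could look like the sketch below: mean and standard deviation across seeds, plus an exact paired sign-flip test on per-seed differences. The choice of test, the function names, and above all the scores are assumptions for illustration; the numbers are placeholders and are not results from the paper.

```python
from itertools import product
from statistics import fmean, stdev

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return fmean(scores), stdev(scores)

def sign_flip_p_value(a, b):
    """Exact two-sided paired sign-flip test on per-seed differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(fmean(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(fmean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder per-seed pass@1 scores, invented for illustration; NOT from the paper.
method = [0.52, 0.55, 0.53, 0.56, 0.54]
baseline = [0.48, 0.50, 0.49, 0.51, 0.50]
p = sign_flip_p_value(method, baseline)  # 0.0625
```

Even when the method wins on every seed, five seeds cannot push this test below 2/32 = 0.0625, which is one reason the exact seed count matters when significance is reported.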

Circularity Check

0 steps flagged

No circularity: empirical method with direct benchmark comparisons


The paper describes ExTra as a GRPO-compatible framework using embedding-based novelty rewards and entropy-guided prefix regeneration extracted from rollouts. It reports empirical gains (+5 pass@1, +7 pass@16) on six math benchmarks versus GRPO baseline. No derivation chain, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling is present. The central claims rest on experimental results rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no explicit free parameters, axioms, or invented entities; standard RL assumptions and embedding similarity are implicit but not itemized.


