Trajectory First: A Curriculum for Discovering Diverse Policies

Cornelius V. Braun; Marc Toussaint; Sayantan Auddy

arxiv: 2506.01568 · v3 · submitted 2025-06-02 · 💻 cs.LG · cs.RO

Pith reviewed 2026-05-19 11:05 UTC · model grok-4.3

keywords: constrained diversity optimization · reinforcement learning · curriculum learning · trajectory optimization · policy distillation · robot manipulation · diverse policies · exploration bias

The pith

A two-stage curriculum first optimizes diverse spline-based trajectories then distills them into reactive policies to raise behavioral variety while preserving task rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a curriculum for constrained diversity reinforcement learning that tackles under-exploration in hard tasks such as robot manipulation. In the first stage a spline-based trajectory prior acts as an inductive bias to generate many different high-reward behaviors. The second stage distills those behaviors into ordinary step-by-step reactive policies. A sympathetic reader would value the result because agents that can solve the same task in multiple ways become more robust to changes and less likely to get stuck in poor solutions.

Core claim

The authors claim that their curriculum increases the diversity of learned skills while maintaining high task performance. They achieve this by first using a spline-based trajectory prior to produce diverse high-reward behaviors and then distilling those behaviors into reactive policies. The evaluation also offers new insights into the difficulties of diversity-targeted training.

What carries the argument

The spline-based trajectory prior, an inductive bias that guides the first stage toward varied high-reward trajectories before they are turned into reactive policies.
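
To make the idea of a trajectory prior concrete, the following is a minimal sketch in which a whole behavior is described by a handful of waypoints and interpolated with a uniform Catmull-Rom spline. The choice of Catmull-Rom, the names `catmull_rom` and `spline_trajectory`, and the `steps_per_segment` parameter are illustrative assumptions; this summary does not say which spline family or parameterization the paper uses.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate one uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def spline_trajectory(waypoints, steps_per_segment=10):
    """Interpolate a smooth path that passes through every waypoint.

    `waypoints` is a sequence of equal-length tuples (hypothetical
    end-effector positions). The first and last waypoints are duplicated
    so that the curve starts and ends exactly on them.
    """
    pts = [waypoints[0]] + list(waypoints) + [waypoints[-1]]
    traj = []
    for i in range(1, len(pts) - 2):
        for k in range(steps_per_segment):
            t = k / steps_per_segment
            traj.append(catmull_rom(pts[i - 1], pts[i], pts[i + 1], pts[i + 2], t))
    traj.append(tuple(waypoints[-1]))  # close the final segment
    return traj
```

Under a parameterization of this kind, a diversity search would adjust a few waypoints instead of thousands of per-step actions, which is one plausible way a temporal inductive bias could arise.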

If this is right

Agents become more robust to task variations because they possess multiple distinct solutions.
Exploration improves in complex continuous-control domains such as manipulation.
Distilled reactive policies retain most of the diversity and reward achieved in the trajectory stage.
Training reveals concrete challenges that arise when diversity objectives are applied directly to reactive policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-first idea could be tested on non-manipulation tasks such as navigation or game playing.
The curriculum might combine with other exploration bonuses to further enlarge the set of discovered behaviors.
If the distillation step preserves diversity, the method could support lifelong learning where new skills are added without overwriting old ones.

Load-bearing premise

The spline-based trajectory prior supplies a useful inductive bias for exploration and the discovered behaviors survive distillation into reactive policies without large losses in diversity or reward.

What would settle it

An experiment that measures behavioral diversity and task reward on a robot manipulation benchmark and finds either no statistically significant gain in diversity or a clear drop in reward when the two-stage curriculum is used in place of a standard single-stage constrained-diversity method.

Original abstract

Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has become a useful reinforcement learning (RL) framework for training a set of diverse agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robot manipulation, resulting in limited behavioral diversity. We address this with a two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies in a second stage. In our empirical evaluation, we provide novel insights into challenges of diversity-targeted training and show that our curriculum increases the diversity of learned skills while maintaining high task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The two-stage curriculum, which uses spline trajectory priors for initial diversity and then distills the result into reactive policies, is a practical tweak for constrained RL in manipulation, but the transfer step is the weakest link.

The core idea is straightforward: run diversity optimization first under a spline-based trajectory prior to surface varied high-reward behaviors in complex tasks, then distill those into ordinary step-wise policies. This ordering gives the search an explicit temporal bias that standard constrained-diversity methods lack, which matches the practical problem of under-exploration in robot manipulation.
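
The second stage can be illustrated with a toy behavior-cloning sketch: an open-loop position sequence is turned into (state, action) pairs, and a Markovian policy that maps the current state to an action is fitted to them. Everything here is an assumption made for illustration (the 1-D state, the linear policy, least-squares fitting, and the names `demo_to_pairs`, `fit_linear_policy`, and `rollout`); this summary does not describe the distillation procedure the paper actually uses.

```python
def demo_to_pairs(positions):
    """Turn an open-loop 1-D position sequence into (state, action) pairs.

    The action at each step is taken to be the displacement to the
    next position (an illustrative convention).
    """
    states = list(positions[:-1])
    actions = [nxt - cur for cur, nxt in zip(positions, positions[1:])]
    return states, actions


def fit_linear_policy(states, actions):
    """Closed-form least-squares fit of action = w * state + b."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


def rollout(w, b, start, steps):
    """Run the reactive policy: each action depends only on the current state."""
    x = start
    xs = [x]
    for _ in range(steps):
        x = x + (w * x + b)
        xs.append(x)
    return xs
```

A policy of this form has no memory, so two demonstrations that pass through the same state with different actions cannot both be reproduced by it. That is the kind of collapse the transfer step has to avoid.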

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage curriculum for constrained diversity optimization in RL. Stage 1 uses a spline-based trajectory prior as an inductive bias to discover diverse high-reward behaviors in complex tasks such as robot manipulation; stage 2 distills these into reactive step-wise policies. The central empirical claim is that the curriculum increases behavioral diversity of the final learned skills while maintaining high task performance, with additional insights into challenges of diversity-targeted training.

Significance. If the empirical claims hold with rigorous validation, the work offers a practical inductive bias for improving exploration in diversity-constrained RL, where existing methods often under-explore. The trajectory-first approach could generalize to other sequential decision tasks requiring robustness to variations, provided the distillation step reliably transfers diversity.

major comments (2)

§3.2 (Distillation stage): The claim that spline-trajectory behaviors can be distilled into reactive policies without major loss of diversity or reward is load-bearing for the headline result, yet the manuscript provides no direct pre/post-distillation comparison of diversity metrics (e.g., state visitation entropy or trajectory variance) on the same task instances. Because spline priors encode explicit non-Markovian temporal structure while the final policies are strictly Markovian, this transfer risks collapsing distinct behaviors; a quantitative ablation isolating the distillation step is required to substantiate the curriculum's net benefit.
Experiments section (quantitative results): The abstract states empirical gains in diversity while preserving performance, but the reported tables/figures lack sufficient baselines (e.g., standard constrained-diversity RL methods without the curriculum), ablations on the spline prior, and statistical significance tests across multiple seeds. Without these, it is impossible to verify that the observed diversity increase is attributable to the proposed curriculum rather than task-specific tuning or metric choice.
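
The pre/post-distillation comparison requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: states are 1-D, visitation entropy is estimated from a fixed-width histogram, and trajectory variance is the across-trajectory variance averaged over time steps. The function names and the `bin_width` parameter are hypothetical, and the paper's own metric definitions may differ.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def visitation_entropy(trajectories, bin_width=0.1):
    """Shannon entropy (in nats) of a histogram over all visited 1-D states."""
    counts = Counter(
        math.floor(x / bin_width) for traj in trajectories for x in traj
    )
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def trajectory_variance(trajectories):
    """Across-trajectory population variance, averaged over time steps.

    Assumes all trajectories have the same length.
    """
    return mean(pvariance(step) for step in zip(*trajectories))
```

Evaluating both functions on the stage-one spline trajectories and again on rollouts of the distilled policies, for the same task instances, would give the paired numbers the referee asks for. A large drop in either value after distillation would indicate collapse.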

minor comments (2)

Clarify the exact diversity metric used (e.g., is it mutual information, Wasserstein distance on trajectories, or something else?) and how it is computed for both trajectory and policy stages.
Figure captions and axis labels should explicitly state the number of runs, random seeds, and error bars to allow reproducibility assessment.
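
On the first minor comment, one candidate metric is hinted at by a quoted paper passage elsewhere in this review that mentions a minimum distance to the nearest neighbor. The sketch below shows what such a score could look like for a set of behavior descriptors. The Euclidean distance, the descriptor representation, and the names `nearest_neighbor_distances` and `min_nn_diversity` are assumptions, since the passage is elided and the exact definition is not given here.

```python
import math


def nearest_neighbor_distances(descriptors):
    """For each descriptor, the Euclidean distance to its closest other descriptor."""
    return [
        min(math.dist(x, y) for j, y in enumerate(descriptors) if j != i)
        for i, x in enumerate(descriptors)
    ]


def min_nn_diversity(descriptors):
    """Diversity score: the smallest nearest-neighbor distance in the set.

    The score is high only if every behavior is far from all others,
    so a single near-duplicate pair pulls it toward zero.
    """
    return min(nearest_neighbor_distances(descriptors))
```

Whatever metric the paper uses, stating it in this explicit form for both the trajectory stage and the policy stage would make the two sets of numbers directly comparable.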

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our paper. We have revised the manuscript to address the concerns regarding the distillation stage and the experimental results. Our point-by-point responses are as follows.

Referee: §3.2 (Distillation stage): The claim that spline-trajectory behaviors can be distilled into reactive policies without major loss of diversity or reward is load-bearing for the headline result, yet the manuscript provides no direct pre/post-distillation comparison of diversity metrics (e.g., state visitation entropy or trajectory variance) on the same task instances. Because spline priors encode explicit non-Markovian temporal structure while the final policies are strictly Markovian, this transfer risks collapsing distinct behaviors; a quantitative ablation isolating the distillation step is required to substantiate the curriculum's net benefit.

Authors: We concur that a direct comparison of diversity metrics pre- and post-distillation is essential to validate the transfer of behaviors. Accordingly, we have added a dedicated ablation in the revised manuscript (new Section 4.4) that reports state visitation entropy and trajectory variance for the trajectory priors and the distilled policies on the same manipulation tasks. The results indicate that diversity is preserved to a large extent, with the curriculum enabling the discovery of behaviors that remain distinct even after distillation to Markovian policies. We discuss the role of the spline prior in mitigating potential collapse.
Revision: yes
Referee: Experiments section (quantitative results): The abstract states empirical gains in diversity while preserving performance, but the reported tables/figures lack sufficient baselines (e.g., standard constrained-diversity RL methods without the curriculum), ablations on the spline prior, and statistical significance tests across multiple seeds. Without these, it is impossible to verify that the observed diversity increase is attributable to the proposed curriculum rather than task-specific tuning or metric choice.

Authors: We acknowledge that additional baselines and rigorous statistical analysis would strengthen the empirical claims. In the revised manuscript, we have incorporated comparisons against standard constrained-diversity RL approaches that do not utilize the trajectory curriculum. We have also included an ablation study that removes the spline-based prior to quantify its impact. All experiments now include results averaged over 5 independent random seeds, with standard deviations and p-values from statistical tests to establish significance. These updates confirm the attribution of diversity improvements to our proposed method.
Revision: yes
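
With only 5 seeds per method, an exact permutation test is one simple way to obtain the kind of p-value the simulated rebuttal mentions. The sketch below is an assumption about how such a test could be run, not a description of the test the authors used; the function name `permutation_p_value` and the one-sided formulation are illustrative choices.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact one-sided p-value for the hypothesis mean(a) > mean(b).

    Enumerates every way of relabeling the pooled scores into groups of
    the original sizes and counts how often the mean difference is at
    least as large as the observed one.
    """
    pooled = list(a) + list(b)
    observed = mean(a) - mean(b)
    indices = range(len(pooled))
    count = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance so float rounding does not drop exact ties.
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total
```

For two groups of five seeds there are 252 relabelings, so the smallest attainable one-sided p-value is 1/252 (about 0.004). It is reached only when every curriculum score exceeds every baseline score.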

Circularity Check

0 steps flagged

No significant circularity; empirical curriculum is self-contained

The paper presents a two-stage empirical curriculum that first uses a spline-based trajectory prior to generate diverse high-reward behaviors and then distills them into reactive policies. No equations, derivations, or first-principles results are claimed that reduce to the inputs by construction. The central claim of increased diversity with maintained performance rests on experimental evaluation rather than fitted parameters renamed as predictions or load-bearing self-citations. The spline prior is introduced as an inductive bias for exploration, and the distillation step is validated through task performance metrics without self-referential fitting. This is a standard empirical RL contribution with no evident circularity in its derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not specify numerical free parameters, background axioms, or new postulated entities; the spline prior is described as an inductive bias rather than a fitted construct.

pith-pipeline@v0.9.0 · 5661 in / 898 out tokens · 40834 ms · 2026-05-19T11:05:20.373685+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Constrained Novelty Search (CNS) ... Hparticle(X) ... min distance to nearest neighbor"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.