D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss

Guowei Zou; Haitao Wang; Hejun Wu; Weibing Li; Yuhang Wang; Yukun Qian

arxiv: 2508.02644 · v1 · submitted 2025-08-04 · 💻 cs.AI


Pith reviewed 2026-05-19 00:40 UTC · model grok-4.3


keywords diffusion policy · dispersive loss · representation collapse · robotic manipulation · policy optimization · discriminative representations · reinforcement learning


The pith

Dispersive loss on batch representations prevents collapse in diffusion policies and sharpens robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion policies map similar observations to nearly identical internal features, which limits their ability to respond to the small but decisive differences that complex manipulation requires. The paper proposes D2PPO, which adds a dispersive loss that treats every pair of hidden representations inside a training batch as a negative pair. This pushes the network to produce more distinct features even for visually close situations. On RoboMimic benchmarks the paper reports average improvements of 22.7% in pre-training and 26.1% after fine-tuning, with the largest gains on harder tasks. Real-robot trials on a Franka Emika Panda arm further indicate higher success rates when the manipulation involves subtle variations.

Core claim

Diffusion policies suffer from representation collapse in which semantically similar observations produce indistinguishable features and therefore impair precise control. D2PPO counters this by adding dispersive loss regularization that treats all hidden representations within each batch as negative pairs, compelling the network to learn discriminative representations of similar observations and thereby enabling the policy to detect and act on the subtle differences needed for complex manipulation.

What carries the argument

Dispersive loss regularization that treats all hidden representations within each batch as negative pairs.
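
To make the mechanism concrete, here is a minimal sketch of a batch-wise dispersive term of the form log E over pairs (i, j) of exp(−D(z_i, z_j)/τ), the formula quoted from the paper further down this page. Several details are assumptions made only for illustration and are not confirmed by the source: D is taken to be squared Euclidean distance, pairs with i = j are excluded, the temperature defaults to 0.5, and the function names are hypothetical. A real implementation would operate on tensors inside the training loop rather than on Python lists.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed choice of D)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersive_loss(batch, tau=0.5):
    """Log of the mean of exp(-D(z_i, z_j) / tau) over all ordered pairs i != j.

    Assumptions (not from the source): D is squared L2 distance, self-pairs
    are excluded, and tau defaults to 0.5.
    """
    n = len(batch)
    if n < 2:
        raise ValueError("need at least two representations in the batch")
    terms = [
        math.exp(-sq_dist(batch[i], batch[j]) / tau)
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return math.log(sum(terms) / len(terms))


# A fully collapsed batch (identical features) versus a spread-out batch.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(dispersive_loss(collapsed))  # the maximum of this term: all distances are zero
print(dispersive_loss(spread))     # lower, because the features are farther apart
```

Minimizing this term rewards larger pairwise distances, which is the sense in which every other batch member acts as a negative. No positive pairs are needed, so the term can be attached to any intermediate layer.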

If this is right

Early-layer application of the loss aids simpler tasks while late-layer application improves complex manipulation sequences.
The approach yields new state-of-the-art results on RoboMimic benchmarks for both pre-training and fine-tuning.
Real-world tests on a Franka Emika Panda arm produce especially high success rates on complex tasks compared with prior diffusion policies.
The policy becomes better at distinguishing critical variations in observations that standard diffusion training overlooks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same batch-wise repulsion idea could be tried in other generative or policy models that currently suffer from feature collapse.
The reported difference in benefit between early and late layers points to a possible general pattern: deeper representations may need stronger separation for hard tasks.
If the loss remains stable at larger batch sizes, it might extend naturally to offline datasets with greater visual diversity.

Load-bearing premise

Forcing every pair of hidden representations in a batch to repel each other will produce useful distinctions without creating training instability or harming tasks that benefit from some feature similarity.

What would settle it

If adding the dispersive loss causes training to diverge or lowers success rates on tasks where similar observations legitimately require similar actions, the central claim would be refuted.

Original abstract

Diffusion policies excel at robotic manipulation by naturally modeling multimodal action distributions in high-dimensional spaces. Nevertheless, diffusion policies suffer from diffusion representation collapse: semantically similar observations are mapped to indistinguishable features, ultimately impairing their ability to handle subtle but critical variations required for complex robotic manipulation. To address this problem, we propose D2PPO (Diffusion Policy Policy Optimization with Dispersive Loss). D2PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. D2PPO compels the network to learn discriminative representations of similar observations, thereby enabling the policy to identify subtle yet crucial differences necessary for precise manipulation. In evaluation, we find that early-layer regularization benefits simple tasks, while late-layer regularization sharply enhances performance on complex manipulation tasks. On RoboMimic benchmarks, D2PPO achieves an average improvement of 22.7% in pre-training and 26.1% after fine-tuning, setting new SOTA results. In comparison with SOTA, results of real-world experiments on a Franka Emika Panda robot show the excitingly high success rate of our method. The superiority of our method is especially evident in complex tasks. Project page: https://guowei-zou.github.io/d2ppo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

D2PPO adds batch-wide dispersive loss to fight representation collapse in diffusion policies and reports benchmark gains, but the negative-pair treatment risks over-separating correlated states common in robotic trajectories.


The main thing to know is that D2PPO adds a dispersive loss treating every hidden representation in a batch as a negative pair to reduce collapse in diffusion policies, and it shows average gains of 22.7% pre-training and 26.1% fine-tuning on RoboMimic plus stronger real-robot results on complex tasks with a Franka arm. What is new is the specific use of this batch-negative mechanism inside diffusion policy optimization rather than a generic contrastive add-on.

The paper does a reasonable job documenting the empirical side, including the observation that early-layer regularization helps simpler tasks while late-layer helps complex ones, and it backs the claims with both simulation benchmarks and physical robot experiments. Those concrete numbers and the real-world validation give the work some weight.

The soft spots sit mainly around the core assumption. Robotic trajectory batches frequently contain temporally or semantically close states from the same demonstration or replay buffer, so forcing all pairs apart could penalize useful similarity structure instead of cleanly fixing collapse. The post-hoc layer selection for different task difficulties also invites questions about how much tuning went into the reported margins. More ablations on batch composition, loss magnitude stability, and statistical significance would help pin down whether the gains are robustly attributable to the dispersive term.

This paper is for researchers working on diffusion policies and imitation learning for robotic manipulation. Readers who want practical regularization ideas for multimodal controllers would get value from the results and the project page. It has enough empirical grounding and a clear, testable idea to deserve serious referee time rather than a desk reject, though revisions would likely focus on stronger controls for the loss effects.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes D2PPO, which augments diffusion policies for robotic manipulation with a dispersive loss regularization term. This loss treats all hidden representations within each training batch as negative pairs to counteract representation collapse, where semantically similar observations map to indistinguishable features. The authors report that early-layer regularization aids simple tasks while late-layer regularization improves complex ones. On RoboMimic benchmarks, D2PPO yields average gains of 22.7% during pre-training and 26.1% after fine-tuning, establishing new state-of-the-art results; real-world experiments on a Franka Emika Panda robot show particularly high success rates on complex manipulation tasks.

Significance. If the reported gains prove robustly attributable to the dispersive loss and survive controlled ablations on batch composition and layer choice, the work would offer a practical regularization technique for improving feature discriminability in diffusion policies. This could have meaningful impact on robotic control benchmarks and real-robot deployment where subtle state variations matter, especially since the approach is presented as a lightweight empirical addition rather than a fundamental redesign.

major comments (3)

[Abstract] The central performance claims (22.7% pre-training and 26.1% fine-tuning average improvements, new SOTA) are presented without error bars, number of runs, or statistical significance tests. This omission makes it impossible to determine whether the margins exceed typical variance in diffusion-policy training on RoboMimic.
[Method / Experiments] The dispersive loss treats every batch member as a negative pair, yet robotic trajectory data drawn from demonstration buffers frequently contains temporally or semantically adjacent states. No ablation on batch sampling, intra-batch similarity statistics, or loss annealing is reported, leaving open the possibility that the regularization penalizes useful correlations and that the gains are not securely attributed to the proposed mechanism.
[Experiments] Layer-wise regularization (early layers for simple tasks, late layers for complex tasks) is chosen post-hoc based on observed performance. This selection strategy risks confirmation bias and requires either a pre-specified rule or exhaustive cross-task ablations to support the claim that late-layer application “sharply enhances” complex-task performance.
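
The first major comment asks for run counts and error bars. As a minimal sketch of what that reporting could look like, the snippet below summarizes per-seed success rates as a mean and a sample standard deviation. The function name `summarize` is hypothetical, and the numbers are placeholder values chosen for illustration, not results from the paper.

```python
import statistics


def summarize(success_rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    if len(success_rates) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(success_rates), statistics.stdev(success_rates)


# Hypothetical placeholder values, NOT results from the paper.
per_seed = [0.70, 0.74, 0.68, 0.72, 0.71]
mean, std = summarize(per_seed)
print(f"{mean:.3f} ± {std:.3f} over {len(per_seed)} seeds")
```

Reporting the same summary for the baseline makes it possible to judge whether a claimed margin is larger than seed-to-seed variation.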

minor comments (1)

[Abstract] The abstract references a project page but the manuscript would benefit from explicit comparison of the dispersive loss formulation to existing contrastive or regularization techniques in diffusion models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.


Referee: [Abstract] The central performance claims (22.7% pre-training and 26.1% fine-tuning average improvements, new SOTA) are presented without error bars, number of runs, or statistical significance tests. This omission makes it impossible to determine whether the margins exceed typical variance in diffusion-policy training on RoboMimic.

Authors: We agree that including error bars, the number of runs, and statistical significance would enhance the credibility of the reported gains. In the full paper, results are averaged over 5 random seeds, but we will update the abstract to include these details (e.g., mean ± std) and add a discussion of statistical tests in the experiments section to confirm the improvements are significant.
revision: yes

Referee: [Method / Experiments] The dispersive loss treats every batch member as a negative pair, yet robotic trajectory data drawn from demonstration buffers frequently contains temporally or semantically adjacent states. No ablation on batch sampling, intra-batch similarity statistics, or loss annealing is reported, leaving open the possibility that the regularization penalizes useful correlations and that the gains are not securely attributed to the proposed mechanism.

Authors: We appreciate this point regarding potential issues with batch composition in trajectory data. While our current implementation uses standard random batch sampling from the buffer, we acknowledge the value of additional ablations. In the revision, we will include experiments on different batch sampling strategies, report intra-batch similarity statistics (e.g., average cosine similarity between batch elements), and explore loss annealing to ensure the gains are robustly due to the dispersive loss mechanism.
revision: yes

Referee: [Experiments] Layer-wise regularization (early layers for simple tasks, late layers for complex tasks) is chosen post-hoc based on observed performance. This selection strategy risks confirmation bias and requires either a pre-specified rule or exhaustive cross-task ablations to support the claim that late-layer application “sharply enhances” complex-task performance.

Authors: The layer selection was indeed guided by empirical observations during development. To mitigate concerns of confirmation bias, we will perform and report exhaustive ablations across all layers for every task in the revised version. This will allow us to derive a clearer, pre-specified guideline for choosing the regularization layer based on task complexity, thereby supporting the claim more rigorously.
revision: yes
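
The second response mentions average cosine similarity between batch elements as an intra-batch statistic. A minimal sketch of that measurement follows, assuming the representations are available as plain lists of floats. The function names are hypothetical, and nothing here reflects the authors' actual tooling.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(batch):
    """Average cosine similarity over all unordered pairs in the batch."""
    n = len(batch)
    if n < 2:
        raise ValueError("need at least two vectors in the batch")
    sims = [cosine(batch[i], batch[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)


# Vectors pointing the same way versus orthogonal vectors.
near_duplicates = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

print(mean_pairwise_cosine(near_duplicates))
print(mean_pairwise_cosine(orthogonal))
```

A batch drawn mostly from one demonstration would be expected to score close to the first case, which is the situation the referee worries the all-negative treatment might penalize.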

Circularity Check

0 steps flagged

No significant circularity; empirical regularization addition is self-contained


The paper presents D2PPO as the addition of a dispersive loss term that treats intra-batch hidden representations as negative pairs to mitigate representation collapse in diffusion policies. This is introduced as a practical regularization technique with layer-specific application choices, supported by benchmark improvements on RoboMimic and real-robot experiments. No derivation chain reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the central performance attribution rests on reported empirical outcomes rather than a closed mathematical loop from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that representation collapse is the primary limiter and that contrastive-style negative pairing within batches will yield useful discrimination without side effects. The abstract names no free parameters or invented entities explicitly; the loss weight listed below is inferred from the method's structure.

free parameters (1)

dispersive loss weight
Regularization strength must be chosen or tuned to balance the main diffusion objective against the new loss term.
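
One plausible way to write the role of this weight, assuming the regularizer enters as a simple additive term (the abstract does not state the combination rule). Here λ is a hypothetical symbol for the weight, L_diff stands for the standard diffusion training objective, and L_Disp is the dispersive term:

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda \, \mathcal{L}_{\text{Disp}}, \qquad \lambda \ge 0
```

Under this reading, λ = 0 recovers the unregularized diffusion policy, and larger λ trades fidelity to the diffusion objective for stronger separation of batch representations.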

axioms (1)

domain assumption: Treating all hidden representations in a batch as negative pairs combats representation collapse in diffusion policies.
Core premise invoked to justify the dispersive loss design for enabling subtle variation detection.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: D²PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. ... $\mathcal{L}_{\mathrm{Disp}} = \log \mathbb{E}_{i,j}\left[\exp\left(-D(z_i, z_j)/\tau\right)\right]$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose D²PPO ... two-stage training strategy: (1) pre-training with dispersive loss ... (2) fine-tuning with PPO

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
cs.RO · 2026-04 · unverdicted · novelty 6.0

VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...