
arxiv: 2505.13255 · v6 · submitted 2025-05-19 · 💻 cs.RO

Policy Contrastive Decoding for Robotic Foundation Models

Pith reviewed 2026-05-22 14:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic foundation models · policy contrastive decoding · spurious correlations · object masking · generalization · training-free methods · robot policies

The pith

Policy Contrastive Decoding improves robotic policies by contrasting actions on original and object-masked images to cut spurious correlations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that robotic foundation models pick up spurious correlations from pre-training data, and that these correlations limit generalization to new tasks. It introduces Policy Contrastive Decoding (PCD), a training-free method that steers a policy toward object-relevant features by comparing action probabilities computed from the full visual input against those computed with the task objects masked. A reader would care because PCD is a simple plugin that can make existing generalist robot policies more reliable in simulation and real environments without retraining. Experiments on OpenVLA, Octo, and π0 show gains, including a reported 108% improvement for π0 in real-world settings. The approach works by redirecting the policy's focus away from misleading visual cues that do not support the core manipulation logic.

Core claim

Existing robot policies learn spurious correlations from pre-training trajectories, and these hurt generalization. Policy Contrastive Decoding addresses this by redirecting the policy's focus to object-relevant visual clues: it contrasts the action probability distribution obtained from the original input with the one obtained from an object-masked input. It functions as a training-free plugin that boosts performance across autoregressive and diffusion-based policies such as OpenVLA, Octo, and π0.

What carries the argument

Policy Contrastive Decoding (PCD), which contrasts action probability distributions derived from original and object-masked visual inputs to emphasize object-relevant clues.
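
The contrast step can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: this page does not reproduce Eq. (2), so the log-space form `(1 + alpha) * log p - alpha * log p_hat` is an assumption borrowed from common contrastive-decoding practice, and the function names and the three-bin action distributions are hypothetical. The assumed form is at least consistent with two properties stated in the source: α = 0 recovers the baseline policy (Figure 4 caption), and when the masked input is identical to the original the prediction reduces to the baseline (the Track2Mask failure-case discussion).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrast(p_orig, p_masked, alpha=1.0, eps=1e-12):
    """Contrast two discrete action distributions.

    ASSUMED form (not taken from the paper):
        softmax((1 + alpha) * log p - alpha * log p_hat)
    p_orig   -- action probabilities from the original observation
    p_masked -- action probabilities from the object-masked observation
    alpha    -- contrast strength; alpha = 0 returns p_orig
    """
    scores = [
        (1.0 + alpha) * math.log(p + eps) - alpha * math.log(q + eps)
        for p, q in zip(p_orig, p_masked)
    ]
    return softmax(scores)


# Hypothetical distributions over three discretized action bins.
p_orig = [0.5, 0.3, 0.2]    # full observation
p_masked = [0.6, 0.2, 0.2]  # target object masked out
p_pcd = contrast(p_orig, p_masked, alpha=1.0)
print([round(p, 3) for p in p_pcd])
```

In this toy case the baseline prefers bin 0, but that preference grows when the object is masked, which suggests it is driven by non-object cues; the contrast shifts the choice to bin 1. Diffusion-based policies do not expose discrete bins in this way, so the sketch applies only to the discretized, autoregressive case.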

If this is right

  • PCD can be added to different robot policies including autoregressive and diffusion-based ones without finetuning or weight access.
  • Performance gains appear in both simulation and real-world robot environments.
  • The method increases the success rate of the leading policy π0 by 8.9 percent in simulation and 108 percent in real settings.
  • It works as a flexible plugin that improves generalization beyond the original training trajectories.
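
To make the size of a "108 percent" gain concrete, here is a small sketch of how a relative improvement is computed. It assumes the reported figures are relative to the baseline success rate, which matches the wording of the Figure 3 caption ("108% performance improvement on the baseline"). The success rates below are hypothetical placeholders chosen for illustration, not numbers from the paper.

```python
def relative_gain(baseline, improved):
    """Relative improvement over a baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# HYPOTHETICAL success rates, not taken from the paper.
baseline_rate = 0.25
pcd_rate = 0.52

gain = relative_gain(baseline_rate, pcd_rate)            # relative change, in percent
points = (pcd_rate - baseline_rate) * 100.0              # absolute change, in percentage points
print(f"relative gain: {gain:.1f}%, absolute gain: {points:.1f} points")
```

A relative gain above 100% means the success rate more than doubled. The same relative figure can correspond to very different absolute changes depending on how low the baseline is.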

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contrast mechanism might transfer to other vision-action models that suffer from dataset biases in non-robotics domains such as autonomous driving.
  • Pairing PCD with light fine-tuning on masked data could amplify gains in tasks requiring fine object discrimination.
  • Extending the masking to dynamic elements like moving obstacles would test whether the method scales to more cluttered scenes.

Load-bearing premise

Masking task objects removes primarily spurious correlations while leaving the policy's core object-manipulation logic intact and without introducing new unintended biases through the contrast step.
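
The masking step this premise depends on can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's Track2Mask module: the function name, the `(top, left, bottom, right)` half-open box convention, and the constant fill value are all hypothetical, and the paper's pipeline relies on a detector, tracking, and inpainting rather than a fixed box.

```python
def mask_box(image, box, fill=0):
    """Return a copy of a 2D image (list of rows) with a box region filled.

    box is (top, left, bottom, right), half-open, in pixel indices.
    This convention is an illustrative assumption.
    """
    top, left, bottom, right = box
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r in range(max(top, 0), min(bottom, len(image))):
        for c in range(max(left, 0), min(right, len(image[r]))):
            masked[r][c] = fill
    return masked


# Hypothetical 3x4 grayscale image and a box covering the "object".
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
masked = mask_box(image, (0, 1, 2, 3))
for row in masked:
    print(row)
```

An empty box leaves the image unchanged, which mirrors the detection-failure case: if nothing is masked, the two inputs to the contrast are identical. The premise is at risk in the opposite direction, when the filled region also removes context the policy legitimately needs.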

What would settle it

Applying PCD to a new task where the masked objects are central to correct actions rather than spurious cues, and observing that success rates drop below the unmodified policy baseline.

Figures

Figures reproduced from arXiv: 2505.13255 by Heng Tao Shen, Jingkuan Song, Ji Zhang, Junlin Xie, Lianli Gao, Shihan Wu, Xu Luo.

Figure 1
Robot policies tend to spuriously correlate task-irrelevant features with actions, compromising their ability to generalize to unseen scenarios. As observed, changing the light position from (a) to (b) and the drawer handle position from (a) to (c) results in 36% and 32% drops in the performance of the baseline policy OpenVLA (Kim et al., 2024), respectively. (d) Attention map. More results are in Secti…
Figure 2
Overview of our proposed Policy Contrastive Decoding (PCD) approach. PCD serves as a plugin to redirect the robot policy's focus toward object-relevant visual cues by contrasting action probability distributions derived from original observations p and object-masked observations p̂. For illustrative purposes, we visualize the predictions only in the ∆x and ∆y dimensions of the robot action space [∆x, ∆y, ∆…
Figure 3
Real-world Performance. The target objects in the initial observation are automatically annotated by Grounding DINO (Liu et al., 2024b). PCD delivers a remarkable 108% performance improvement on the baseline, though it incurs a 24% increase in time cost.
Figure 4
Ablation studies on (a) the hyperparameter α in Eq. (2), (b) the object detection schemes and (c) object inpainting strategies in Track2Mask. α = 0 in (a) and the black dotted lines in (b)(c) represent the performance of the baseline policies. The results are averaged over the 9 simulation tasks. PCD consistently improves the three policies when α > 0 and exhibits low sensitivity to changes in off-the-shel…
Figure 5
Performance of baseline policies integrated w/ or w/o our proposed PCD approach in …
Figure 6
CAM visualization of failure cases.
Appendix B, Details of the Track2Mask Module: In our devised Track2Mask module, the target object specified by the language instruction in the initial observation can be annotated using Point and Box prompts, along with off-the-shelf open-vocabulary object detection models, as illustrated in …
Figure 7
Illustration of the Track2Mask module.
Footnote 1: We only visualize the CAM results of OpenVLA, as the action prediction mechanism of diffusion-based models (e.g., Octo and π0) makes it difficult to produce their CAM results.
Figure 8
We conduct 20 trials for each real-world task, randomizing the task configurations in every …
Figure 9
Failure cases of the Track2Mask module.
Appendix G, Failure Cases of Track2Mask: From the Track2Mask pipeline presented in Appendix Fig. 7, the failure cases fall into two categories: a) Object Detection Failure: the off-the-shelf open-vocabulary detector (i.e., GDINO) fails to localize objects in the initial observation; b) Incomplete Object Masking: the target objects along the trajectory are partially masked, as show…
Figure 10
Object masking results of Track2Mask using different object annotation strategies.
Figure 11
Various magnitudes of distractors.
Appendix K, Performance in Multi-Perspective Scenarios: In this part, we conduct an experiment to investigate PCD's effectiveness in multi-perspective scenarios. Specifically, we leverage the miniVLA+VQ h8+Wrist (Belkhale and Sadigh, 2024) …
Original abstract

Robotic foundation models, or generalist robot policies, hold immense potential to enable flexible, general-purpose and dexterous robotic systems. Despite their advancements, our empirical experiments reveal that existing robot policies are prone to learning spurious correlations from pre-training trajectories, adversely affecting their generalization capabilities beyond the training data. To tackle this, we propose a novel Policy Contrastive Decoding (PCD) approach, which redirects the robot policy's focus toward object-relevant visual clues by contrasting action probability distributions derived from original and object-masked visual inputs. As a training-free method, our PCD can be used as a plugin to improve different types of robot policies without needing to finetune or access model weights. We conduct extensive experiments on top of three open-source robot policies, including the autoregressive policy OpenVLA and the diffusion-based policies Octo and $\pi_0$. The obtained results in both simulation and real-world environments prove PCD's flexibility and effectiveness, e.g., PCD enhances the state-of-the-art policy $\pi_0$ by 8.9% in the simulation environment and by 108% in the real-world environment. Code and demos are publicly available at: https://koorye.github.io/PCD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Policy Contrastive Decoding (PCD), a training-free plugin method that improves robotic foundation models by contrasting action probability distributions obtained from original visual inputs versus object-masked inputs, thereby redirecting focus away from spurious pre-training correlations. The approach is demonstrated on three policies (autoregressive OpenVLA and diffusion-based Octo and π0) with reported gains in both simulation and real-world settings, including an 8.9% improvement for the state-of-the-art π0 policy in simulation and a 108% improvement in real-world environments.

Significance. If the central empirical claims hold under rigorous verification, the work would be significant for the robotics community. It offers a practical, zero-shot enhancement to existing generalist policies without requiring model fine-tuning or weight access, directly addressing the prevalent issue of spurious correlations in visual robotic policies. The flexibility across policy architectures and the public release of code and demos support broader adoption and reproducibility.

major comments (2)
  1. [Abstract / Experimental Results] The reported gains (e.g., 108% real-world improvement for π0) are stated without accompanying details on the number of trials, statistical significance testing, standard deviations across runs, or explicit controls for object-masking quality and potential side effects of masking on spatial context. These omissions are load-bearing for the central claim that PCD specifically mitigates spurious correlations rather than introducing regularization or altered action biases.
  2. [Method / PCD Formulation] The contrast operation is defined directly as a difference between forward passes on original and masked inputs, yet no derivation or invariance argument is supplied showing that p(a|original) − λ·p(a|masked) reliably preserves core object-manipulation logic while attenuating only spurious cues. This is particularly relevant for diffusion policies (Octo, π0), where masking can alter distribution support or correlate with valid actions.
minor comments (2)
  1. [Method] The scaling hyperparameter λ and the precise mathematical form of the contrast (including any normalization or temperature terms) should be stated explicitly with an equation to improve reproducibility.
  2. [Experiments] Figure captions and experimental tables would benefit from clearer indication of whether results are averaged over multiple seeds or single runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of our work on Policy Contrastive Decoding. We have carefully addressed each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experimental Results] The reported gains (e.g., 108% real-world improvement for π0) are stated without accompanying details on the number of trials, statistical significance testing, standard deviations across runs, or explicit controls for object-masking quality and potential side effects of masking on spatial context. These omissions are load-bearing for the central claim that PCD specifically mitigates spurious correlations rather than introducing regularization or altered action biases.

    Authors: We agree that additional experimental details are necessary to rigorously support the reported improvements and to confirm that PCD specifically addresses spurious correlations. In the revised manuscript, we will expand the Experimental Results section (and update the abstract accordingly) to report the number of trials (50 per task in simulation, 20 in real-world), results from statistical significance testing (paired t-tests with p-values), standard deviations across runs, and explicit controls/ablation studies on the object-masking procedure to address potential side effects on spatial context or action biases. revision: yes

  2. Referee: [Method / PCD Formulation] The contrast operation is defined directly as a difference between forward passes on original and masked inputs, yet no derivation or invariance argument is supplied showing that p(a|original) − λ·p(a|masked) reliably preserves core object-manipulation logic while attenuating only spurious cues. This is particularly relevant for diffusion policies (Octo, π0), where masking can alter distribution support or correlate with valid actions.

    Authors: We acknowledge that a formal derivation would improve the clarity and theoretical grounding of the PCD formulation. In the revised Method section, we will add a brief derivation framing the contrast as a feature-reweighting operation that attenuates non-object cues while preserving manipulation-relevant action probabilities. For diffusion policies, we will include an analysis of how masking interacts with the denoising process and why the contrast remains effective, supported by new ablations showing preservation of core logic. revision: yes

Circularity Check

0 steps flagged

No significant circularity; PCD is a direct definitional contrast without reduction to inputs or self-citations

full rationale

The paper defines Policy Contrastive Decoding explicitly as a training-free contrast between action probability distributions from original versus object-masked visual inputs, applied as a plugin to existing policies (OpenVLA, Octo, π0). This is a procedural definition rather than a claimed derivation or first-principles prediction that reduces by construction to fitted parameters, self-referential quantities, or load-bearing self-citations. No equations or steps in the provided text show a prediction that is statistically forced by prior fitting within the same paper, nor does the central claim rely on a uniqueness theorem imported from the authors' prior work. Empirical gains are reported from external experiments and do not form a closed loop with the method's definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that object masking selectively removes spurious visual correlations without destroying task-critical information.

axioms (1)
  • domain assumption: Object masking isolates spurious correlations from task-relevant visual features
    Invoked as the basis for the contrast operation that redirects policy focus.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

    cs.RO 2026-05 conditional novelty 6.0

    PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823,

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, et al. Pi-0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164,

  3. [3]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817,

  4. [4]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818,

  5. [5]

    DINOv2: Learning Robust Visual Features without Supervision

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

  6. [6]

    HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425,

  7. [7]

    Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha...

  8. [8]

    Classifier-Free Diffusion Guidance

    [Page header: Published as a conference paper at ICLR 2026] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  9. [9]

    OpenVLA: An Open-Source Vision-Language-Action Model

    S. Karamcheti, S. Nair, A. Balakrishna, et al. Prismatic vlms: Investigating the design space of visually-conditioned language model. In ICML, page 235, 2024a. S. Karamcheti, S. Nair, A. Balakrishna, et al. Prismatic vlms: Investigating the design space of visually-conditioned language models. In ICML, page 235, 2024b. M. J. Kim, K. Pertsch, S. Karamcheti, ...

  10. [10]

    Segment anything model 2 (sam2)

    Alexander Kirillov, Xinyu Liu, and Kaiming He. Segment anything model 2 (sam2). arXiv preprint arXiv:2304.01492,

  11. [11]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941,

  12. [12]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS, 36:44776–44791, 2023a. Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv pr...

  13. [14]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 11:1–31,

  14. [15]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747,

  15. [16]

    A Survey of Hallucination in Large Foundation Models

    Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922,

  16. [17]

    O. M. Team, D. Ghosh, H. Walke, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213,

  17. [18]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b. Maya Varma, Jean-Benoi...

  18. [19]

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715,

  19. [20]

    Spurious Correlations in Machine Learning: A Survey

    Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, and Aidong Zhang. Spurious correlations in machine learning: A survey. arXiv preprint arXiv:2402.12715,

  20. [21]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693,

  21. [22]

    X. Zhai, B. Mustafa, A. Kolesnikov, et al. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023a. Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Siglip: Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023b. Ji Zhang, Shihan Wu, Xu L...

  22. [23]

    A Survey on Efficient Inference for Large Language Models

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294,

  23. [24]

    A CAM VISUALIZATION OF FAILURE CASES We present the CAM (Selvaraju et al.,

  24. [25]

    C DETAILS OF THE EXPERIMENTAL SETUP, C.1 BASELINE POLICIES: Simulation Experiments

    makes it difficult to produce their CAM results. C DETAILS OF THE EXPERIMENTAL SETUP, C.1 BASELINE POLICIES: Simulation Experiments. We conduct simulation experiments using three diverse robot policies in the SIMPLER environment, including the autoregressive policy OpenVLA (Kim et al., 2024), and diffusion-based ...

  25. [26]

    • OpenVLA: a vision-language-action model with 7 billion parameters, trained on 970,000 episodes of robotic demonstrations from the Open X-Embodiment dataset

    and π0 (Black et al., 2024). • OpenVLA: a vision-language-action model with 7 billion parameters, trained on 970,000 episodes of robotic demonstrations from the Open X-Embodiment dataset. This policy is fine-tuned using the pre-trained Prismatic (Karamcheti et al., 2024b) model (footnote 2). • Octo (base): an open-source generalist policy with 93 million parameters...

  26. [27]

    stack the green cube on the yellow cube

    D PERFORMANCE AT THE MAXIMUM STEP: In Table 1, tasks completed before the predefined maximum step are included in the success rate calculation, and the task is terminated upon completion. Table 4 presents success rates computed at the maximum step, irrespective of whether tasks were completed earlier. As seen, PCD demonstrates its advantages by improving the...

  27. [28]

    E COMPUTATIONAL OVERHEAD: Table 5 reports the computational overhead of three baseline policies integrated w/ or w/o our PCD method in the SIMPLER environment. 1) Inference Latency. PCD approximately doubles the … Table 5: Computational overhead of three baseline policies integrated w/ or w/o our PCD method in th...

  28. [29]

    pick coke can

    According to Equation 2, when the first failure case occurs, PCD's prediction becomes equivalent to the original baseline result. We conduct an ablation study to assess the impact of the incomplete masking failure cases in Table 8, where β indicates the ratio of masked pixels manually excluded. As can be seen, PCD's performance progressively declines as β ...

  29. [30]

    policy, which incorporates both third-person perspective and wrist images as input, as our baseline. The achieved results on five randomly sampled tasks from the LIBERO-90 (Liu et al., 2023a) benchmark are shown in Table 11, demonstrating that PCD consistently improves the baseline policy across all five tasks...