pith. machine review for the scientific record.

arxiv: 2402.10885 · v3 · pith:YN5B5WVAnew · submitted 2024-02-16 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

Pith reviewed 2026-05-17 21:56 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV cs.LG
keywords robot manipulation · diffusion policy · 3D scene representation · RLBench · CALVIN benchmark · denoising transformer · viewpoint generalization · few-shot real-robot learning

The pith

A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3D Diffuser Actor as a way to combine diffusion-based action generation with 3D scene representations for robot control. Instead of directly outputting actions or using 2D image features, the model takes noised 3D pose trajectories as input and learns to predict the added noise while conditioning on aggregated 3D visual features, a language goal, and robot proprioception. This design produces higher success rates than prior policies on RLBench and CALVIN, with reported absolute gains of 18.1 percent in multi-view and 13.1 percent in single-view RLBench settings plus a 9 percent relative lift on CALVIN. The authors also demonstrate that the same model can be deployed on a physical manipulator after training on only a small number of real demonstrations. Through comparisons and ablations they attribute the gains to the use of 3D rather than 2D inputs, the diffusion objective rather than regression or classification, and tokenized rather than holistic scene embeddings.
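
The noise-prediction objective described above can be sketched with the standard DDPM formulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 8-by-7 trajectory shape, the `context` dictionary, the `alpha_bar` value of 0.5, and the placeholder `zero_predictor` are all hypothetical.

```python
import math
import random

def noise_trajectory(x0, alpha_bar, rng):
    """Forward diffusion: blend each clean pose value with Gaussian noise."""
    eps = [[rng.gauss(0.0, 1.0) for _ in pose] for pose in x0]
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    x_t = [[scale_signal * v + scale_noise * e for v, e in zip(pose, eps_row)]
           for pose, eps_row in zip(x0, eps)]
    return x_t, eps

def denoising_loss(predict_noise, x0, context, alpha_bar, rng):
    """Mean squared error between the sampled noise and the predicted noise."""
    x_t, eps = noise_trajectory(x0, alpha_bar, rng)
    eps_hat = predict_noise(x_t, context, alpha_bar)
    errors = [(p - e) ** 2
              for p_row, e_row in zip(eps_hat, eps)
              for p, e in zip(p_row, e_row)]
    return sum(errors) / len(errors)

# Hypothetical stand-in for the conditioned denoiser: it ignores the context
# and always predicts zero noise.
def zero_predictor(x_t, context, alpha_bar):
    return [[0.0 for _ in pose] for pose in x_t]

rng = random.Random(0)
clean_trajectory = [[0.0] * 7 for _ in range(8)]  # assumed: 8 poses, 7 values each
context = {"scene_tokens": None, "language": None, "proprioception": None}
loss = denoising_loss(zero_predictor, clean_trajectory, context, 0.5, rng)
```

A trained model would replace `zero_predictor` with a network that actually reads the scene tokens, language, and proprioception; the loss itself would stay the same under this formulation.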

Core claim

The central claim is that a denoising transformer operating on 3D scene tokens fused with language and proprioception can accurately predict noise in 3D robot pose trajectories and thereby produce policies that generalize across viewpoints better than 2D or non-diffusion alternatives, yielding the stated performance improvements on RLBench and CALVIN.

What carries the argument

A 3D denoising transformer that receives tokenized 3D scene embeddings from depth images together with language instructions and proprioception to output the noise estimate for noised 3D robot pose trajectories.
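
How sensed depth turns image pixels into 3D points can be sketched with standard pinhole back-projection. This is an assumption about the general lifting step, not a description of the paper's feature extractor: the function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the 2x2 depth image are hypothetical.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth image (metres, nested row lists) to camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip pixels with no valid depth reading
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Hypothetical 2x2 depth image and intrinsics, for illustration only.
depth = [[1.0, 2.0],
         [0.0, 4.0]]
points = unproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In a 3D policy of this kind, per-pixel image features would typically be attached to the lifted points so that each scene token carries both an appearance feature and a 3D location.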

If this is right

  • The policy outperforms both regression and classification objectives for action prediction.
  • Tokenized 3D scene embeddings outperform holistic non-tokenized 3D embeddings, and the model's attention design outperforms absolute attention.
  • The same architecture transfers from simulation benchmarks to real-robot control with only a handful of demonstrations.
  • Multi-view 3D inputs produce larger gains than single-view inputs on the evaluated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the 3D features stay reliable under distribution shift, the approach could reduce the need for extensive viewpoint-specific data collection in new environments.
  • The denoising formulation may allow the policy to represent multimodal action distributions more naturally than deterministic regressors, which could matter for tasks with multiple valid solutions.
  • Combining the 3D scene tokens with other sensor modalities such as tactile feedback could be a direct next step without changing the transformer backbone.
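
The multimodality point in the list above can be illustrated with a toy case. This sketch assumes a hypothetical one-dimensional task with two equally valid actions; it does not use the paper's model, and sampling from the demonstrations merely stands in for a generative policy.

```python
import random

rng = random.Random(0)
# Hypothetical 1-D task with two equally valid actions: left (-1) or right (+1).
demonstrations = [rng.choice([-1.0, 1.0]) for _ in range(1000)]

# The constant prediction that minimises mean squared error is the sample mean.
mse_optimal_action = sum(demonstrations) / len(demonstrations)

# Sampling from the empirical distribution stands in for a generative policy.
sampled_action = rng.choice(demonstrations)
```

The regression answer lands near zero, which is neither of the two demonstrated actions, while the sampled action is always one of them.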

Load-bearing premise

The 3D scene features extracted from depth images remain accurate and viewpoint-invariant even when camera placement or lighting differs from the training distribution.

What would settle it

Measure success rate on the same RLBench tasks but with cameras moved to new positions or under changed lighting conditions not present in training; a large drop relative to the reported numbers would falsify the generalization benefit of the 3D representation.
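
Whether a drop under shifted cameras is "large" can be judged against sampling noise in the success rate itself. The sketch below uses the Wilson score interval for a binomial proportion; the episode counts are hypothetical and are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return centre - half_width, centre + half_width

# Hypothetical episode counts, not results from the paper.
train_views = wilson_interval(successes=80, trials=100)
shifted_views = wilson_interval(successes=55, trials=100)

# A drop is treated as clear only if the two intervals do not overlap.
clear_drop = shifted_views[1] < train_views[0]
```

Non-overlapping intervals are a conservative criterion; a formal two-proportion test would be the stricter alternative.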

read the original abstract

Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently been shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene, a language instruction and proprioception to predict the noise in noised 3D robot pose trajectories. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase. It also learns to control a robot manipulator in the real world from a handful of demonstrations. Through thorough comparisons with the current SOTA policies and ablations of our model, we show 3D Diffuser Actor's design choices dramatically outperform 2D representations, regression and classification objectives, absolute attentions, and holistic non-tokenized 3D scene embeddings.
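
The abstract mixes absolute gains (RLBench) with a relative gain (CALVIN), and the two are not directly comparable. The following sketch shows the difference; the scores are hypothetical values chosen for easy arithmetic and are not taken from the paper.

```python
def absolute_gain(new, old):
    """Difference between two scores, in percentage points."""
    return new - old

def relative_gain(new, old):
    """Improvement expressed as a percentage of the old score."""
    return 100.0 * (new - old) / old

# Hypothetical scores chosen for easy arithmetic, not taken from the paper.
old_success, new_success = 60.0, 75.0
abs_gain = absolute_gain(new_success, old_success)  # 15.0 percentage points
rel_gain = relative_gain(new_success, old_success)  # 25.0 percent
```

The same improvement reads as 15 points in absolute terms and 25 percent in relative terms, so a relative figure always needs its baseline to be interpreted.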

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces 3D Diffuser Actor, a conditional diffusion policy for robot manipulation that employs a novel 3D denoising transformer to fuse 3D scene features (aggregated from depth images), language instructions, and proprioception for denoising noised 3D robot pose trajectories. It reports new state-of-the-art results on RLBench (absolute gains of 18.1% multi-view and 13.1% single-view over prior SOTA) and a 9% relative improvement on CALVIN, plus real-robot control from few demonstrations, with ablations showing advantages over 2D representations, regression/classification objectives, and holistic 3D embeddings.

Significance. If the reported gains prove robust under identical evaluation protocols, the work meaningfully advances diffusion-based policies by integrating explicit 3D scene representations, which prior results suggest improve viewpoint generalization. The inclusion of real-world validation and systematic ablations against 2D, regression, and non-tokenized 3D baselines strengthens the contribution; these elements provide concrete evidence that the 3D denoising transformer design is load-bearing for the observed performance.

major comments (3)
  1. [Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.
  2. [Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.
  3. [§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.
minor comments (2)
  1. [Figures] Figure captions and §4.3: clarify whether the visualized 3D tokens are per-point or per-voxel and how the denoising transformer attends across them.
  2. [Related Work] Related work: ensure all recent 3D representation papers for manipulation (beyond the cited diffusion and 3D works) are referenced for completeness.
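
The variability reporting requested in major comment 2 can be sketched as a mean and sample standard deviation over evaluation seeds. The per-seed success rates below are hypothetical, and the final comparison is a crude heuristic rather than a significance test.

```python
import statistics

def summarize(success_rates):
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)

# Hypothetical per-seed success rates (percent), for illustration only.
ours = [80.0, 82.0, 78.0, 81.0, 79.0]
baseline = [70.0, 72.0, 68.0, 71.0, 69.0]

ours_mean, ours_std = summarize(ours)
base_mean, base_std = summarize(baseline)

# Crude check: does the gap exceed the combined seed-to-seed spread?
gap_exceeds_spread = (ours_mean - base_mean) > (ours_std + base_std)
```

When baseline numbers are copied from earlier papers, as the rebuttal states, only a single value per baseline may be available, in which case the spread can be reported for the new method alone.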

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating planned revisions where appropriate to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.

    Authors: We thank the referee for this important clarification. The baseline numbers reported in our manuscript are taken directly from the original papers, following common practice in the field to ensure consistency with published protocols. Our method was evaluated using the exact task sets, demonstration counts, camera setups, and success metrics described in those works. We will add an explicit statement in the Experiments section and a clarifying footnote to the results tables in the revised manuscript. revision: yes

  2. Referee: [Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.

    Authors: We agree that reporting variability is essential for stochastic policies such as ours. While our primary results used a fixed random seed for reproducibility, we will update all tables to include standard deviations computed over 5 independent evaluation seeds and add a brief discussion of statistical significance in the revised Results section and abstract. revision: yes

  3. Referee: [§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.

    Authors: The viewpoint-invariance benefit is evidenced by our single-view versus multi-view comparisons and the consistent outperformance over 2D baselines, which already test generalization across camera configurations. We acknowledge that dedicated quantitative ablations on lighting variations and depth noise distributions were not included. We will add a targeted discussion in §3 and a supporting experiment in the appendix of the revised manuscript. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark gains with minor self-citation context but no load-bearing circularity

full rationale

The paper proposes a 3D Diffuser Actor architecture that fuses 3D scene features, language, and proprioception via a denoising transformer to model action distributions. All headline claims consist of measured success rates on RLBench and CALVIN benchmarks rather than any closed-form prediction or first-principles derivation. Prior diffusion-policy and 3D-representation papers are cited for motivation and architectural inspiration, yet those citations supply independent empirical precedents and do not substitute for the new model's training or evaluation protocol. No equation or result is shown to be definitionally equivalent to its own inputs, and the reported absolute/relative gains are obtained by direct comparison against re-implemented or published baselines under the same task definitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The performance claims rest on standard supervised learning assumptions plus the untested premise that 3D feature aggregation from depth is robust. No new physical entities are postulated.

free parameters (2)
  • diffusion noise schedule
    Standard hyperparameter in diffusion models; its specific values are not reported in the abstract.
  • number of denoising steps
    Typical diffusion training choice that affects both performance and compute.
axioms (1)
  • domain assumption: 3D scene features extracted from depth images are sufficiently accurate and generalizable across viewpoints
    Invoked when claiming superiority over 2D representations.
invented entities (1)
  • 3D denoising transformer (no independent evidence)
    purpose: Fuses 3D visual features, language, and proprioception to denoise robot pose trajectories
    New architectural component introduced in the paper; no independent evidence outside the reported experiments.
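
The two free parameters in the ledger above interact in the sampling loop: the noise schedule fixes how much signal survives each step, and the step count fixes how many denoiser calls one action costs. The sketch below assumes a linear schedule with start and end values borrowed from common DDPM defaults, 100 steps, and a placeholder denoiser; none of these are the paper's reported settings.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: signal variance kept."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddpm_sample(predict_noise, dim, betas, rng):
    """Ancestral DDPM sampling: start from noise, call the denoiser once per step."""
    alpha_bars = cumulative_alphas(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps_hat)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

# Hypothetical settings: 100 steps and a denoiser that always predicts zero noise.
betas = linear_beta_schedule(100)
alpha_bars = cumulative_alphas(betas)
sample = ddpm_sample(lambda x, t: [0.0] * len(x), dim=7, betas=betas,
                     rng=random.Random(0))
```

Halving the step count halves the number of denoiser calls per action, which is why the ledger notes that this choice affects compute as well as performance.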

pith-pipeline@v0.9.0 · 5568 in / 1337 out tokens · 24135 ms · 2026-05-17T21:56:13.325864+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing dimension_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints.

  • Foundation/DimensionForcing D3_has_linking echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  2. Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

    cs.RO 2026-04 unverdicted novelty 7.0

    RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.

  3. SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.

  4. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  5. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  6. SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...

  7. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  8. Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    cs.RO 2025-11 unverdicted novelty 6.0

    Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.

  9. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV 2025-07 unverdicted novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  10. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  11. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  12. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  13. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  14. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  15. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  16. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  17. VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation

    cs.RO 2025-09 unverdicted novelty 5.0

    VLBiMan framework enables generalizable bimanual manipulation from single human demonstrations via vision-language anchored task decomposition and adaptation without retraining.

  18. What Matters in Building Vision-Language-Action Models for Generalist Robots

    cs.RO 2024-12 unverdicted novelty 5.0

    Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.

  19. EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation

    cs.RO 2025-11 unverdicted novelty 4.0

    EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · cited by 19 Pith papers · 16 internal anchors

  1. [1]

    Generative Adversarial Imitation Learning

    J. Ho and S. Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016. URL http://arxiv.org/abs/1606.03476

  2. [2]

    Tsurumine and T

    Y . Tsurumine and T. Matsubara. Goal-aware generative adversarial imitation learning from imperfect demonstration for robotic cloth manipulation, 2022

  3. [4]

    URL http://arxiv.org/abs/1705.10479

  4. [5]

    N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone, 2022

  5. [6]

    Pearce, T

    T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V . Macua, S. Z. Tan, I. Momennejad, K. Hofmann, and S. Devlin. Imitating human behaviour with diffusion models, 2023

  6. [7]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023. 9

  7. [8]

    Reuss, M

    M. Reuss, M. Li, X. Jia, and R. Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023

  8. [9]

    Mandlekar, F

    A. Mandlekar, F. Ramos, B. Boots, L. Fei-Fei, A. Garg, and D. Fox. IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. CoRR, abs/1911.05321, 2019. URL http://arxiv.org/abs/1911.05321

  9. [10]

    Chernova and M

    S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using gaussian mixture models. In Proceedings of the 6th International Joint Conference on Au- tonomous Agents and Multiagent Systems, AAMAS ’07, New York, NY , USA, 2007. Associa- tion for Computing Machinery. ISBN 9788190426275. doi:10.1145/1329125.1329407. URL https://doi.or...

  10. [11]

    Florence, C

    P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. CoRR, abs/2109.00137, 2021. URL https: //arxiv.org/abs/2109.00137

  11. [12]

    James, Z

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  12. [13]

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

  13. [14]

    A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V . Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021

  14. [15]

    Huang, O

    H. Huang, O. Howell, X. Zhu, D. Wang, R. Walters, and R. Platt. Fourier transporter: Bi- equivariant robotic manipulation in 3d. In ICLR, 2024

  15. [16]

    James, K

    S. James, K. Wada, T. Laidlow, and A. J. Davison. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022

  16. [17]

    Shridhar, L

    M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023

  17. [18]

    Gervet, Z

    T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. CoRL, 2023

  18. [19]

    arXiv preprint arXiv:2306.14896 , year=

    A. Goyal, J. Xu, Y . Guo, V . Blukis, Y .-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. arXiv preprint arXiv:2306.14896, 2023

  19. [20]

    P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations, 2018

  20. [21]

    J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

  21. [22]

    Z. Xian, N. Gkanatsios, T. Gervet, T.-W. Ke, and K. Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning, pages 2323–2339. PMLR, 2023

  22. [23]

    Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024

  23. [24]

    Pomerleau

    D. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Proceedings of (NeurIPS) Neural Information Processing Systems , pages 305 – 313. Morgan Kaufmann, December 1989

  24. [25]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Mon- fort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016. URL http://arxiv.org/abs/1604.07316. 10

  25. [27]

    Y . Ding, C. Florensa, P. Abbeel, and M. Phielipp. Goal-conditioned imitation learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ c8d3a760ebab63156...

  26. [28]

    Guhur, S

    P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning , pages 175–187. PMLR, 2023

  27. [29]

    Z. J. Cui, Y . Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data. ArXiv, abs/2210.10047, 2022

  28. [30]

    D.-N. Ta, E. Cousineau, H. Zhao, and S. Feng. Conditional energy-based models for implicit policies: The gap between theory and practice, 2022

  29. [31]

    Gkanatsios, A

    N. Gkanatsios, A. Jain, Z. Xian, Y . Zhang, C. Atkeson, and K. Fragkiadaki. Energy-based models as zero-shot planners for compositional scene rearrangement. arXiv preprint arXiv:2304.14391, 2023

  30. [32]

    Sohl-Dickstein, E

    J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015

  31. [34]

    URL https://arxiv.org/abs/2006.11239

  32. [35]

    Singh, S

    S. Singh, S. Tu, and V . Sindhwani. Revisiting energy based models as policies: Ranking noise contrastive estimation and interpolating energy models, 2023

  33. [36]

    Salimans and J

    T. Salimans and J. Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop - ICLR 2021, 2021. URL https://openreview.net/forum?id= 9AS-TF2jRNb

  34. [37]

    H. Ryu, J. Kim, J. Chang, H. S. Ahn, J. Seo, T. Kim, J. Choi, and R. Horowitz. Diffusion-edfs: Bi-equivariant denoising generative modeling on se (3) for visual robotic manipulation. arXiv preprint arXiv:2309.02685, 2023

  35. [38]

    Urain, N

    J. Urain, N. Funk, J. Peters, and G. Chalvatzaki. Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5923–5930. IEEE, 2023

  36. [39]

    Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022

  37. [40]

    U. A. Mishra and Y . Chen. Reorientdiff: Diffusion model based reorientation for object manipulation. arXiv preprint arXiv:2303.12700, 2023

  38. [41]

    W. Liu, T. Hermans, S. Chernova, and C. Paxton. Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects. arXiv preprint arXiv:2211.04604, 2022

  39. [42]

    Simeonov, A

    A. Simeonov, A. Goyal, L. Manuelli, L. Yen-Chen, A. Sarmiento, A. Rodriguez, P. Agrawal, and D. Fox. Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement. arXiv preprint arXiv:2307.04751, 2023

  40. [43]

    X. Fang, C. R. Garrett, C. Eppner, T. Lozano-Pérez, L. P. Kaelbling, and D. Fox. Dimsam: Diffusion models as samplers for task and motion planning under partial observability. arXiv preprint arXiv:2306.13196, 2023

  41. [44]

    Kapelyukh, V

    I. Kapelyukh, V . V osylius, and E. Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. IEEE Robotics and Automation Letters, 2023. 11

  42. [45]

    Y . Dai, M. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023

  43. [46]

    A. Ajay, S. Han, Y . Du, S. Li, G. Abhi, T. Jaakkola, J. Tenenbaum, L. Kaelbling, A. Srivastava, and P. Agrawal. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023

  44. [47]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero- shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  45. [48]

    H. Chen, C. Lu, C. Ying, H. Su, and J. Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023

  46. [49]

    B. Yang, H. Su, N. Gkanatsios, T.-W. Ke, A. Jain, J. Schneider, and K. Fragkiadaki. Diffusion- es: Gradient-free planning with diffusion for autonomous driving and zero-shot instruction following. ArXiv, abs/2402.06559, 2024

  47. [50]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023

  48. [51]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  49. [52]

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

  50. [53]

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  51. [54]

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

  52. [55]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, A. Raffin, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf, B. Ichter, C. Lu, C. Xu, C. Finn, C. Xu, C. Chi, C. Huang, C. Chan, C. Pan, C. Fu, C. Devin, D. Driess, D. Pathak, D. Shah, D. Büchler, D. Kalashnikov, D. Sadigh, E. Johns, F. Ceola, F....

  53. [56]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

  54. [57]

    H. Liu, L. Lee, K. Lee, and P. Abbeel. Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431, 2022

  55. [58]

    Perceiver: General Perception with Iterative Attention

    A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver: General perception with iterative attention, 2021

  56. [59]

    Q-attention: Enabling Efficient Learning for Vision-Based Robotic Manipulation

    S. James and A. J. Davison. Q-attention: Enabling efficient learning for vision-based robotic manipulation. IEEE Robotics and Automation Letters, 7(2):1612–1619, 2022

  57. [60]

    H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023

  58. [61]

    Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks, 2020

  59. [62]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  60. [63]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  61. [64]

    Lepard: Learning Partial Point Cloud Matching in Rigid and Deformable Scenes

    Y. Li and T. Harada. Lepard: Learning partial point cloud matching in rigid and deformable scenes. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  62. [65]

    Analogy-Forming Transformers for Few-Shot 3D Parsing

    N. Gkanatsios, M. K. Singh, Z. Fang, S. Tulsiani, and K. Fragkiadaki. Analogy-forming transformers for few-shot 3d parsing. ArXiv, abs/2304.14382, 2023

  63. [66]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  64. [67]

    V-REP: A Versatile and Scalable Robot Simulation Framework

    E. Rohmer, S. P. Singh, and M. Freese. V-rep: A versatile and scalable robot simulation framework. In 2013 IEEE/RSJ international conference on intelligent robots and systems, pages 1321–1326. IEEE, 2013

  65. [68]

    J. J. Kuffner and S. M. LaValle. Rrt-connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pages 995–1001. IEEE, 2000

  66. [69]

    Y. Ze, G. Yan, Y.-H. Wu, A. Macaluso, Y. Ge, J. Ye, N. Hansen, L. E. Li, and X. Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. arXiv preprint arXiv:2308.16891, 2023

  67. [70]

    S. Chen, R. G. Pinel, C. Schmid, and I. Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation. ArXiv, abs/2309.15596, 2023

  68. [71]

    PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning

    E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016

  69. [72]

    O. Mees, L. Hermann, and W. Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters (RA-L), 7(4):11205– 11212, 2022

  70. [73]

    Language Conditioned Imitation Learning over Unstructured Data

    C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020

  71. [74]

    X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  72. [75]

    Reducing the Barrier to Entry of Complex Robotic Software: a MoveIt! Case Study

    D. Coleman, I. Sucan, S. Chitta, and N. Correll. Reducing the barrier to entry of complex robotic software: a moveit! case study. arXiv preprint arXiv:1404.3785, 2014

  73. [76]

    G. Qian, Y. Li, H. Peng, J. Mai, H. A. A. K. Hammoud, M. Elhoseiny, and B. Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In NeurIPS, 2022

  74. [77]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  75. [78]

    FiLM: Visual Reasoning with a General Conditioning Layer

    E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

    Appendix A Additional Experimental Results and Details
    A.1 Robustness to noisy depth information on RLBench ...

  76. [79]

    Open a drawer: The cabinet has three drawers (top, middle, and bottom). The agent is successful if the target drawer is opened. On average, the task involves three keyposes.

    Figure 3: Failure cases on RLBench in the GNFactor setup. We categorize the failure cases into 4 types: 1) precise pose prediction, where predicted end-effector poses are too imp...

  77. [80]

    Slide a block to a colored zone: There is one block and four zones with different colors (red, blue, pink, and yellow). The end-effector must push the block to the zone with the specified color. On average, the task involves approximately 4.7 keyposes

  78. [81]

    Sweep the dust into a dustpan: There are two dustpans of different sizes (short and tall). The agent needs to sweep the dirt into the specified dustpan. The task on average involves 4.6 keyposes

  79. [82]

    Take the meat off the grill frame: There is either a chicken leg or a steak. The agent needs to take the meat off the grill frame and put it on the side. The task involves 5 keyposes

  80. [83]

    Turn on the water tap: The water tap has a handle on each of its two sides. The agent needs to rotate the specified handle by 90°. The task involves 2 keyposes
