arxiv: 2402.10885 · v3 · pith:YN5B5WVAnew · submitted 2024-02-16 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

Tsung-Wei Ke, Nikolaos Gkanatsios, Katerina Fragkiadaki

Pith reviewed 2026-05-17 21:56 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG

keywords robot manipulation · diffusion policy · 3D scene representation · RLBench · CALVIN benchmark · denoising transformer · viewpoint generalization · few-shot real-robot learning

The pith

A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3D Diffuser Actor as a way to combine diffusion-based action generation with 3D scene representations for robot control. Instead of directly outputting actions or using 2D image features, the model takes noised 3D pose trajectories as input and learns to predict the added noise while conditioning on aggregated 3D visual features, a language goal, and proprioception. This design produces higher success rates than prior policies on RLBench and CALVIN, with reported absolute gains of 18.1 percentage points in multi-view and 13.1 percentage points in single-view RLBench settings, plus a 9 percent relative improvement on CALVIN. The authors also demonstrate that the same model can be deployed on a physical manipulator after training on only a small number of real demonstrations. Through comparisons and ablations they attribute the gains to the use of 3D rather than 2D inputs, the diffusion objective rather than regression or classification, and tokenized rather than holistic scene embeddings.
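
The training objective described above can be sketched in a few lines. This is a minimal illustration that assumes a DDPM-style closed-form forward process and a mean-squared-error noise-prediction loss; the toy schedule, the flat list standing in for a trajectory, and the names `alpha_bars`, `add_noise`, and `noise_prediction_loss` are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


# Hypothetical flattened trajectory: two 3D positions (x, y, z, x, y, z).
rng = random.Random(0)
x0 = [0.1, 0.2, 0.3, 0.15, 0.25, 0.35]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
abar = alpha_bars([0.01, 0.02, 0.03])  # assumed toy schedule
x_t = add_noise(x0, eps, abar[-1])

# A trained denoiser would map (x_t, t, scene tokens, language, proprioception)
# to an estimate of eps. An oracle that returns eps itself has zero loss.
oracle_loss = noise_prediction_loss(eps, eps)
zero_guess_loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In a real policy the oracle is replaced by the conditioned network, and the loss is averaged over randomly drawn diffusion steps and demonstrations.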

Core claim

The central claim is that a denoising transformer operating on 3D scene tokens fused with language and proprioception can accurately predict noise in 3D robot pose trajectories and thereby produce policies that generalize across viewpoints better than 2D or non-diffusion alternatives, yielding the stated performance improvements on RLBench and CALVIN.

What carries the argument

A 3D denoising transformer that receives tokenized 3D scene embeddings from depth images together with language instructions and proprioception to output the noise estimate for noised 3D robot pose trajectories.
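
At inference time, a noise estimate of this kind is typically turned into actions by iterating a reverse step. The sketch below shows one DDPM-style reverse update using the standard posterior-mean formula. Whether the paper uses this exact sampler is not stated on this page, and `reverse_step` is an illustrative name.

```python
import math


def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t=0.0, z=None):
    """One DDPM-style update from x_t to x_{t-1} given a noise estimate.

    mean = (x_t - (1 - alpha_t) / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    sigma_t * z adds fresh Gaussian noise; it is zero at the final step.
    """
    z = z if z is not None else [0.0] * len(x_t)
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (x - coef * e) / math.sqrt(alpha_t) + sigma_t * n
        for x, e, n in zip(x_t, eps_hat, z)
    ]


# Sanity check at the first diffusion step, where alpha_bar_1 == alpha_1:
# with an exact noise estimate the update recovers the clean value.
alpha_1 = 0.99
x0 = [0.4, -0.2, 0.7]     # hypothetical clean 3D position
eps = [0.5, -1.0, 0.25]   # noise that was added
x_1 = [math.sqrt(alpha_1) * x + math.sqrt(1.0 - alpha_1) * e
       for x, e in zip(x0, eps)]
recovered = reverse_step(x_1, eps, alpha_t=alpha_1, alpha_bar_t=alpha_1)
```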

If this is right

The policy outperforms both regression and classification objectives for action prediction.
Tokenized 3D scene embeddings outperform holistic, non-tokenized 3D embeddings, and the model's attention design outperforms absolute attention.
The same architecture transfers from simulation benchmarks to real-robot control with only a handful of demonstrations.
Multi-view 3D inputs produce larger gains than single-view inputs on the evaluated tasks.
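
One reading of the contrast with absolute attention is that attention scores should depend on the offset between tokens rather than on their absolute coordinates. The sketch below illustrates that property with a rotary-style position encoding on a single 2D feature pair and a 1D position. This is an assumed stand-in chosen for illustration; the page does not state which attention scheme the paper uses.

```python
import math


def rotate(vec, position, freq=1.0):
    """Rotate a 2D feature by an angle proportional to the token position."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def rotary_score(query, key, q_pos, k_pos):
    """Dot product of position-rotated query and key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]


query, key = (0.3, -1.2), (0.8, 0.5)   # hypothetical token features
score_a = rotary_score(query, key, q_pos=0.2, k_pos=0.9)
# Shift both tokens by the same amount, as a global translation would.
score_b = rotary_score(query, key, q_pos=5.2, k_pos=5.9)
```

The two scores agree up to floating-point error because the product of the two rotations depends only on `k_pos - q_pos`.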

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 3D features stay reliable under distribution shift, the approach could reduce the need for extensive viewpoint-specific data collection in new environments.
The denoising formulation may allow the policy to represent multimodal action distributions more naturally than deterministic regressors, which could matter for tasks with multiple valid solutions.
Combining the 3D scene tokens with other sensor modalities such as tactile feedback could be a direct next step without changing the transformer backbone.
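
The point about multimodal action distributions can be illustrated with a toy case. The example assumes a hypothetical task in which demonstrations pass an obstacle either on the left (-1.0) or on the right (+1.0); the numbers are invented for illustration.

```python
import statistics

# Hypothetical demonstrations: half pass on the left, half on the right.
demo_actions = [-1.0] * 50 + [1.0] * 50


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# The constant that minimises mean squared error is the mean of the targets.
regression_output = statistics.fmean(demo_actions)
loss_at_mean = mse(regression_output, demo_actions)

# Distance from the regression output to the nearest demonstrated action.
gap_to_nearest_mode = min(abs(regression_output - a) for a in set(demo_actions))

# A sampler that matches the data distribution only ever returns -1.0 or +1.0,
# so every sample coincides with a demonstrated mode.
```

The deterministic regressor settles on an action that no demonstration contains, which is the failure a distribution-matching policy is meant to avoid.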

Load-bearing premise

The 3D scene features extracted from depth images remain accurate and viewpoint-invariant even when camera placement or lighting differs from the training distribution.

What would settle it

Measure success rate on the same RLBench tasks but with cameras moved to new positions or under changed lighting conditions not present in training; a large drop relative to the reported numbers would falsify the generalization benefit of the 3D representation.
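
The proposed test amounts to comparing success rates across two camera conditions. A minimal sketch, assuming per-episode boolean outcomes; the episode results and the helper names are hypothetical placeholders, not measured data.

```python
def success_rate(outcomes):
    """Fraction of episodes that succeeded."""
    return sum(outcomes) / len(outcomes)


def absolute_drop(in_distribution, shifted):
    """Drop in success rate, in percentage points, from training to shifted cameras."""
    return 100.0 * (success_rate(in_distribution) - success_rate(shifted))


# Placeholder outcomes (True = task solved); real values would come from rollouts.
train_cameras = [True] * 8 + [False] * 2
moved_cameras = [True] * 5 + [False] * 5
drop = absolute_drop(train_cameras, moved_cameras)
```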

Original abstract

Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene, a language instruction and proprioception to predict the noise in noised 3D robot pose trajectories. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase. It also learns to control a robot manipulator in the real world from a handful of demonstrations. Through thorough comparisons with the current SOTA policies and ablations of our model, we show 3D Diffuser Actor's design choices dramatically outperform 2D representations, regression and classification objectives, absolute attentions, and holistic non-tokenized 3D scene embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces 3D Diffuser Actor, a conditional diffusion policy for robot manipulation that employs a novel 3D denoising transformer to fuse 3D scene features (aggregated from depth images), language instructions, and proprioception for denoising noised 3D robot pose trajectories. It reports new state-of-the-art results on RLBench (absolute gains of 18.1% multi-view and 13.1% single-view over prior SOTA) and a 9% relative improvement on CALVIN, plus real-robot control from few demonstrations, with ablations showing advantages over 2D representations, regression/classification objectives, and holistic 3D embeddings.
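
Because the summary mixes absolute and relative gains, it helps to spell out the difference. The baseline values below are hypothetical placeholders chosen only to show the arithmetic; they are not the numbers reported in the paper.

```python
def absolute_gain(new, old):
    """Difference in percentage points."""
    return new - old


def relative_gain(new, old):
    """Improvement expressed as a percentage of the old value."""
    return 100.0 * (new - old) / old


# Hypothetical success rates in percent.
baseline, improved = 60.0, 78.1
abs_pts = absolute_gain(improved, baseline)   # percentage points
rel_pct = relative_gain(improved, baseline)   # percent of the baseline
```

With these placeholder values the same improvement reads as roughly 18.1 points in absolute terms and roughly 30 percent in relative terms, so the two kinds of figure are not directly comparable.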

Significance. If the reported gains prove robust under identical evaluation protocols, the work meaningfully advances diffusion-based policies by integrating explicit 3D scene representations, which prior results suggest improve viewpoint generalization. The inclusion of real-world validation and systematic ablations against 2D, regression, and non-tokenized 3D baselines strengthens the contribution; these elements provide concrete evidence that the 3D denoising transformer design is load-bearing for the observed performance.

major comments (3)

[Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.
[Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.
[§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.
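
The variability that the second comment asks for can be reported with a few lines of bookkeeping. The sketch assumes one success rate per evaluation seed; the five values are placeholders, not results from the paper.

```python
import math
import statistics

# Placeholder success rates (percent), one per evaluation seed.
per_seed = [80.0, 82.0, 79.0, 83.0, 81.0]

mean = statistics.fmean(per_seed)
std = statistics.stdev(per_seed)            # sample standard deviation (n - 1)
sem = std / math.sqrt(len(per_seed))        # standard error of the mean

summary = f"{mean:.1f} ± {std:.1f} (n={len(per_seed)})"
```

A reported gain is only convincing when it is large relative to this spread.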

minor comments (2)

[Figures] Figure captions and §4.3: clarify whether the visualized 3D tokens are per-point or per-voxel and how the denoising transformer attends across them.
[Related Work] Related work: ensure all recent 3D representation papers for manipulation (beyond the cited diffusion and 3D works) are referenced for completeness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating planned revisions where appropriate to improve clarity and rigor.

Point-by-point responses

Referee: [Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.

Authors: We thank the referee for this important clarification. The baseline numbers reported in our manuscript are taken directly from the original papers, following common practice in the field to ensure consistency with published protocols. Our method was evaluated using the exact task sets, demonstration counts, camera setups, and success metrics described in those works. We will add an explicit statement in the Experiments section and a clarifying footnote to the results tables in the revised manuscript.
revision: yes

Referee: [Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.

Authors: We agree that reporting variability is essential for stochastic policies such as ours. While our primary results used a fixed random seed for reproducibility, we will update all tables to include standard deviations computed over 5 independent evaluation seeds and add a brief discussion of statistical significance in the revised Results section and abstract.
revision: yes

Referee: [§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.

Authors: The viewpoint-invariance benefit is evidenced by our single-view versus multi-view comparisons and the consistent outperformance over 2D baselines, which already test generalization across camera configurations. We acknowledge that dedicated quantitative ablations on lighting variations and depth noise distributions were not included. We will add a targeted discussion in §3 and a supporting experiment in the appendix of the revised manuscript.
revision: partial

Circularity Check

0 steps flagged

Empirical benchmark gains with minor self-citation context but no load-bearing circularity

full rationale

The paper proposes a 3D Diffuser Actor architecture that fuses 3D scene features, language, and proprioception via a denoising transformer to model action distributions. All headline claims consist of measured success rates on RLBench and CALVIN benchmarks rather than any closed-form prediction or first-principles derivation. Prior diffusion-policy and 3D-representation papers are cited for motivation and architectural inspiration, yet those citations supply independent empirical precedents and do not substitute for the new model's training or evaluation protocol. No equation or result is shown to be definitionally equivalent to its own inputs, and the reported absolute/relative gains are obtained by direct comparison against re-implemented or published baselines under the same task definitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The performance claims rest on standard supervised learning assumptions plus the untested premise that 3D feature aggregation from depth is robust. No new physical entities are postulated.

free parameters (2)

diffusion noise schedule
Standard hyperparameter in diffusion models; its specific values are not reported in the abstract.
number of denoising steps
Typical diffusion training choice that affects both performance and compute.
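
To make the two free parameters concrete, the sketch below builds a linear beta schedule and reports how much of the clean signal survives at the final step. The linear form and the endpoint values 1e-4 and 0.02 are common DDPM defaults used here as assumptions; the page does not report the paper's actual schedule or step count.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def terminal_signal(betas):
    """sqrt of the cumulative product of (1 - beta): weight on x0 at the last step."""
    return math.sqrt(math.prod(1.0 - b for b in betas))


# Fewer steps with the same endpoints leave more of the clean signal at the end,
# so the schedule and the step count have to be chosen together.
signal_100 = terminal_signal(linear_betas(100))
signal_1000 = terminal_signal(linear_betas(1000))
```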

axioms (1)

domain assumption: 3D scene features extracted from depth images are sufficiently accurate and generalizable across viewpoints
Invoked when claiming superiority over 2D representations.
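
This axiom concerns features lifted from depth images. The usual first step in such pipelines is unprojecting pixels into 3D with a pinhole camera model, sketched below. The intrinsics, the tiny depth map, and the function name are illustrative assumptions, and the paper's actual feature pipeline is not described on this page.

```python
def unproject(depth, fx, fy, cx, cy):
    """Lift a depth map (rows of metres) to camera-frame 3D points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth[v][u].
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Hypothetical 2x2 depth map with one invalid pixel and toy intrinsics.
depth_map = [[2.0, 2.0],
             [0.0, 4.0]]
cloud = unproject(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In principle, moving the camera changes only the extrinsic transform applied to these points, not the scene geometry they describe. That holds only as long as the depth readings themselves stay accurate, which is the assumption flagged here.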

invented entities (1)

3D denoising transformer · no independent evidence
purpose: Fuses 3D visual features, language, and proprioception to denoise robot pose trajectories
New architectural component introduced in the paper; no independent evidence outside the reported experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DimensionForcing dimension_forced · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints.

Foundation/DimensionForcing D3_has_linking · echoes

We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO · 2026-04 · unverdicted · novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
cs.RO · 2026-04 · unverdicted · novelty 7.0

RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.

SID: Sliding into Distribution for Robust Few-Demonstration Manipulation
cs.RO · 2026-05 · unverdicted · novelty 6.0

SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO · 2026-05 · unverdicted · novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO · 2026-05 · unverdicted · novelty 6.0

StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
cs.CV · 2026-04 · unverdicted · novelty 6.0

SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO · 2026-01 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
cs.RO · 2025-11 · unverdicted · novelty 6.0

Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
cs.CV · 2025-07 · unverdicted · novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
cs.RO · 2025-06 · unverdicted · novelty 6.0

RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
cs.CV · 2025-03 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO · 2025-02 · unverdicted · novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV · 2024-12 · unverdicted · novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
cs.RO · 2024-10 · unverdicted · novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

R3D: Revisiting 3D Policy Learning
cs.CV · 2026-04 · unverdicted · novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
cs.RO · 2026-04 · unverdicted · novelty 5.0

Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation
cs.RO · 2025-09 · unverdicted · novelty 5.0

VLBiMan framework enables generalizable bimanual manipulation from single human demonstrations via vision-language anchored task decomposition and adaptation without retraining.

What Matters in Building Vision-Language-Action Models for Generalist Robots
cs.RO · 2024-12 · unverdicted · novelty 5.0

Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.

EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
cs.RO · 2025-11 · unverdicted · novelty 4.0

EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · cited by 19 Pith papers · 16 internal anchors

[1]

Generative Adversarial Imitation Learning

J. Ho and S. Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016. URL http://arxiv.org/abs/1606.03476

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Tsurumine and T

Y . Tsurumine and T. Matsubara. Goal-aware generative adversarial imitation learning from imperfect demonstration for robotic cloth manipulation, 2022

work page 2022
[4]

URL http://arxiv.org/abs/1705.10479

work page internal anchor Pith review Pith/arXiv arXiv
[5]

N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone, 2022

work page 2022
[6]

Pearce, T

T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V . Macua, S. Z. Tan, I. Momennejad, K. Hofmann, and S. Devlin. Imitating human behaviour with diffusion models, 2023

work page 2023
[7]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Reuss, M

M. Reuss, M. Li, X. Jia, and R. Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023

work page arXiv 2023
[9]

Mandlekar, F

A. Mandlekar, F. Ramos, B. Boots, L. Fei-Fei, A. Garg, and D. Fox. IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. CoRR, abs/1911.05321, 2019. URL http://arxiv.org/abs/1911.05321

work page arXiv 1911
[10]

Chernova and M

S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using gaussian mixture models. In Proceedings of the 6th International Joint Conference on Au- tonomous Agents and Multiagent Systems, AAMAS ’07, New York, NY , USA, 2007. Associa- tion for Computing Machinery. ISBN 9788190426275. doi:10.1145/1329125.1329407. URL https://doi.or...

work page doi:10.1145/1329125.1329407 2007
[11]

Florence, C

P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. CoRR, abs/2109.00137, 2021. URL https: //arxiv.org/abs/2109.00137

work page arXiv 2021
[12]

James, Z

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

work page 2020
[13]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

work page 2022
[14]

A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V . Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021

work page 2021
[15]

Huang, O

H. Huang, O. Howell, X. Zhu, D. Wang, R. Walters, and R. Platt. Fourier transporter: Bi- equivariant robotic manipulation in 3d. In ICLR, 2024

work page 2024
[16]

James, K

S. James, K. Wada, T. Laidlow, and A. J. Davison. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022

work page 2022
[17]

Shridhar, L

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023

work page 2023
[18]

Gervet, Z

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. CoRL, 2023

work page 2023
[19]

arXiv preprint arXiv:2306.14896 , year=

A. Goyal, J. Xu, Y . Guo, V . Blukis, Y .-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. arXiv preprint arXiv:2306.14896, 2023

work page arXiv 2023
[20]

P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations, 2018

work page 2018
[21]

J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Z. Xian, N. Gkanatsios, T. Gervet, T.-W. Ke, and K. Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning, pages 2323–2339. PMLR, 2023

work page 2023
[23]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024

work page 2024
[24]

Pomerleau

D. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Proceedings of (NeurIPS) Neural Information Processing Systems , pages 305 – 313. Morgan Kaufmann, December 1989

work page 1989
[25]

End to End Learning for Self-Driving Cars

M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Mon- fort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016. URL http://arxiv.org/abs/1604.07316. 10

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Y . Ding, C. Florensa, P. Abbeel, and M. Phielipp. Goal-conditioned imitation learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ c8d3a760ebab63156...

work page 2019
[28]

Guhur, S

P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning , pages 175–187. PMLR, 2023

work page 2023
[29]

Z. J. Cui, Y . Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data. ArXiv, abs/2210.10047, 2022

work page arXiv 2022
[30]

D.-N. Ta, E. Cousineau, H. Zhao, and S. Feng. Conditional energy-based models for implicit policies: The gap between theory and practice, 2022

work page 2022
[31]

Gkanatsios, A

N. Gkanatsios, A. Jain, Z. Xian, Y . Zhang, C. Atkeson, and K. Fragkiadaki. Energy-based models as zero-shot planners for compositional scene rearrangement. arXiv preprint arXiv:2304.14391, 2023

work page arXiv 2023
[32]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015

work page 2015
[34]

URL https://arxiv.org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2006
[35]

Singh, S

S. Singh, S. Tu, and V . Sindhwani. Revisiting energy based models as policies: Ranking noise contrastive estimation and interpolating energy models, 2023

work page 2023
[36]

Salimans and J

T. Salimans and J. Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop - ICLR 2021, 2021. URL https://openreview.net/forum?id= 9AS-TF2jRNb

work page 2021
[37]

H. Ryu, J. Kim, J. Chang, H. S. Ahn, J. Seo, T. Kim, J. Choi, and R. Horowitz. Diffusion-edfs: Bi-equivariant denoising generative modeling on se (3) for visual robotic manipulation. arXiv preprint arXiv:2309.02685, 2023

work page arXiv 2023
[38]

Urain, N

J. Urain, N. Funk, J. Peters, and G. Chalvatzaki. Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5923–5930. IEEE, 2023

work page 2023
[39]

Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

U. A. Mishra and Y . Chen. Reorientdiff: Diffusion model based reorientation for object manipulation. arXiv preprint arXiv:2303.12700, 2023

work page arXiv 2023
[41]

W. Liu, T. Hermans, S. Chernova, and C. Paxton. Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects. arXiv preprint arXiv:2211.04604, 2022

work page arXiv 2022
[42]

Simeonov, A

A. Simeonov, A. Goyal, L. Manuelli, L. Yen-Chen, A. Sarmiento, A. Rodriguez, P. Agrawal, and D. Fox. Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement. arXiv preprint arXiv:2307.04751, 2023

work page arXiv 2023
[43]

X. Fang, C. R. Garrett, C. Eppner, T. Lozano-Pérez, L. P. Kaelbling, and D. Fox. Dimsam: Diffusion models as samplers for task and motion planning under partial observability. arXiv preprint arXiv:2306.13196, 2023

work page arXiv 2023
[44]

Kapelyukh, V

I. Kapelyukh, V . V osylius, and E. Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. IEEE Robotics and Automation Letters, 2023. 11

work page 2023
[45]

Y . Dai, M. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023

work page arXiv 2023
[46]

A. Ajay, S. Han, Y . Du, S. Li, G. Abhi, T. Jaakkola, J. Tenenbaum, L. Kaelbling, A. Srivastava, and P. Agrawal. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023

work page arXiv 2023
[47]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero- shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

H. Chen, C. Lu, C. Ying, H. Su, and J. Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023

work page 2023
[49]

B. Yang, H. Su, N. Gkanatsios, T.-W. Ke, A. Jain, J. Schneider, and K. Fragkiadaki. Diffusion- es: Gradient-free planning with diffusion for autonomous driving and zero-shot instruction following. ArXiv, abs/2402.06559, 2024

work page arXiv 2024
[50]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023

[51]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

[52]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

[53]

S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

[54]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

[55]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, A. Raffin, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf, B. Ichter, C. Lu, C. Xu, C. Finn, C. Xu, C. Chi, C. Huang, C. Chan, C. Pan, C. Fu, C. Devin, D. Driess, D. Pathak, D. Shah, D. Büchler, D. Kalashnikov, D. Sadigh, E. Johns, F. Ceola, F....

[56]

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

[57]

H. Liu, L. Lee, K. Lee, and P. Abbeel. Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431, 2022

[58]

A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver: General perception with iterative attention, 2021

[59]

S. James and A. J. Davison. Q-attention: Enabling efficient learning for vision-based robotic manipulation. IEEE Robotics and Automation Letters, 7(2):1612–1619, 2022

[60]

H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023

[61]

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks, 2020

[62]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

[63]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[64]

Y. Li and T. Harada. Lepard: Learning partial point cloud matching in rigid and deformable scenes. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

[65]

N. Gkanatsios, M. K. Singh, Z. Fang, S. Tulsiani, and K. Fragkiadaki. Analogy-forming transformers for few-shot 3d parsing. arXiv preprint arXiv:2304.14382, 2023

[66]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[67]

E. Rohmer, S. P. Singh, and M. Freese. V-rep: A versatile and scalable robot simulation framework. In 2013 IEEE/RSJ international conference on intelligent robots and systems, pages 1321–1326. IEEE, 2013

[68]

J. J. Kuffner and S. M. LaValle. Rrt-connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pages 995–1001. IEEE, 2000

[69]

Y. Ze, G. Yan, Y.-H. Wu, A. Macaluso, Y. Ge, J. Ye, N. Hansen, L. E. Li, and X. Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. arXiv preprint arXiv:2308.16891, 2023

[70]

S. Chen, R. G. Pinel, C. Schmid, and I. Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation. arXiv preprint arXiv:2309.15596, 2023

[71]

E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016

[72]

O. Mees, L. Hermann, and W. Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters (RA-L), 7(4):11205–11212, 2022

[73]

C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020

[74]

X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

[75]

Reducing the Barrier to Entry of Complex Robotic Software: a MoveIt! Case Study

D. Coleman, I. Sucan, S. Chitta, and N. Correll. Reducing the barrier to entry of complex robotic software: a moveit! case study. arXiv preprint arXiv:1404.3785, 2014

[76]

G. Qian, Y. Li, H. Peng, J. Mai, H. A. A. K. Hammoud, M. Elhoseiny, and B. Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In NeurIPS, 2022

[77]

T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

[78]

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

Appendix

A Additional Experimental Results and Details
A.1 Robustness to noisy depth information on RLBench ...

[79]

Open a drawer: The cabinet has three drawers (top, middle and bottom). The agent is successful if the target drawer is opened. The task on average involves three keyposes.

Figure 3: Failure cases on RLBench on the setup of GNFactor. We categorize the failure cases into 4 types: 1) precise pose prediction, where predicted end-effector poses are too imp...

[80]

Slide a block to a colored zone: There is one block and four zones with different colors (red, blue, pink, and yellow). The end-effector must push the block to the zone with the specified color. On average, the task involves approximately 4.7 keyposes

[81]

Sweep the dust into a dustpan: There are two dustpans of different sizes (short and tall). The agent needs to sweep the dirt into the specified dustpan. The task on average involves 4.6 keyposes

[82]

Take the meat off the grill frame: There is a chicken leg or a steak. The agent needs to take the meat off the grill frame and put it on the side. The task involves 5 keyposes

[83]

Turn on the water tap: The water tap has a handle on each of its two sides. The agent needs to rotate the specified handle 90°. The task involves 2 keyposes


Showing first 80 references.