3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
Pith reviewed 2026-05-17 21:56 UTC · model grok-4.3
The pith
A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a denoising transformer operating on 3D scene tokens fused with language and proprioception can accurately predict noise in 3D robot pose trajectories and thereby produce policies that generalize across viewpoints better than 2D or non-diffusion alternatives, yielding the stated performance improvements on RLBench and CALVIN.
What carries the argument
A 3D denoising transformer that receives tokenized 3D scene embeddings from depth images together with language instructions and proprioception to output the noise estimate for noised 3D robot pose trajectories.
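The training signal described above can be sketched in a few lines. This is a minimal illustration of the standard noise-prediction objective under stated assumptions, not the paper's implementation: the flattened position-only trajectory, the `alpha_bar` value, and the oracle `predict_noise` stand-in are all hypothetical. In the paper the predictor is the 3D denoising transformer conditioned on scene tokens, language, and proprioception.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_pred, eps_true):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / len(eps_true)

def predict_noise(x_t, x0, alpha_bar):
    """Stand-in predictor (assumption): an oracle that inverts the forward step."""
    return [(xt - math.sqrt(alpha_bar) * x) / math.sqrt(1.0 - alpha_bar)
            for xt, x in zip(x_t, x0)]

rng = random.Random(0)
# Hypothetical flattened trajectory: two 3D end-effector positions in metres.
x0 = [0.30, 0.10, 0.25, 0.32, 0.12, 0.20]
alpha_bar = 0.5  # assumed cumulative signal level at the sampled timestep
x_t, eps = add_noise(x0, alpha_bar, rng)

loss_oracle = noise_prediction_loss(predict_noise(x_t, x0, alpha_bar), eps)
loss_zero = noise_prediction_loss([0.0] * len(eps), eps)  # predictor that outputs zeros
```

A perfect predictor drives the loss to zero, while a predictor that ignores its input does not; a trained network sits somewhere in between.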
If this is right
- The diffusion objective outperforms regression and classification objectives for action prediction.
- Tokenized 3D scene embeddings outperform holistic non-tokenized 3D embeddings, and relative attention outperforms absolute attention.
- The same architecture transfers from simulation benchmarks to real-robot control with only a handful of demonstrations.
- Multi-view 3D inputs produce larger gains than single-view inputs on the evaluated tasks.
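The attention comparison in the list above can be made concrete with a rotary-style position encoding, one common way to make attention logits depend only on relative position. The sketch below is a 1D, two-channel toy; whether it matches the paper's attention layer is an assumption, and `rotate` and `score` are hypothetical helper names.

```python
import math

def rotate(vec, position, freq=1.0):
    """Rotate a 2-channel feature by an angle proportional to its position."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def score(query, q_pos, key, k_pos):
    """Dot-product attention logit between a rotated query and a rotated key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]

query, key = (0.8, -0.3), (0.5, 0.9)
# Same offset between query and key, with the whole scene shifted by 10 units.
logit_here = score(query, 1.0, key, 3.0)
logit_shifted = score(query, 11.0, key, 13.0)
# A different offset between query and key changes the logit.
logit_other_offset = score(query, 1.0, key, 4.0)
```

Because both vectors are rotated by their own position, the logit depends on the difference of positions only, so translating the scene leaves it unchanged. An absolute encoding that adds a position vector to each token does not have this property.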
Where Pith is reading between the lines
- If the 3D features stay reliable under distribution shift, the approach could reduce the need for extensive viewpoint-specific data collection in new environments.
- The denoising formulation may allow the policy to represent multimodal action distributions more naturally than deterministic regressors, which could matter for tasks with multiple valid solutions.
- Combining the 3D scene tokens with other sensor modalities such as tactile feedback could be a direct next step without changing the transformer backbone.
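The multimodality point in the list above has a simple toy illustration. The demonstrations below are invented for the example (passing an obstacle on the left or the right); the sketch shows only that a squared-error regressor is pulled to the mean of the modes, which may be an action no demonstrator ever took.

```python
import statistics

# Hypothetical demonstrations: pass an obstacle on the left (-1) or the right (+1).
demo_actions = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]

def mse(prediction, targets):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The constant prediction that minimises squared error is the mean of the targets.
regressed_action = statistics.mean(demo_actions)
distance_to_nearest_demo = min(abs(regressed_action - a) for a in demo_actions)

error_at_mean = mse(regressed_action, demo_actions)
error_at_right_mode = mse(1.0, demo_actions)
```

The mean has lower squared error than either mode, yet it lies a full unit away from every demonstrated action. A model that samples from the action distribution can instead return one mode or the other.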
Load-bearing premise
The 3D scene features extracted from depth images remain accurate and viewpoint-invariant even when camera placement or lighting differs from the training distribution.
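The premise above rests on lifting pixels to 3D using sensed depth. The following sketch uses a standard pinhole camera model to show why world-frame coordinates do not depend on which camera observed a point, provided depth and calibration are accurate. The intrinsics, camera poses, and the `project` and `unproject` helpers are hypothetical and are not taken from the paper.

```python
def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def project(point_world, rot, trans, fx, fy, cx, cy):
    """World point -> (u, v, depth) for a camera with camera-to-world pose (rot, trans)."""
    rel = [p - t for p, t in zip(point_world, trans)]
    x, y, z = mat_vec(transpose(rot), rel)
    return fx * x / z + cx, fy * y / z + cy, z

def unproject(u, v, depth, rot, trans, fx, fy, cx, cy):
    """Pixel plus sensed depth -> world point."""
    p_cam = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    return [a + b for a, b in zip(mat_vec(rot, p_cam), trans)]

# Assumed intrinsics and two assumed camera poses (identity, and a 90-degree turn about y).
intr = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_y90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
cam_a = (identity, [0.0, 0.0, 0.0])
cam_b = (rot_y90, [-1.0, 0.0, 1.5])

point = [0.2, -0.1, 1.5]
pix_a = project(point, *cam_a, **intr)
pix_b = project(point, *cam_b, **intr)
rec_a = unproject(*pix_a, *cam_a, **intr)
rec_b = unproject(*pix_b, *cam_b, **intr)
```

The two cameras see the point at different pixels, but both back-projections land on the same world coordinates. Errors in depth or extrinsics break this equality, which is the failure mode the premise is exposed to.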
What would settle it
Measure success rate on the same RLBench tasks but with cameras moved to new positions or under changed lighting conditions not present in training; a large drop relative to the reported numbers would falsify the generalization benefit of the 3D representation.
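One way to decide whether a drop under shifted cameras is "large" is a two-proportion z-test on episode counts. The sketch below is a generic statistical recipe, not something the paper reports, and the counts are placeholders invented for illustration.

```python
import math

def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Return (rate difference, z statistic, two-sided p-value) for two success rates."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided, normal approximation
    return p_a - p_b, z, p_value

# Placeholder counts: training-distribution cameras versus moved cameras.
drop, z, p_value = two_proportion_z(80, 100, 55, 100)
# Identical counts should show no evidence of a difference.
_, z_same, p_same = two_proportion_z(80, 100, 80, 100)
```

With these placeholder counts the drop is 25 percentage points and the test rejects equality at conventional thresholds. The normal approximation is only reasonable when each condition has enough successes and failures.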
Original abstract
Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene, a language instruction and proprioception to predict the noise in noised 3D robot pose trajectories. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase. It also learns to control a robot manipulator in the real world from a handful of demonstrations. Through thorough comparisons with the current SOTA policies and ablations of our model, we show 3D Diffuser Actor's design choices dramatically outperform 2D representations, regression and classification objectives, absolute attentions, and holistic non-tokenized 3D scene embeddings.
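The abstract mixes absolute gains (RLBench) with a relative gain (CALVIN), and the two are not interchangeable. The arithmetic below separates them; the baseline score of 60 is a hypothetical number chosen for illustration and does not come from the paper.

```python
def absolute_gain(new, old):
    """Difference between two scores, in the units of the score."""
    return new - old

def relative_gain(new, old):
    """Improvement expressed as a fraction of the baseline score."""
    return (new - old) / old

# Hypothetical baseline score; not a number from the paper.
baseline = 60.0
after_absolute = baseline + 18.1          # an 18.1-point absolute gain
after_relative = baseline * (1.0 + 0.09)  # a 9% relative gain

same_change_as_relative = relative_gain(after_absolute, baseline)
```

On this assumed baseline, a 9% relative gain moves the score to 65.4, whereas an 18.1-point absolute gain corresponds to a relative gain of about 30%. The comparison depends on the baseline, so the two headline figures cannot be ranked without it.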
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces 3D Diffuser Actor, a conditional diffusion policy for robot manipulation that employs a novel 3D denoising transformer to fuse 3D scene features (aggregated from depth images), language instructions, and proprioception for denoising noised 3D robot pose trajectories. It reports new state-of-the-art results on RLBench (absolute gains of 18.1% multi-view and 13.1% single-view over prior SOTA) and a 9% relative improvement on CALVIN, plus real-robot control from few demonstrations, with ablations showing advantages over 2D representations, regression/classification objectives, and holistic 3D embeddings.
Significance. If the reported gains prove robust under identical evaluation protocols, the work meaningfully advances diffusion-based policies by integrating explicit 3D scene representations, which prior results suggest improve viewpoint generalization. The inclusion of real-world validation and systematic ablations against 2D, regression, and non-tokenized 3D baselines strengthens the contribution; these elements provide concrete evidence that the 3D denoising transformer design is load-bearing for the observed performance.
major comments (3)
- [Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.
- [Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.
- [§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.
minor comments (2)
- [Figures] Figure captions and §4.3: clarify whether the visualized 3D tokens are per-point or per-voxel and how the denoising transformer attends across them.
- [Related Work] Related work: ensure all recent 3D representation papers for manipulation (beyond the cited diffusion and 3D works) are referenced for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating planned revisions where appropriate to improve clarity and rigor.
-
Referee: [Experiments] Experiments section (RLBench and CALVIN results): the headline absolute gains of 18.1% (multi-view) and 13.1% (single-view) on RLBench rest on direct numerical comparison to prior SOTA; the manuscript must explicitly state whether all baselines were re-implemented and re-evaluated by the authors under identical task sets, demonstration counts, camera configurations, simulator versions, action discretization, and success metrics, or whether numbers were taken from original papers.
Authors: We thank the referee for this important clarification. The baseline numbers reported in our manuscript are taken directly from the original papers, following common practice in the field to ensure consistency with published protocols. Our method was evaluated using the exact task sets, demonstration counts, camera setups, and success metrics described in those works. We will add an explicit statement in the Experiments section and a clarifying footnote to the results tables in the revised manuscript. revision: yes
-
Referee: [Results] Results tables and abstract: no error bars, standard deviations, or number of evaluation seeds/runs are reported for the stochastic diffusion policy, nor is any statistical significance test provided; this omission makes it impossible to determine whether the reported gains exceed run-to-run variability.
Authors: We agree that reporting variability is essential for stochastic policies such as ours. While our primary results used a fixed random seed for reproducibility, we will update all tables to include standard deviations computed over 5 independent evaluation seeds and add a brief discussion of statistical significance in the revised Results section and abstract. revision: yes
-
Referee: [§3] §3 (model description): the 3D scene feature aggregation from depth images is central to the viewpoint-invariance claim, yet no quantitative analysis or ablation tests robustness when camera placement, lighting, or depth noise distributions differ from those seen in training data.
Authors: The viewpoint-invariance benefit is evidenced by our single-view versus multi-view comparisons and the consistent outperformance over 2D baselines, which already test generalization across camera configurations. We acknowledge that dedicated quantitative ablations on lighting variations and depth noise distributions were not included. We will add a targeted discussion in §3 and a supporting experiment in the appendix of the revised manuscript. revision: partial
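The variability reporting promised in the second response could look like the sketch below. The per-seed success rates and the baseline figure are placeholders invented for illustration, and `summarize` is a hypothetical helper. The point is only to show what "standard deviation over 5 seeds" adds to a single headline number.

```python
import math
import statistics

def summarize(success_rates):
    """Return mean, sample standard deviation, and standard error of the mean."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(success_rates))
    return mean, std, sem

# Placeholder per-seed success rates (%), invented for illustration only.
per_seed = [81.0, 79.5, 82.0, 80.5, 81.5]
mean, std, sem = summarize(per_seed)

# Placeholder for a single baseline number copied from a published table.
published_baseline = 63.0
gap_in_std_units = (mean - published_baseline) / std
```

When the baseline is a single published number, its own run-to-run spread is unknown, so the gap can be expressed only relative to the new method's variability. That limitation is tied to the first referee comment about copied versus re-evaluated baselines.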
Circularity Check
Empirical benchmark gains with minor self-citation context but no load-bearing circularity
The paper proposes a 3D Diffuser Actor architecture that fuses 3D scene features, language, and proprioception via a denoising transformer to model action distributions. All headline claims consist of measured success rates on RLBench and CALVIN benchmarks rather than any closed-form prediction or first-principles derivation. Prior diffusion-policy and 3D-representation papers are cited for motivation and architectural inspiration, yet those citations supply independent empirical precedents and do not substitute for the new model's training or evaluation protocol. No equation or result is shown to be definitionally equivalent to its own inputs, and the reported absolute/relative gains are obtained by direct comparison against re-implemented or published baselines under the same task definitions.
Axiom & Free-Parameter Ledger
free parameters (2)
- diffusion noise schedule
- number of denoising steps
axioms (1)
- domain assumption: 3D scene features extracted from depth images are sufficiently accurate and generalizable across viewpoints
invented entities (1)
- 3D denoising transformer: no independent evidence
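The two free parameters listed in this ledger, the noise schedule and the number of denoising steps, appear in a DDPM-style sampler as shown in the sketch below. The linear schedule, its endpoints, the scalar state, and the oracle noise predictor are assumptions made for illustration; the paper's schedule and step count are not reproduced here.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule with num_steps entries."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def sample(x0_target, num_steps, rng):
    """Ancestral sampling with an oracle noise predictor (assumption)."""
    betas = linear_betas(num_steps)
    alpha_bars = cumulative_alphas(betas)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(num_steps)):
        ab = alpha_bars[t]
        # Oracle stand-in for the learned network: the noise implied by x0_target.
        eps_hat = (x - math.sqrt(ab) * x0_target) / math.sqrt(1.0 - ab)
        mean = (x - betas[t] / math.sqrt(1.0 - ab) * eps_hat) / math.sqrt(1.0 - betas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * noise
    return x

short_run = sample(0.3, 10, random.Random(0))
long_run = sample(0.3, 100, random.Random(0))
alpha_bars_100 = cumulative_alphas(linear_betas(100))
```

With an oracle predictor both step counts recover the target, so the example isolates where the schedule and step count enter the update. With a learned predictor, fewer steps generally trade accuracy for speed, which is why both count as free parameters.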
Lean theorems connected to this paper
- Foundation/DimensionForcing · dimension_forced · echoes?
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints.
- Foundation/DimensionForcing · D3_has_linking · echoes?
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.
-
SID: Sliding into Distribution for Robust Few-Demonstration Manipulation
SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
-
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
-
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.
-
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation
VLBiMan framework enables generalizable bimanual manipulation from single human demonstrations via vision-language anchored task decomposition and adaptation without retraining.
-
What Matters in Building Vision-Language-Action Models for Generalist Robots
Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.
-
EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.