pith. machine review for the scientific record.

arxiv: 2604.11793 · v1 · submitted 2026-04-13 · 💻 cs.RO

Recognition: unknown

Disentangled Point Diffusion for Precise Object Placement

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3

classification 💻 cs.RO
keywords object placement · point cloud diffusion · robotic manipulation · disentangled diffusion · dense GMM · precise insertion · generalization
0 comments

The pith

A disentangled point diffusion framework separates global scene priors from local object geometry and frame diffusion to achieve more precise robotic placement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve robotic object placement by moving beyond end-to-end policies that struggle with novel objects and high precision. It introduces TAX-DPD, a hierarchical approach that first applies a feed-forward dense GMM to generate a spatially dense prior over global placements, then uses a point cloud diffusion module that diffuses object geometry and placement frame as separate streams. This separation is meant to support better local geometric accuracy, multiple possible placements, and adaptation to changes in object shape or scene layout. The authors test the method on simulated and real industrial insertion tasks plus a cloth-hanging example to show its range.
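For a concrete picture of the first stage, here is a minimal NumPy sketch in which a hypothetical dense_gmm_head stub stands in for the paper's feed-forward network, with placeholder weights and scales; it illustrates how a spatially dense GMM (one component anchored at each scene point) can propose a rough global placement, and it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_gmm_head(scene_points):
    """Hypothetical stand-in for the feed-forward Dense GMM network.
    It anchors one Gaussian component at every scene point and would normally
    predict per-point mixture weights and variances; here both are constants
    so the sketch runs without a trained model."""
    n = len(scene_points)
    weights = np.full(n, 1.0 / n)     # placeholder: uniform mixture weights
    sigmas = np.full(n, 0.02)         # placeholder: 2 cm isotropic std-dev
    return weights, scene_points, sigmas

def sample_global_placement(scene_points, gmm_head=dense_gmm_head):
    """Stage 1: draw a rough global placement position from the dense GMM prior."""
    weights, means, sigmas = gmm_head(scene_points)
    k = rng.choice(len(weights), p=weights / weights.sum())
    return rng.normal(means[k], sigmas[k])   # jitter around the chosen scene point

scene = rng.uniform(-0.5, 0.5, size=(1024, 3))   # placeholder scene point cloud
print(sample_global_placement(scene))            # prints a rough 3-D placement proposal
```

Because every component is tied to a scene point, the prior stays spatially dense and multi-modal even when several distinct placement regions exist in the scene.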

Core claim

TAX-DPD models global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model that yields a spatially dense prior over global placements, then models the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame. This enables precise local geometric reasoning and achieves substantially higher accuracy than prior SE(3)-diffusion approaches even for rigid objects, while extending to non-rigid objects as shown in cloth tasks.

What carries the argument

The disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, supported by a feed-forward Dense GMM for global scene-level priors.
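To make the "separate streams" idea concrete, the sketch below carries the object geometry and the placement frame as two distinct noise variables through a generic DDPM-style reverse loop. The denoiser stub, the 3-D-translation-plus-6-D-rotation frame encoding, and the linear noise schedule are assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

def denoiser(noisy_geom, noisy_frame, scene, t):
    """Hypothetical noise-prediction network. The real module would condition on
    the scene and object point clouds; this stub predicts zero noise so the
    sketch runs end to end."""
    return np.zeros_like(noisy_geom), np.zeros_like(noisy_frame)

def refine_local_configuration(scene, init_pos, n_obj_points=512, steps=50):
    """Stage 2: disentangled reverse diffusion over two separate streams:
      geom  -- the object's goal point cloud geometry (n_obj_points x 3)
      frame -- the placement frame, encoded here as 3-D translation + 6-D rotation
    Both start from noise (the frame centered on the stage-1 global sample) and
    are denoised jointly under a shared, simplified DDPM schedule."""
    betas = np.linspace(1e-4, 0.02, steps)
    alphas_bar = np.cumprod(1.0 - betas)

    geom = rng.normal(size=(n_obj_points, 3))
    frame = np.concatenate([init_pos, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]])
    frame = np.sqrt(alphas_bar[-1]) * frame + np.sqrt(1 - alphas_bar[-1]) * rng.normal(size=9)

    for t in reversed(range(steps)):
        eps_geom, eps_frame = denoiser(geom, frame, scene, t)
        coef = betas[t] / np.sqrt(1.0 - alphas_bar[t])
        geom = (geom - coef * eps_geom) / np.sqrt(1.0 - betas[t])
        frame = (frame - coef * eps_frame) / np.sqrt(1.0 - betas[t])
        if t > 0:  # standard DDPM sampling adds noise on all but the final step
            geom += np.sqrt(betas[t]) * rng.normal(size=geom.shape)
            frame += np.sqrt(betas[t]) * rng.normal(size=frame.shape)
    return geom, frame

scene = rng.uniform(-0.5, 0.5, size=(1024, 3))
goal_geom, goal_frame = refine_local_configuration(scene, init_pos=np.array([0.1, 0.0, 0.2]))
```

In the paper's pipeline the predicted frame is ultimately turned into an SE(3) goal pose and executed, for example by a goal-conditioned placement policy (see the Figure 6 caption below).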

If this is right

  • Achieves state-of-the-art performance in placement precision on high-precision industrial insertion tasks.
  • Delivers improved multi-modal coverage of possible placement options.
  • Generalizes to variations in object geometries and scene configurations.
  • Extends applicability to non-rigid objects as demonstrated on simulated cloth-hanging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of geometry and frame diffusion could be tested in other point-cloud-based robotic planning problems beyond placement.
  • The framework might integrate with perception pipelines that already output point clouds to reduce end-to-end training data needs.
  • Further experiments could check whether the dense GMM prior remains stable when scene objects move during execution.
  • The approach opens a path to hybrid systems that combine the diffusion output with local optimization for even tighter tolerances.

Load-bearing premise

That a feed-forward dense GMM supplies an effective global prior and that separately diffusing object geometry and placement frame in point clouds will produce substantially more precise local geometric reasoning than unified SE(3)-diffusion methods.

What would settle it

A controlled side-by-side test on the same suite of novel object geometries in which the disentangled method achieves neither a higher insertion success rate nor a lower placement error than an SE(3)-diffusion baseline would falsify the claimed precision gain.
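As a sketch of how such a side-by-side comparison could be scored, the snippet below applies a one-sided two-proportion z-test to per-trial insertion outcomes for the two methods on a shared suite of novel geometries. The trial counts and success numbers are placeholders, not results reported anywhere in the paper.

```python
from math import erf, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided test of whether method A's success rate exceeds method B's."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # upper-tail p-value
    return z, p_value

# Placeholder counts: 40 trials per method on the same novel-object suite.
z, p = two_proportion_z_test(successes_a=36, n_a=40, successes_b=28, n_b=40)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
# A p-value well above 0.05 here, together with no reduction in placement error,
# would count against the claimed precision gain.
```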

Figures

Figures reproduced from arXiv: 2604.11793 by David Held, Eric Cai, Jianjun Wang, Lyuxing He, Shobhit Aggarwal.

Figure 1. Our method (TAX-DPD) uses disentangled point diffusion to predict precise goal configurations for a millimeter-precision industrial insertion task. Blue denotes the scene point cloud, turquoise denotes the manipulated object point cloud, and red denotes the diffused goal point cloud, where we jointly diffuse the object placement frame and geometry.
Figure 2. Method Overview. (Left) Our Global Placement Initialization samples a rough global position using a novel dense GMM-based prediction module, a framework that models highly multi-modal placement distributions at the scene level. (Right) Our Local Configuration Refinement then proceeds with a novel disentangled shape and reference frame diffusion that simultaneously allows precise and dense goal predictions.
Figure 3. RPDiff Task Environments. (Top) Our experiments span various multi-modal placement tasks with significant object and scene variation. (Middle) TAX-DPD is able to precisely model goal configurations as point clouds. (Bottom) Successful executions of our model's goal predictions.
Figure 4. Coverage vs. Precision. We further evaluate TAX-DPD and the baselines on coverage and precision with increasing numbers of inference samples on the RPDiff task Book/Shelf.
Figure 5. Insertion task rollouts. Selected TAX-DPD rollouts. Top: pre-insertion. Bottom: post-insertion.
Figure 6. DEDO Task Environment. (Top) A visualization of the variation in configuration in the HangProcCloth-DH task. (Middle) Our method generalizes well to diverse cloth geometries. (Bottom) Successful placements with a goal-conditioned placement policy.
Figure 7. Data collection process for the vision-based insertion
Figure 8. The vision-based insertion requires two cameras, one
original abstract

Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose TAX-DPD, a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. We model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy than a prior approach based on SE(3)-diffusion, even in the context of rigid object placement. We validate our approach across a suite of challenging tasks in simulation and in the real-world on high-precision industrial insertion tasks. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our framework can further relax assumptions on object rigidity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TAX-DPD, a hierarchical disentangled point diffusion framework for precise object placement in robotics. It models global placements with a feed-forward Dense Gaussian Mixture Model and local configurations with a disentangled point cloud diffusion module that separately diffuses object geometry and the placement frame. The authors claim this achieves state-of-the-art results in placement precision, multi-modal coverage, and generalization to novel object geometries and scene configurations, with validation on simulation and real-world high-precision insertion tasks as well as a cloth-hanging task.

Significance. If the empirical claims hold after proper validation, the work would advance object-centric robotic manipulation by providing a modular diffusion approach that improves precision and generalization over end-to-end policies, with particular relevance for high-precision insertions and non-rigid objects.

major comments (2)
  1. [Abstract] The claim that 'our point cloud diffusion achieves substantially higher accuracy than a prior approach based on SE(3)-diffusion, even in the context of rigid object placement' is central to the novelty but lacks any reported ablation isolating the disentanglement of geometry and placement-frame diffusion from confounding factors such as the Dense GMM prior, network capacity, or training schedule.
  2. [§5 Experimental Results] The abstract asserts SOTA performance across precision, coverage, and generalization but supplies no quantitative metrics, baselines, error bars, or experimental details, so the support for the central empirical claim cannot be assessed from the provided information.
minor comments (1)
  1. [Abstract] The acronym TAX-DPD is introduced in the abstract without expansion or definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We have revised the manuscript to address the concerns about isolating the contribution of disentanglement and providing clearer quantitative experimental support. Point-by-point responses follow.

point-by-point responses
  1. Referee: [Abstract] The claim that 'our point cloud diffusion achieves substantially higher accuracy than a prior approach based on SE(3)-diffusion, even in the context of rigid object placement' is central to the novelty but lacks any reported ablation isolating the disentanglement of geometry and placement-frame diffusion from confounding factors such as the Dense GMM prior, network capacity, or training schedule.

    Authors: We agree that the abstract claim requires stronger isolation of the disentanglement effect. In the revised manuscript, we have added an ablation study (new Section 5.4) that compares the full disentangled point cloud diffusion against an SE(3)-diffusion baseline while holding the Dense GMM prior, network capacity, and training schedule fixed. The results show the accuracy improvement persists and is attributable to the separate diffusion of geometry and placement frame. We have also updated the abstract to reference this ablation. revision: yes

  2. Referee: [§5 Experimental Results] The abstract asserts SOTA performance across precision, coverage, and generalization but supplies no quantitative metrics, baselines, error bars, or experimental details, so the support for the central empirical claim cannot be assessed from the provided information.

    Authors: We apologize that the experimental details were not more prominent in the reviewed version. Section 5 of the manuscript does contain quantitative results, including Tables 1–3 with precision, coverage, and generalization metrics, multiple baselines (end-to-end policies and SE(3)-diffusion), and error bars from repeated runs, along with task descriptions in Section 5.1. To ensure the claims are fully assessable, we have expanded Section 5 with additional implementation details for baselines, explicit definitions of all metrics, statistical significance tests, and a consolidated summary table of SOTA comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity; the empirical claims rest on a novel architecture validated against external baselines.

full rationale

The paper introduces TAX-DPD as a hierarchical framework with a feed-forward Dense GMM for global scene placements and a disentangled point-cloud diffusion module (separately diffusing geometry and placement frame) for local reasoning. It reports empirical superiority over prior SE(3)-diffusion methods on precision, multi-modal coverage, and generalization tasks, including rigid-object and cloth-hanging scenarios. No equations, parameters, or results are shown to reduce by construction to the method's own definitions or fitted inputs. The comparison to SE(3)-diffusion is presented as an external benchmark result rather than a self-referential necessity. No load-bearing self-citations or ansatz smuggling appear in the provided text. The derivation chain is therefore self-contained against independent experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of two newly introduced modeling components whose performance is asserted in the abstract rather than established against prior literature or external benchmarks.

invented entities (2)
  • Dense Gaussian Mixture Model (GMM) · no independent evidence
    purpose: To produce a spatially dense prior over global scene-level placements
    Described as a novel feed-forward component in the abstract
  • disentangled point cloud diffusion module · no independent evidence
    purpose: To separately diffuse object geometry and placement frame for local configuration
    Presented as a novel module enabling precise local geometric reasoning

pith-pipeline@v0.9.0 · 5550 in / 1172 out tokens · 54028 ms · 2026-05-10T15:55:32.323000+00:00 · methodology

discussion (0)

