arXiv: 2509.23468 · v3 · submitted 2025-09-27 · 💻 cs.RO · cs.AI · cs.LG

Multi-Modal Manipulation via Multi-Modal Policy Consensus

Pith reviewed 2026-05-18 11:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords multi-modal manipulation · diffusion models · consensus router · sensor fusion · robotic policy · RLBench · real-world manipulation

The pith

A robotic manipulation policy factorizes into modality-specific diffusion models combined via a learned consensus router.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve how robots combine different sensory inputs such as vision and touch for manipulation tasks. Typical feature concatenation lets dominant signals overwhelm important but sparse ones, and makes adding new sensors difficult. By training separate diffusion models for each modality and using a router network to learn how to weight their outputs, the method allows adaptive combination and easy addition of modalities. Evaluation on simulated RLBench tasks and real-world experiments like picking occluded objects, reorienting a spoon in hand, and inserting puzzles shows better results than baselines, especially when multimodal reasoning is needed. The approach also handles perturbations and sensor issues, with analysis confirming shifts in modality reliance.

Core claim

Our method factorizes the policy into a set of diffusion models, each specialized for a single representation, and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental addition of new representations. Evaluations on simulated manipulation tasks in RLBench as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion demonstrate significant outperformance over feature-concatenation baselines on scenarios requiring multimodal reasoning. The policy is also robust to physical perturbations and sensor corruption, and a perturbation-based importance analysis reveals adaptive shifts between modalities.

What carries the argument

the router network that learns consensus weights to adaptively combine contributions from modality-specific diffusion models
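
As a rough illustration of that mechanism, the sketch below combines per-modality noise predictions with router weights at a single denoising step. It is not the authors' implementation: the `Denoiser` signature, the `compose_noise` helper, and the softmax normalization of the router output are assumptions made for this example, and the router is reduced to a dictionary of precomputed logits.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
# Hypothetical denoiser signature: (embedding, noisy_action) -> predicted noise.
Denoiser = Callable[[Sequence[float], Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax so the weights are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_noise(
    denoisers: Dict[str, Denoiser],
    embeddings: Dict[str, Sequence[float]],
    router_logits: Dict[str, float],
    noisy_action: Sequence[float],
) -> Vector:
    """Weighted sum of per-modality noise predictions (assumed composition rule)."""
    names = sorted(denoisers)
    weights = softmax([router_logits[name] for name in names])
    combined = [0.0] * len(noisy_action)
    for name, weight in zip(names, weights):
        eps = denoisers[name](embeddings[name], noisy_action)
        for i, value in enumerate(eps):
            combined[i] += weight * value
    return combined


if __name__ == "__main__":
    # Toy stand-ins for trained modality-specific diffusion models.
    toy_denoisers = {
        "rgb": lambda emb, act: [1.0, 1.0],
        "tactile": lambda emb, act: [3.0, 3.0],
    }
    toy_embeddings = {"rgb": [0.0], "tactile": [0.0]}
    equal_logits = {"rgb": 0.0, "tactile": 0.0}
    print(compose_noise(toy_denoisers, toy_embeddings, equal_logits, [0.0, 0.0]))
    # -> [2.0, 2.0]
```

With equal logits the two predictions are simply averaged; raising the tactile logit moves the combined prediction toward the tactile model's output, which is the kind of shift the router is described as learning.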

If this is right

  • The policy supports incremental incorporation of new sensory representations without retraining the entire system.
  • Performance gains occur on contact-rich tasks where vision might otherwise dominate touch signals.
  • Robustness is maintained under physical perturbations and sensor corruption.
  • Importance analysis confirms adaptive weighting shifts between modalities during execution.
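
One generic way to run a perturbation-based importance analysis is sketched below: perturb one modality's input at a time and measure how far the policy output moves. The page does not give the paper's exact procedure, so the Gaussian perturbation, the `modality_importance` helper, and the normalization to shares that sum to one are all assumptions for illustration.

```python
import math
import random
from typing import Callable, Dict, List

Observation = Dict[str, List[float]]
# Hypothetical policy signature: observation dictionary -> action vector.
Policy = Callable[[Observation], List[float]]


def modality_importance(
    policy: Policy,
    observation: Observation,
    noise_std: float = 0.1,
    trials: int = 32,
    seed: int = 0,
) -> Dict[str, float]:
    """Share of output change attributable to perturbing each modality."""
    rng = random.Random(seed)
    baseline = policy(observation)
    raw: Dict[str, float] = {}
    for name in observation:
        total = 0.0
        for _ in range(trials):
            # Copy the observation and add Gaussian noise to one modality only.
            perturbed = {key: list(values) for key, values in observation.items()}
            perturbed[name] = [x + rng.gauss(0.0, noise_std) for x in observation[name]]
            total += math.dist(policy(perturbed), baseline)
        raw[name] = total / trials
    norm = sum(raw.values())
    return {name: (value / norm if norm > 0 else 0.0) for name, value in raw.items()}


if __name__ == "__main__":
    # Toy policy that leans mostly on the tactile input.
    def toy_policy(obs: Observation) -> List[float]:
        return [0.1 * sum(obs["rgb"]) + 0.9 * sum(obs["tactile"])]

    print(modality_importance(toy_policy, {"rgb": [0.2, 0.4], "tactile": [0.1, 0.3]}))
```

Evaluating such a score at successive timesteps of a rollout would give a per-phase importance curve of the kind the Figure 1(a) caption describes.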

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This factorization may enable automatic downweighting of faulty sensors without separate fault detection mechanisms.
  • The consensus router could extend naturally to additional input types such as audio or force sensing in assembly scenarios.
  • Scaling tests with four or more modalities would clarify whether consensus learning stays stable as the set grows.

Load-bearing premise

The router network successfully learns consensus weights that generalize across tasks and perturbations to produce the reported performance gains.

What would settle it

If new experiments on tasks with heavy visual occlusion and critical tactile requirements show no significant outperformance over feature-concatenation baselines, the value of the adaptive consensus weighting would be falsified.

Figures

Figures reproduced from arXiv: 2509.23468 by Binghao Huang, Chaoqi Liu, Haonan Chen, Hongyu Chen, Jiaming Xu, Jiayuan Mao, Kaiwen Hong, Katherine Driggs-Campbell, Yilun Du, Yunzhu Li.

Figure 1
Representation-Composable Policy. (a) Perturbation-based importance analysis in the occluded marker picking task shows that vision dominates early, while tactile signals become important once occluded, demonstrating that our framework dynamically utilizes different modalities across task phases. (b) Classical feature concatenation vs. our policy-level composition, where $m_i$ denotes a modality (e.g., RGB, po…
Figure 2
Overview of Our Compositional Policy Framework. Raw sensory modalities ($m_{\mathrm{rgb}}$, $m_{\mathrm{tac}}$) are encoded into embeddings ($e_{\mathrm{rgb}}$, $e_{\mathrm{tac}}$). Each modality is factorized into complementary sub-policies (e.g., $\epsilon_{\theta_{\mathrm{rgb,context}}}(e_{\mathrm{rgb}}, a)$, $\epsilon_{\theta_{\mathrm{rgb,local}}}(e_{\mathrm{rgb}}, a)$, $\epsilon_{\theta_{\mathrm{tac,coarse}}}(e_{\mathrm{tac}}, a)$, $\epsilon_{\theta_{\mathrm{tac,fine}}}(e_{\mathrm{tac}}, a)$), which produce score predictions that are averaged into a modality-specific score. A router network $R_\psi(e_{\mathrm{rgb}}, \dots, e_{\mathrm{tac}})$ t…
Figure 3
Real-World Experimental Setup. (a) UR5e manipulator equipped with dual cameras and tactile sensors. (b–d) Overlays of initial conditions for the evaluation tasks: occluded marker picking, spoon reorientation, and puzzle insertion.
… corresponds to summing their score functions. This leads to a two-step aggregation process at each denoising step $k$: Intra-Modality Composition. The composed score for modality $i$…
Figure 4
Qualitative Policy Rollouts. Representative execution traces from three tasks: Task 1 occluded marker picking, where tactile feedback guides manipulation when vision is unavailable; Task 2 spoon reorientation, demonstrating dexterous in-hand manipulation; Task 3 puzzle insertion, requiring high-precision alignment at millimeter accuracy.
Figure 5
Typical Failure Cases of Baseline Methods. We show failure cases of an RGB-only policy compared with an RGB+Tactile concatenation baseline. Each task highlights the complementary roles of the two modalities: vision provides global spatial and geometric information, while tactile sensing provides contact awareness and fine-grained grasp feedback. (a) In occluded marker picking, the concatenation baseline be…
Figure 6
Policy Robustness under Diverse Perturbations. We evaluate three types of interventions: (a) runtime perturbation, where the marker is suddenly snatched away during execution; (b) sensor corruption, where a camera is occluded to simulate partial sensor failure; and (c) object repositioning, where task-relevant objects are reset and moved to new positions between executions. Our method maintains reliable pe…
Figure 7
Incremental Learning. (a) An RGB-only policy fails to grasp the marker without tactile feedback. (b) By composing a pre-trained RGB policy with a tactile policy using manually set consensus weights (0.5, 0.5), the combined policy successfully grasps the marker under occlusion without requiring retraining.
… by occluding one camera with an opaque card, eliminating visual input (Figure 6b). This corruption is …
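
The two-step aggregation that the Figure 2 and Figure 3 captions describe can be written out roughly as follows. This is a reconstruction from the truncated captions rather than the paper's own equations: the sub-policy count $J_i$, the modality count $M$, and the requirement that the weights be non-negative and sum to one are assumptions introduced here.

```latex
% Step 1 (intra-modality): average the J_i sub-policy predictions for modality i.
\hat{\epsilon}_i(e_i, a) = \frac{1}{J_i} \sum_{j=1}^{J_i} \epsilon_{\theta_{i,j}}(e_i, a)

% Step 2 (inter-modality): combine modality scores with router weights.
w = R_\psi(e_1, \dots, e_M), \qquad
\hat{\epsilon}(a) = \sum_{i=1}^{M} w_i \, \hat{\epsilon}_i(e_i, a),
\qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i = 1 \quad \text{(assumed)}
```

Under this reading, the manually composed policy in Figure 7(b) corresponds to bypassing the router and fixing $w = (0.5, 0.5)$ for the RGB and tactile models.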
Original abstract

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We further conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a multi-modal policy for robotic manipulation by factorizing it into a set of diffusion models, each specialized for a single representation such as vision or touch. A router network learns consensus weights to adaptively combine the contributions from these models. This design is said to enable the incremental addition of new representations without retraining the entire policy, unlike feature concatenation approaches. The method is evaluated on simulated manipulation tasks in RLBench and real-world tasks including occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it is reported to significantly outperform feature-concatenation baselines on scenarios requiring multimodal reasoning. Additional claims include robustness to physical perturbations and sensor corruption, supported by perturbation-based importance analysis revealing adaptive shifts between modalities.

Significance. Should the quantitative results and the incremental modality addition property be validated through appropriate experiments, this approach could provide a valuable framework for handling diverse sensory inputs in robotic policies. It addresses common issues with modality dominance in contact-rich tasks and offers potential for more flexible and scalable multi-modal systems in robotics.

major comments (2)
  1. [Abstract] The abstract asserts that the approach 'significantly outperforms feature-concatenation baselines' on RLBench and real tasks but supplies no quantitative metrics, error bars, ablation studies, or experimental protocol details, leaving the central claim without visible supporting evidence.
  2. [Abstract] The premise that the router enables incremental addition of new modalities without retraining the whole policy is central to the contribution but lacks any supporting experiment; the evaluations use fixed modality sets with no ablation or protocol for adding a new diffusion model post-training and updating only the router.
minor comments (1)
  1. [Abstract] There is a grammatical issue in 'enabling incremental of new representations'; it should read 'enabling incremental addition of new representations'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating where we agree revisions are warranted and outlining the changes we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts that the approach 'significantly outperforms feature-concatenation baselines' on RLBench and real tasks but supplies no quantitative metrics, error bars, ablation studies, or experimental protocol details, leaving the central claim without visible supporting evidence.

    Authors: We agree that the abstract would be strengthened by including key quantitative indicators to support the performance claim. The full manuscript already contains detailed results with success rates, standard deviations, error bars, and ablation studies in Sections 4 and 5. In the revised version we will update the abstract to report representative metrics (e.g., mean success rates and standard deviations on RLBench and the real-world tasks) while keeping the abstract concise. revision: yes

  2. Referee: [Abstract] The premise that the router enables incremental addition of new modalities without retraining the whole policy is central to the contribution but lacks any supporting experiment; the evaluations use fixed modality sets with no ablation or protocol for adding a new diffusion model post-training and updating only the router.

    Authors: The architecture factorizes the policy into independent modality-specific diffusion models whose outputs are combined by a router; this design permits a new diffusion model to be trained separately and integrated by updating only the router weights. While the current experiments evaluate fixed modality combinations to demonstrate overall performance, we acknowledge that an explicit incremental-addition experiment is absent. We will add such an experiment in the revision, training an initial policy on a subset of modalities and then introducing an additional modality while retraining only the router, to directly validate the claimed property. revision: yes
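
The router-only update promised in the second response could look roughly like the sketch below, in which the expert predictions are treated as fixed inputs and only the router parameters receive gradient steps. It is a simplification under stated assumptions, not the authors' protocol: the router is reduced to one learnable logit per expert instead of a network conditioned on embeddings, the loss is assumed to be a noise-prediction mean squared error, and `fit_router_weights` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

# One sample: (frozen expert noise predictions, target noise).
Sample = Tuple[List[List[float]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_router_weights(
    samples: Sequence[Sample], num_experts: int, lr: float = 0.5, steps: int = 200
) -> List[float]:
    """Gradient descent on router logits only; expert outputs stay fixed."""
    logits = [0.0] * num_experts
    for _ in range(steps):
        weights = softmax(logits)
        grad_w = [0.0] * num_experts
        for expert_preds, target in samples:
            dim = len(target)
            combined = [
                sum(w * pred[d] for w, pred in zip(weights, expert_preds))
                for d in range(dim)
            ]
            scale = 2.0 / (dim * len(samples))
            for i, pred in enumerate(expert_preds):
                # d(MSE)/d(w_i): residual projected onto expert i's prediction.
                grad_w[i] += scale * sum(
                    (combined[d] - target[d]) * pred[d] for d in range(dim)
                )
        # Chain rule through softmax: dL/dz_j = w_j * (g_j - sum_i w_i * g_i).
        mean_grad = sum(w * g for w, g in zip(weights, grad_w))
        for j in range(num_experts):
            logits[j] -= lr * weights[j] * (grad_w[j] - mean_grad)
    return softmax(logits)


if __name__ == "__main__":
    # Expert 0 matches the target noise exactly; expert 1 (newly added) does not.
    data: List[Sample] = [
        ([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]),
        ([[0.5, 0.5], [-0.5, 0.2]], [0.5, 0.5]),
    ]
    print(fit_router_weights(data, num_experts=2))
```

In this toy setting the learned weight concentrates on the expert whose predictions match the target, while neither expert is modified.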

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with no load-bearing derivations

full rationale

The paper presents an empirical method that factorizes policies into per-modality diffusion models combined via a learned router network. No equations, derivations, or first-principles results are described that reduce any claim to fitted parameters or self-referential definitions by construction. Evaluations on RLBench and real tasks are framed as direct performance comparisons against baselines, with no self-citation chains or uniqueness theorems invoked to force the architecture. The approach is self-contained as a proposed design tested on held-out scenarios.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies insufficient detail to enumerate concrete free parameters, axioms, or invented entities; the router network and diffusion models are described at high level without explicit fitting procedures or background assumptions listed.

pith-pipeline@v0.9.0 · 5722 in / 1203 out tokens · 41457 ms · 2026-05-18T11:47:05.326204+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance

    cs.RO 2026-01 unverdicted novelty 7.0

    TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.

  2. Flexible Multitask Learning with Factorized Diffusion Policy

    cs.RO 2025-12 unverdicted novelty 6.0

    A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.

  3. FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    FlexiTac is a scalable piezoresistive tactile sensing system with flexible FPC-Velostat-FPC pads and a 100 Hz multi-channel readout board that mounts on rigid or soft grippers and supports visuo-tactile learning.

  4. Learning Versatile Humanoid Manipulation with Touch Dreaming

    cs.RO 2026-04 conditional novelty 5.0

    HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
