Multimodal Diffusion Forcing for Forceful Manipulation

Dmitry Berenson; Huaidian Hou; Zixuan Huang

arxiv: 2511.04812 · v2 · submitted 2025-11-06 · 💻 cs.RO · cs.AI · cs.LG

Zixuan Huang, Huaidian Hou, Dmitry Berenson

Pith reviewed 2026-05-18 00:29 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords: diffusion models, imitation learning, multimodal trajectories, forceful manipulation, contact-rich tasks, trajectory reconstruction, robotic policies

The pith

Multimodal Diffusion Forcing trains a diffusion model to reconstruct randomly masked multimodal robot trajectories, learning temporal and cross-modal dependencies for forceful manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Multimodal Diffusion Forcing as an alternative to standard imitation learning, which maps observations directly to actions. Instead, the method takes expert trajectories containing sensory inputs, actions, and rewards, applies random partial masking, and trains a diffusion model to reconstruct the full sequence. This reconstruction task pushes the model to discover how actions affect force signals and how partial observations relate to complete states over time. The resulting policies are evaluated on contact-rich manipulation tasks in simulation and on physical robots, where they demonstrate versatility, competitive success rates, and resilience when observations contain noise. A reader would care because the approach models the physical interplay between modalities that direct-mapping methods typically ignore.

Core claim

By applying random partial masking to multimodal trajectories and training a diffusion model to reconstruct them, the framework learns temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations, which supports effective policies for contact-rich forceful manipulation.

What carries the argument

Multimodal Diffusion Forcing: a diffusion model trained to reconstruct randomly partially masked trajectories that combine sensory inputs, actions, and rewards, thereby capturing interdependencies across time and modalities.
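
The masking-as-noise idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`make_alpha_bar`, `sample_noise_levels`, `noise_trajectory`), the linear beta schedule, the uniform sampling of noise levels, and the toy dimensions are all assumptions. Each (timestep, modality) token draws its own noise level; level 0 leaves the token clean, and the highest level replaces it almost entirely with Gaussian noise, which plays the role of a mask.

```python
import numpy as np

def make_alpha_bar(num_levels, beta_start=1e-4, beta_end=0.1):
    """Signal-retention factors for levels 0..num_levels (assumed linear beta schedule)."""
    betas = np.linspace(beta_start, beta_end, num_levels)
    alpha_bar = np.cumprod(1.0 - betas)
    return np.concatenate([[1.0], alpha_bar])  # level 0 keeps the token clean

def sample_noise_levels(T, M, num_levels, rng):
    """Draw one independent noise level per (timestep, modality) token."""
    return rng.integers(0, num_levels + 1, size=(T, M))

def noise_trajectory(x, levels, alpha_bar, rng):
    """x: (T, M, D) clean tokens. Returns the noised tokens and the noise that was added."""
    eps = rng.standard_normal(x.shape)
    a = alpha_bar[levels][..., None]  # (T, M, 1), broadcast over the feature axis
    return np.sqrt(a) * x + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
T, M, D, K = 10, 6, 8, 200           # toy sizes; M = 6 mirrors the six modalities in Figure 3
x = rng.standard_normal((T, M, D))   # stand-in for embedded trajectory tokens
alpha_bar = make_alpha_bar(K)
levels = sample_noise_levels(T, M, K, rng)
x_noisy, eps = noise_trajectory(x, levels, alpha_bar, rng)
```

A denoiser trained on `(x_noisy, levels)` pairs would have to use the lightly noised tokens to recover the heavily noised ones, which is where the cross-modal and temporal dependencies are expected to come from.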

If this is right

The model can predict how actions influence force signals as a direct result of the learned cross-modal dependencies.
States can be inferred from partial or noisy observations without explicit state estimation modules.
Policies remain effective under sensor noise in both simulated and physical contact-rich environments.
Functionality extends beyond action generation to include trajectory completion and effect prediction.
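
One way to picture how a single model could serve these roles is through the noise-level matrix it is queried with. The sketch below is an assumption-laden illustration rather than the paper's interface: the column order, the helper `query_matrix`, and the choice of which entries count as observed for each role are hypothetical. Entries set to 0 are treated as clean, observed tokens; entries set to `K` are treated as fully noised tokens that the model would have to generate.

```python
import numpy as np

# Modality names follow the Figure 3 caption; the column order is an assumption.
MODALITIES = ["partial_pc", "full_pc", "force", "action", "reward", "proprio"]
COL = {name: i for i, name in enumerate(MODALITIES)}

def query_matrix(T, K, history, observed_past, observed_future=()):
    """Build a (T, M) noise-level matrix: 0 = observed, K = to be generated."""
    levels = np.full((T, len(MODALITIES)), K, dtype=int)
    for name in observed_past:
        levels[:history, COL[name]] = 0
    for name in observed_future:
        levels[history:, COL[name]] = 0
    return levels

T, K, history = 10, 200, 4

# Policy: condition on past sensing and actions, generate the future (including actions).
policy = query_matrix(T, K, history, ["partial_pc", "force", "proprio", "action"])

# Dynamics model: also fix the future actions, generate the future observations.
dynamics = query_matrix(T, K, history,
                        ["partial_pc", "force", "proprio", "action"],
                        observed_future=["action"])

# State estimator: observe partial point clouds everywhere, generate full point clouds.
estimator = query_matrix(T, K, history, ["partial_pc"], observed_future=["partial_pc"])
```

Changing `history` shifts the boundary between conditioning and generation, which is in the spirit of the Figure 5 caption's remark that the history length can be adjusted at test time.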

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same masking objective might improve generalization when transferring policies from simulation to real robots by forcing the model to handle incomplete data.
Varying the masking ratio or modality-specific masking rates could be tested to optimize dependency capture for different task types.
The reconstruction approach could be combined with language instructions to handle tasks that require both physical and semantic reasoning.

Load-bearing premise

The assumption that random partial masking of multimodal trajectories will cause a diffusion model to automatically capture the temporal and cross-modal dependencies required for forceful manipulation policies.

What would settle it

A head-to-head comparison in which a standard imitation-learning baseline matches or exceeds MDF performance and noise robustness on the same real-world forceful manipulation tasks would undermine the benefit of the masking-and-reconstruction objective.

Figures

Figures reproduced from arXiv: 2511.04812 by Dmitry Berenson, Huaidian Hou, Zixuan Huang.

**Figure 1.** We propose Multimodal Diffusion Forcing, a unified model that captures the interplay between modalities over time through masked diffusion training. At inference time, the model not only offers flexibility by allowing different input modalities, adjustable horizon lengths and prediction horizons, but also provides diverse functionalities, serving as a policy, planner, dynamics model, state estimator, and anomaly dete…

**Figure 2.** Regular diffusion models employ a scalar noise level to control the denoising process. Diffusion Forcing [36] extends this idea with a time-varying noise vector to sample video sequences autoregressively. We further generalize this framework to the multimodal setting by introducing a time–modality varying noise matrix. This design enables versatile functionalities at test time such as policy, planner, dyna…

**Figure 3.** An overview of the key components and training process of MDF. Pretraining: MDF learns a diffusion-based autoencoder to compress point clouds into compact embeddings. Multimodal masked training: MDF processes six modalities: partial point cloud, full point cloud (training only), force, action, reward and proprioception (omitted in figure). The point clouds are tokenized using the pretrained PointNet encode…

**Figure 4.** Contact-rich manipulation tasks in IsaacSim. Peg Insert: The robot must insert a cylindrical peg into a tight-fitting hole. The clearance is small, requiring accurate alignment and controlled contact. 1) Dataset collection: Teleoperation for contact-rich manipulation tasks in simulation is challenging due to the lack of force feedback. We instead train a state-based RL policy using PPO to collect demonstr…

**Figure 5.** The history length of MDF can be adjusted dynamically at test time to accommodate task requirements.

**Figure 6.** We compare MDF with DP3 on two real-world forceful manipulation tasks. For each task, we compute the score over 20 trials (160 in total); the grading standards can be found in Section V-D. We found MDF to be more robust to noisy observations thanks to its noise-as-masking training scheme. Table II summarizes the results. ImDiffusion achieves reasonable accuracy in identifying anomalous timesteps but fails …

**Figure 7.** Future partial (orange) and full (blue) point clouds reconstructed by querying MDF as a dynamics model on Nut Thread.

Original abstract

Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website: https://unified-df.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MDF uses random partial masking on multimodal trajectories to train a diffusion model that reconstructs observations, actions, forces and rewards, which is a reasonable extension for contact-rich tasks but leaves open whether it reliably learns the sparse action-force links.

The main thing to know is that this paper trains a diffusion model on robot trajectories by randomly masking parts of the data and asking it to reconstruct the full sequence. Per the Figure 3 caption, the trajectories combine point clouds, force, actions, rewards, and proprioception, so the model is meant to pick up temporal and cross-modal patterns that standard imitation learning skips. The authors test it on forceful manipulation both in simulation and on a real robot, and report that it is more robust than DP3 when observations are noisy.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories. Rather than direct observation-to-action mapping, MDF applies random partial masking across modalities (sensory inputs, actions, forces/rewards) and trains a diffusion model to reconstruct the masked elements. This objective is claimed to capture temporal and cross-modal dependencies, supporting versatile functionalities beyond action generation. The authors evaluate MDF on contact-rich forceful manipulation tasks, reporting strong performance and robustness under noisy observations in both simulated and real-world environments.

Significance. If the empirical claims hold under detailed scrutiny, MDF offers a promising direction for imitation learning in robotics by explicitly modeling multimodal interplay, particularly force dynamics in contact-rich tasks. This could improve policy robustness where standard methods overlook intermittent force signals. The extension to reconstruction-based training on trajectories is a conceptual strength, though its advantage depends on validation against the sparsity issues in contact events.

major comments (2)

[§3] Masking and training objective: The central claim that random partial masking suffices to learn action-force causal mappings rests on the reconstruction objective alone. In contact-rich tasks, force signals are intermittent and high-magnitude only during brief intervals. Uniform random masking therefore has low probability of jointly masking an action and its immediate force consequence. The manuscript does not describe contact-window biasing, importance sampling, or an auxiliary force-prediction term. This directly affects whether the learned joint distribution encodes the dependencies needed for the reported robustness under noisy observations.
[§4] Experiments: The performance and robustness claims on forceful manipulation tasks are load-bearing for the contribution. Without an ablation that varies masking strategy (uniform vs. contact-aware) or reports per-contact success rates and force-prediction error, it is difficult to confirm that the reconstruction objective, rather than other factors, drives the gains over baselines.
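
The sparsity concern in the first major comment can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes binary masks drawn independently per token with rate `p`, and a hypothetical fraction `contact_frac` of timesteps in contact. The paper describes per-token noise levels rather than binary masks, so these numbers say nothing about MDF itself; they only show the shape of the referee's argument.

```python
def joint_mask_prob(p):
    """P(action at t and force at t+1 are both masked) under independent masking."""
    return p * p

def expected_contact_pairs(T, contact_frac, p):
    """Expected number of jointly masked action-force pairs inside contact windows."""
    pairs = (T - 1) * contact_frac  # consecutive (t, t+1) pairs assumed to lie in contact
    return pairs * joint_mask_prob(p)

# Sweep a few masking rates for a 10-step sequence with 20% of steps in contact.
for p in (0.25, 0.5, 0.75):
    n = expected_contact_pairs(T=10, contact_frac=0.2, p=p)
    print(f"p={p:.2f}  joint={joint_mask_prob(p):.4f}  expected pairs per sequence={n:.3f}")
```

Because the same sequence is re-masked many times during training, a small per-sample expectation can still add up; whether it adds up to enough coverage is the empirical question the referee raises.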

minor comments (2)

[Abstract and §4] Clarify the exact set of modalities used in the real-world experiments (e.g., whether reward signals are present or if the abstract reference to rewards is aspirational).
[Figures in §4] Figure captions and axis labels in the results section should explicitly state the noise levels and contact metrics used for the robustness evaluation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications on the masking strategy and committing to additional experimental analyses in the revision to further validate our claims.

Referee: [§3] Masking and training objective: The central claim that random partial masking suffices to learn action-force causal mappings rests on the reconstruction objective alone. In contact-rich tasks, force signals are intermittent and high-magnitude only during brief intervals. Uniform random masking therefore has low probability of jointly masking an action and its immediate force consequence. The manuscript does not describe contact-window biasing, importance sampling, or an auxiliary force-prediction term. This directly affects whether the learned joint distribution encodes the dependencies needed for the reported robustness under noisy observations.

Authors: We thank the referee for highlighting this important consideration about intermittent force signals. While masking is uniform and random, each trajectory is subjected to multiple independent masking patterns during training, and the diffusion model is required to reconstruct the entire multimodal sequence conditioned on the unmasked elements. This process statistically exposes the model to a wide range of action-force co-occurrences across the dataset, enabling it to learn the underlying joint distribution and cross-modal dependencies without explicit biasing. We have revised Section 3 to include a dedicated paragraph explaining this coverage and the sufficiency of the reconstruction objective for capturing causal mappings in contact-rich settings.
revision: partial

Referee: [§4] Experiments: The performance and robustness claims on forceful manipulation tasks are load-bearing for the contribution. Without an ablation that varies masking strategy (uniform vs. contact-aware) or reports per-contact success rates and force-prediction error, it is difficult to confirm that the reconstruction objective, rather than other factors, drives the gains over baselines.

Authors: We agree that targeted ablations would strengthen the empirical support for our claims. In the revised manuscript we add a new ablation subsection in §4 that directly compares uniform random masking against a contact-aware variant (increased masking probability within detected contact windows). We also report per-contact success rates and average force-prediction error for MDF and all baselines. These results indicate that uniform masking already yields the reported robustness gains, with contact-aware masking providing only marginal further improvement, thereby confirming the reconstruction objective as the primary driver.
revision: yes
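
The contact-aware variant mentioned in this response could look roughly like the following. Everything here is hypothetical: the force-magnitude threshold used to detect contact, the function names, and the two masking rates are assumptions made for illustration, and the source does not specify how contact windows are detected.

```python
import numpy as np

def contact_windows(force, threshold):
    """Boolean (T,) array: True where the force magnitude exceeds an assumed threshold."""
    return np.linalg.norm(force, axis=-1) > threshold

def contact_aware_mask(force, M, p_base, p_contact, threshold, rng):
    """Sample a (T, M) binary mask with a higher masking rate inside contact windows."""
    in_contact = contact_windows(force, threshold)         # (T,)
    p = np.where(in_contact, p_contact, p_base)[:, None]   # (T, 1) per-timestep rate
    return rng.random((force.shape[0], M)) < p             # rate broadcast over modalities

rng = np.random.default_rng(0)
T, M = 10, 6
force = np.zeros((T, 3))
force[6:9] = [0.0, 0.0, 12.0]   # synthetic contact event on timesteps 6..8
mask = contact_aware_mask(force, M, p_base=0.3, p_contact=0.8, threshold=5.0, rng=rng)
```

Setting `p_contact` equal to `p_base` recovers uniform masking, so the two arms of the proposed ablation differ by a single parameter.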

Circularity Check

0 steps flagged

No significant circularity: MDF defined via independent masking objective with no reduction to fitted inputs or self-citations

The paper defines Multimodal Diffusion Forcing directly as the application of random partial masking to multimodal trajectories followed by diffusion-based reconstruction training. This objective is motivated as a means to capture temporal and cross-modal dependencies without any quoted equations or claims that reduce the learned dependencies or performance claims back to previously fitted parameters, self-referential definitions, or load-bearing self-citations. The abstract presents the approach as an extension of standard diffusion techniques to robot trajectories, with evaluation on forceful manipulation tasks treated as empirical validation rather than a derived necessity. No steps in the provided derivation chain exhibit the enumerated circularity patterns; the central claim remains self-contained against external benchmarks of diffusion modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the approach relies on standard diffusion model assumptions and the unstated premise that masking will induce useful cross-modal learning.

axioms (1)

[standard math] Diffusion models can be trained to reconstruct partially masked sequences by learning underlying data distributions.
Implicit in the description of the training objective.
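
Written out, this premise amounts to a denoising objective with a separate noise level per token. The formulation below is a sketch under stated assumptions: it uses the standard noise-prediction loss of denoising diffusion probabilistic models [37] and writes the time-modality noise-level matrix as $\mathbf{K} \in \{0,\dots,K\}^{T \times M}$ with entries $k_{t,m}$ and $\bar\alpha_0 = 1$. The paper's exact parameterization and loss weighting are not visible in this view.

```latex
\begin{aligned}
x_{t,m}^{(k_{t,m})} &= \sqrt{\bar\alpha_{k_{t,m}}}\; x_{t,m}
  + \sqrt{1 - \bar\alpha_{k_{t,m}}}\; \epsilon_{t,m},
  \qquad \epsilon_{t,m} \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta) &= \mathbb{E}_{x,\,\mathbf{K},\,\epsilon}
  \left[ \sum_{t=1}^{T} \sum_{m=1}^{M}
  \left\| \epsilon_{t,m} - \epsilon_\theta\!\left(x^{(\mathbf{K})}, \mathbf{K}\right)_{t,m} \right\|^2 \right].
\end{aligned}
```

Tokens with $k_{t,m} = 0$ are passed through clean and act as conditioning, while tokens with $k_{t,m} = K$ carry almost no signal and act as masked entries.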

pith-pipeline@v0.9.0 · 5709 in / 1099 out tokens · 31756 ms · 2026-05-18T00:29:01.201831+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory... 2D Time-Modality Noise Level Matrix K ∈ {0,...,K}^{T×M}"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "sequence length is set to 10... full-sequence denoising with 200 steps"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding
cs.RO 2026-03 unverdicted novelty 6.0

Contact-Grounded Policy predicts coupled robot-state and tactile trajectories with a diffusion model and maps them via a learned consistency function to executable targets for compliance controllers, outperforming sta...

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] M. Du, O. Y. Lee, S. Nair, and C. Finn, “Play it by ear: Learning skills amidst occlusion through audio-visual imitation learning,” arXiv preprint arXiv:2205.14850, 2022.
[2] Y. Chen, A. Sipos, M. Van der Merwe, and N. Fazeli, “Visuo-tactile transformers for manipulation,” CoRL, 2022.
[3] Z. Liu, C. Chi, E. Cousineau, N. Kuppuswamy, B. Burchfiel, and S. Song, “Maniwav: Learning robot manipulation from in-the-wild audio-visual data,” in 8th Annual Conference on Robot Learning, 2024.
[4] T. Lin, Y. Zhang, Q. Li, H. Qi, B. Yi, S. Levine, and J. Malik, “Learning visuotactile skills with two multifingered hands,” arXiv preprint arXiv:2404.16823, 2024.
[5] A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,” arXiv preprint arXiv:2107.04034, 2021.
[6] I. Akinola, J. Xu, J. Carius, D. Fox, and Y. Narang, “Tacsl: A library for visuotactile sensor simulation and learning,” TRO, 2025.
[7] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023.
[8] L. Chen, S. Bahl, and D. Pathak, “Playfusion: Skill acquisition via diffusion from language-annotated play,” in Conference on Robot Learning. PMLR, 2023, pp. 2012–2029.
[9] H. Xue, J. Ren, W. Chen, G. Zhang, Y. Fang, G. Gu, H. Xu, and C. Lu, “Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation,” arXiv preprint arXiv:2503.02881, 2025.
[10] Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,” in ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024.
[11] Z. Ding, A. Zhang, Y. Tian, and Q. Zheng, “Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning,” arXiv preprint arXiv:2402.03570, 2024.
[12] M. Rigter, J. Yamada, and I. Posner, “World models via policy-guided trajectory diffusion,” arXiv preprint arXiv:2312.08533, 2023.
[13] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, “Is conditional generative modeling all you need for decision-making?” arXiv preprint arXiv:2211.15657, 2022.
[14] Z. Huang, Y. Lin, F. Yang, and D. Berenson, “Subgoal diffuser: Coarse-to-fine subgoal generation to guide model predictive control for robot manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 16489–16495.
[15] Z. Huang, Y. He, Y. Lin, and D. Berenson, “Implicit contact diffuser: Sequential contact reasoning with latent point cloud diffusion,” arXiv preprint arXiv:2410.16571, 2024.
[16] C.-C. Hsu, B. Wen, J. Xu, Y. Narang, X. Wang, Y. Zhu, J. Biswas, and S. Birchfield, “Spot: Se (3) pose trajectory diffusion for object-centric manipulation,” arXiv preprint arXiv:2411.00965, 2024.
[17] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” ICML, 2022.
[18] S. Jiang, S. Ancha, T. Manderson, L. Brandt, Y. Du, P. R. Osteen, and N. Roy, “Anomalies-by-synthesis: Anomaly detection using generative diffusion models for off-road navigation.”
[19] C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta, “Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets,” arXiv preprint arXiv:2504.02792, 2025.
[20] S. Li, Y. Gao, D. Sadigh, and S. Song, “Unified video action model,” arXiv preprint arXiv:2503.00200, 2025.
[21] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3m: A universal visual representation for robot manipulation,” arXiv preprint arXiv:2203.12601, 2022.
[22] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control, 2024.”
[23] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” ICML, 2022.
[24] H. Li, Y. Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei-Fei, R. Gao, and J. Wu, “See, hear, and feel: Smart sensory fusion for robotic manipulation,” CoRL, 2022.
[25] M. Noseworthy, B. Tang, B. Wen, A. Handa, C. Kessens, N. Roy, D. Fox, F. Ramos, Y. Narang, and I. Akinola, “Forge: Force-guided exploration for robust contact-rich manipulation under uncertainty,” IEEE Robotics and Automation Letters, 2025.
[26] J. H. Kang, S. Joshi, R. Huang, and S. K. Gupta, “Robotic compliant object prying using diffusion policy guided by vision and force observations,” IEEE Robotics and Automation Letters, 2025.
[27] Y. Wu, Z. Chen, F. Wu, L. Chen, L. Zhang, Z. Bing, A. Swikir, S. Haddadin, and A. Knoll, “Tacdiffusion: Force-domain diffusion policy for precise tactile manipulation,” arXiv preprint arXiv:2409.11047, 2024.
[28] S. Cui, R. Wang, J. Wei, J. Hu, and S. Wang, “Self-attention based visual-tactile fusion learning for predicting grasp outcomes,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5827–5834, 2020.
[29] J. A. Collins, C. Houff, Y. L. Tan, and C. C. Kemp, “Forcesight: Text-guided mobile manipulation with visual-force goals,” in ICRA, 2024.
[30] Y. Guo, Y. Hu, J. Zhang, Y.-J. Wang, X. Chen, C. Lu, and J. Chen, “Prediction with action: Visual policy learning via joint denoising process,” Advances in Neural Information Processing Systems, vol. 37, pp. 112386–112410, 2024.
[31] L. Heng, H. Geng, K. Zhang, P. Abbeel, and J. Malik, “Vitacformer: Learning cross-modal representation for visuo-tactile dexterous manipulation,” arXiv preprint arXiv:2506.15953, 2025.
[32] P. Wu, A. Majumdar, K. Stone, Y. Lin, I. Mordatch, P. Abbeel, and A. Rajeswaran, “Masked trajectory models for prediction, representation, and control,” in ICML. PMLR, 2023, pp. 37607–37623.
[33] M. Carroll, O. Paradise, J. Lin, R. Georgescu, M. Sun, D. Bignell, S. Milani, K. Hofmann, M. Hausknecht, A. Dragan et al., “Uni [mask]: Unified inference in sequential decision problems,” Advances in neural information processing systems, vol. 35, pp. 35365–35378, 2022.
[34] I. Radosavovic, B. Zhang, B. Shi, J. Rajasegaran, S. Kamat, T. Darrell, K. Sreenath, and J. Malik, “Humanoid locomotion as next token prediction,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[35] F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,” Advances in neural information processing systems, vol. 35, pp. 12608–12618, 2022.
[36] B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,” Advances in Neural Information Processing Systems, vol. 37, pp. 24081–24125, 2024.
[37] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[38] J. D. M.-W. C. Kenton, L. K. Toutanova et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, vol. 1, no. 2. Minneapolis, Minnesota, 2019.
[39] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
[40] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” Advances in Neural Information Processing Systems, vol. 35, pp. 8633–8646, 2022.
[41] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” NeurIPS, vol. 30, 2017.
[42] S. Luo and W. Hu, “Diffusion probabilistic models for 3d point cloud generation,” in CVPR, 2021, pp. 2837–2845.
[43] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion models,” Advances in neural information processing systems, vol. 34, pp. 21696–21707, 2021.
[44] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[45] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International conference on machine learning. PMLR, 2021, pp. 8162–8171.
[46] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[47] Y. Narang, K. Storey, I. Akinola, M. Macklin, P. Reist, L. Wawrzyniak, Y. Guo, A. Moravanszky, G. State, M. Lu et al., “Factory: Fast contact for robotic assembly,” arXiv preprint arXiv:2205.03532, 2022.
[48] Z. Chen, Q. Yan, Y. Chen, T. Wu, J. Zhang, Z. Ding, J. Li, Y. Yang, and H. Dong, “Clutterdexgrasp: A sim-to-real system for general dexterous grasping in cluttered scenes,” arXiv preprint arXiv:2506.14317, 2025.
[49] Y. Chen, C. Zhang, M. Ma, Y. Liu, R. Ding, B. Li, S. He, S. Rajmohan, Q. Lin, and D. Zhang, “Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection,” arXiv preprint arXiv:2307.00754, 2023.


work page arXiv 2025

[10] [10]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d rep- resentations,

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d rep- resentations,” inICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024

work page 2024

[11] [11]

arXiv preprint arXiv:2402.03570 , year=

Z. Ding, A. Zhang, Y . Tian, and Q. Zheng, “Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning,”arXiv preprint arXiv:2402.03570, 2024

work page arXiv 2024

[12] [12]

Rigter, J

M. Rigter, J. Yamada, and I. Posner, “World models via policy-guided trajectory diffusion,”arXiv preprint arXiv:2312.08533, 2023

work page arXiv 2023

[13] [13]

Is Conditional Generative Modeling all you need for Decision-Making?

A. Ajay, Y . Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, “Is conditional generative modeling all you need for decision- making?”arXiv preprint arXiv:2211.15657, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Subgoal diffuser: Coarse-to-fine subgoal generation to guide model predictive control for robot manipulation,

Z. Huang, Y . Lin, F. Yang, and D. Berenson, “Subgoal diffuser: Coarse-to-fine subgoal generation to guide model predictive control for robot manipulation,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 16 489–16 495

work page 2024

[15] [15]

Implicit contact diffuser: Sequential contact reasoning with latent point cloud diffusion,

Z. Huang, Y . He, Y . Lin, and D. Berenson, “Implicit contact diffuser: Sequential contact reasoning with latent point cloud diffusion,”arXiv preprint arXiv:2410.16571, 2024

work page arXiv 2024

[16] [16]

Spot: Se (3) pose trajectory diffusion for object-centric manipulation,

C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield, “Spot: Se (3) pose trajectory diffusion for object-centric manipulation,”arXiv preprint arXiv:2411.00965, 2024

work page arXiv 2024

[17] [17]

Planning with diffusion for flexible behavior synthesis,

M. Janner, Y . Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,”ICML, 2022

work page 2022

[18] [18]

Anomalies-by-synthesis: Anomaly detection using generative diffusion models for off-road navigation

S. Jiang, S. Ancha, T. Manderson, L. Brandt, Y . Du, P. R. Osteen, and N. Roy, “Anomalies-by-synthesis: Anomaly detection using generative diffusion models for off-road navigation.”

work page

[19] [19]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta, “Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets,”arXiv preprint arXiv:2504.02792, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Unified Video Action Model

S. Li, Y . Gao, D. Sadigh, and S. Song, “Unified video action model,” arXiv preprint arXiv:2503.00200, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

R3M: A Universal Visual Representation for Robot Manipulation

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta, “R3m: A universal visual representation for robot manipulation,”arXiv preprint arXiv:2203.12601, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

π0: A vision-language-action flow model for general robot control, 2024

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “π0: A vision-language-action flow model for general robot control, 2024.”

work page 2024

[23] [23]

Vima: General robot manipulation with multimodal prompts,

Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei- Fei, A. Anandkumar, Y . Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,”ICML, 2022

work page 2022

[24] [24]

See, hear, and feel: Smart sensory fusion for robotic manipulation,

H. Li, Y . Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei-Fei, R. Gao, and J. Wu, “See, hear, and feel: Smart sensory fusion for robotic manipulation,”CoRL, 2022

work page 2022

[25] [25]

Forge: Force-guided exploration for robust contact-rich manipulation under uncertainty,

M. Noseworthy, B. Tang, B. Wen, A. Handa, C. Kessens, N. Roy, D. Fox, F. Ramos, Y . Narang, and I. Akinola, “Forge: Force-guided exploration for robust contact-rich manipulation under uncertainty,” IEEE Robotics and Automation Letters, 2025

work page 2025

[26] [26]

Robotic compliant object prying using diffusion policy guided by vision and force observations,

J. H. Kang, S. Joshi, R. Huang, and S. K. Gupta, “Robotic compliant object prying using diffusion policy guided by vision and force observations,”IEEE Robotics and Automation Letters, 2025

work page 2025

[27] [27]

Tacdiffusion: Force-domain diffusion policy for precise tactile manipulation,

Y . Wu, Z. Chen, F. Wu, L. Chen, L. Zhang, Z. Bing, A. Swikir, S. Had- dadin, and A. Knoll, “Tacdiffusion: Force-domain diffusion policy for precise tactile manipulation,”arXiv preprint arXiv:2409.11047, 2024

work page arXiv 2024

[28] [28]

Self-attention based visual-tactile fusion learning for predicting grasp outcomes,

S. Cui, R. Wang, J. Wei, J. Hu, and S. Wang, “Self-attention based visual-tactile fusion learning for predicting grasp outcomes,”IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5827–5834, 2020

work page 2020

[29] [29]

Forcesight: Text-guided mobile manipulation with visual-force goals,

J. A. Collins, C. Houff, Y . L. Tan, and C. C. Kemp, “Forcesight: Text-guided mobile manipulation with visual-force goals,” inICRA, 2024

work page 2024

[30] [30]

Prediction with action: Visual policy learning via joint denoising process,

Y . Guo, Y . Hu, J. Zhang, Y .-J. Wang, X. Chen, C. Lu, and J. Chen, “Prediction with action: Visual policy learning via joint denoising process,”Advances in Neural Information Processing Systems, vol. 37, pp. 112 386–112 410, 2024

work page 2024

[31] [31]

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

L. Heng, H. Geng, K. Zhang, P. Abbeel, and J. Malik, “Vitacformer: Learning cross-modal representation for visuo-tactile dexterous ma- nipulation,”arXiv preprint arXiv:2506.15953, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Masked trajectory models for prediction, representa- tion, and control,

P. Wu, A. Majumdar, K. Stone, Y . Lin, I. Mordatch, P. Abbeel, and A. Rajeswaran, “Masked trajectory models for prediction, representa- tion, and control,” inICML. PMLR, 2023, pp. 37 607–37 623

work page 2023

[33] [33]

Uni [mask]: Unified inference in sequential decision problems,

M. Carroll, O. Paradise, J. Lin, R. Georgescu, M. Sun, D. Bignell, S. Milani, K. Hofmann, M. Hausknecht, A. Draganet al., “Uni [mask]: Unified inference in sequential decision problems,”Advances in neural information processing systems, vol. 35, pp. 35 365–35 378, 2022

work page 2022

[34] [34]

Humanoid locomotion as next token prediction,

I. Radosavovic, B. Zhang, B. Shi, J. Rajasegaran, S. Kamat, T. Darrell, K. Sreenath, and J. Malik, “Humanoid locomotion as next token prediction,” inThe Thirty-eighth Annual Conference on Neural In- formation Processing Systems, 2024

work page 2024

[35] [35]

Masked autoencoding for scalable and generalizable decision making,

F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,”Advances in neural information processing systems, vol. 35, pp. 12 608–12 618, 2022

work page 2022

[36] [36]

Diffusion forcing: Next-token prediction meets full-sequence diffusion,

B. Chen, D. Mart ´ı Mons ´o, Y . Du, M. Simchowitz, R. Tedrake, and V . Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,”Advances in Neural Information Processing Systems, vol. 37, pp. 24 081–24 125, 2024

work page 2024

[37] [37]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

work page 2020

[38] [38]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. D. M.-W. C. Kenton, L. K. Toutanovaet al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of naacL-HLT, vol. 1, no. 2. Minneapolis, Minnesota, 2019

work page 2019

[39] [39]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

work page 2022

[40] [40]

Video diffusion models,

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,”Advances in Neural Information Processing Systems, vol. 35, pp. 8633–8646, 2022

work page 2022

[41] [41]

Neural discrete representation learning,

A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”NeurIPS, vol. 30, 2017

work page 2017

[42] [42]

Diffusion probabilistic models for 3d point cloud generation,

S. Luo and W. Hu, “Diffusion probabilistic models for 3d point cloud generation,” inCVPR, 2021, pp. 2837–2845

work page 2021

[43] [43]

Variational diffusion models,

D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion models,”Advances in neural information processing systems, vol. 34, pp. 21 696–21 707, 2021

work page 2021

[44] [44]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,”arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[45] [45]

Improved denoising diffusion prob- abilistic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion prob- abilistic models,” inInternational conference on machine learning. PMLR, 2021, pp. 8162–8171

work page 2021

[46] [46]

Denoising Diffusion Implicit Models

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,”arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[47] [47]

Narang, K

Y . Narang, K. Storey, I. Akinola, M. Macklin, P. Reist, L. Wawrzyniak, Y . Guo, A. Moravanszky, G. State, M. Luet al., “Factory: Fast contact for robotic assembly,”arXiv preprint arXiv:2205.03532, 2022

work page arXiv 2022

[48] [48]

Clutterdexgrasp: A sim-to-real system for general dexterous grasping in cluttered scenes,

Z. Chen, Q. Yan, Y . Chen, T. Wu, J. Zhang, Z. Ding, J. Li, Y . Yang, and H. Dong, “Clutterdexgrasp: A sim-to-real system for general dexterous grasping in cluttered scenes,”arXiv preprint arXiv:2506.14317, 2025

work page arXiv 2025

[49] [49]

Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection,

Y . Chen, C. Zhang, M. Ma, Y . Liu, R. Ding, B. Li, S. He, S. Ra- jmohan, Q. Lin, and D. Zhang, “Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection,”arXiv preprint arXiv:2307.00754, 2023

work page arXiv 2023