Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

I-Ming Chen; Weicheng Huang; Yue Feng

arxiv: 2605.20433 · v1 · pith:OMH5JTZFnew · submitted 2026-05-19 · 💻 cs.RO

Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

Yue Feng , Weicheng Huang , I-Ming Chen This is my paper

Pith reviewed 2026-05-21 06:55 UTC · model grok-4.3

classification 💻 cs.RO

keywords optimal transport attentionvisuo-haptic fusionimitation learningcontact-rich manipulationpeg-in-hole assemblydiffusion policymulti-modal roboticsrobustness to occlusion

0 comments

The pith

Spacetime Optimal-Transport Attention fuses vision, force, and proprioception signals using entropy-regularized optimal transport to replace standard attention and improve success on contact-rich robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spacetime Optimal-Transport Attention as a tri-modal fusion method for imitation learning policies. It aligns force-pose sub-queries with visual patches through entropy-regularized optimal transport instead of softmax attention, imposing marginal constraints that encourage stable, conditioning-aware spatial selection. This setup is paired with a diffusion policy that maps observation windows to action chunks. Experiments on real-robot tasks show higher success rates and greater robustness to visual changes compared with cross-attention and concatenation baselines. A sympathetic reader would care because contact-rich manipulation remains unreliable when vision, force, and pose must be combined under partial observability and safety limits.

Core claim

Spacetime Optimal-Transport Attention replaces softmax-normalized patch attention with an entropy-regularized optimal transport alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints supply a structured inductive bias that produces conditioning-aware spatial selection stable across illumination changes, distractors, and partial occlusions. When embedded in a diffusion-based sequence policy, the resulting controller reaches 100 percent success on tight peg-in-hole assembly and retains 82.5 percent success under combined visual perturbations where a concatenation baseline falls to 43.5 percent.

What carries the argument

Spacetime Optimal-Transport Attention (SO-TA), an entropy-regularized optimal transport alignment that maps force-pose sub-queries onto visual patches while enforcing marginal constraints to guide attention mass toward task-relevant regions.

If this is right

SO-TA reaches 100 percent success on tight peg-in-hole versus 93 percent for cross-attention at matched capacity.
The method retains 82.5 percent success under illumination, distractor, and partial-occlusion perturbations while a concatenation baseline drops to 43.5 percent.
OT-derived patch heatmaps and leave-one-out modality-influence ratios supply phase-dependent diagnostics for contact-rich behavior.
The same backbone applies to connector insertion and curved-surface mark erasing without modality-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The phase-dependent selection visible in OT heatmaps could be used to diagnose and correct policy failures during deployment on new contact-rich tasks.
Because the transport formulation is largely parameter-free once marginals are set, the approach may transfer to other multi-modal settings where one modality supplies sparse but reliable conditioning signals.
The reported robustness to visual noise suggests the method could lower the volume of visual data augmentation needed during training for outdoor or unstructured environments.

Load-bearing premise

Entropy-regularized optimal transport supplies a stable spatial selection bias that stays effective across illumination changes, distractors, and partial occlusions without introducing instabilities or requiring task-specific retuning of the transport cost or marginals.

What would settle it

Running the same peg-in-hole trials with the marginal constraints removed or with a different transport cost and observing whether success rate falls below the 93 percent cross-attention baseline or whether robustness to perturbations collapses.

Figures

Figures reproduced from arXiv: 2605.20433 by I-Ming Chen, Weicheng Huang, Yue Feng.

**Figure 3.** Figure 3: Compact view of the OT-based attention node inside SO-TA. The sub [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 5.** Figure 5: Predicted first-step ∆z L vs. ground truth for Ninfer diff ∈ {1, 2, 5, 10} (SO-TA) [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Peg-in-hole: success rate (left) and success-only completion time [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Peg-in-hole: success rate (left) and success-only completion time [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 4.** Figure 4: Tight peg-in-hole task. (a) Third-person view of the setup. (b) Robot [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 8.** Figure 8: Peg-in-hole under illumination changes, distractor pegs, and partial [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗

**Figure 11.** Figure 11: BCM connector insertion: success rate (left) and success-only [PITH_FULL_IMAGE:figures/full_fig_p006_11.png] view at source ↗

**Figure 12.** Figure 12: Selecting the cutoff time by minimizing the [PITH_FULL_IMAGE:figures/full_fig_p006_12.png] view at source ↗

**Figure 13.** Figure 13: With the 12 s cutoff wrapper, the long-duration tail is removed and the short-time mode is preserved. forgiving. The dataset contains 396 teleoperated demonstrations (29,126 steps, 2,916.2 s), yielding 22,430 training pairs at the same Tw, Th. Success requires the connector to be fully seated under 5 N downward force. Across 200 online rollouts the SO-TA policy attains a mean success-only time close to h… view at source ↗

**Figure 14.** Figure 14: Curved-surface mark erasing. (a) Setup with a handwritten “Robot” [PITH_FULL_IMAGE:figures/full_fig_p007_14.png] view at source ↗

**Figure 15.** Figure 15: Erasing: success rate (left) and success-only completion-time density [PITH_FULL_IMAGE:figures/full_fig_p007_15.png] view at source ↗

read the original abstract

Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SO-TA replaces standard attention with entropy-regularized OT for tri-modal fusion and shows real-robot gains under perturbation, but the stability claim rests on unverified fixed hyperparameters.

read the letter

The main point is that this work replaces off-the-shelf attention with an entropy-regularized optimal transport alignment between force-pose sub-queries and visual patches, using explicit marginal constraints as an inductive bias inside a diffusion policy for contact-rich tasks. The reported real-robot numbers are the clearest part: 100% success on tight peg-in-hole versus 93% for matched cross-attention, and 82.5% retention under illumination, distractor, and occlusion changes where concatenation falls to 43.5%. The OT heatmaps and leave-one-out modality ratios give some phase-dependent interpretability that standard attention lacks. That combination of tri-modal spacetime OT and diffusion policy on three physical tasks is new relative to the uni- or bi-modal baselines cited in the abstract. The experiments are run on real hardware with perturbation conditions, which is better than pure simulation work. The soft spots are straightforward. The abstract gives success rates from roughly 200 rollouts per condition but supplies no error bars, no statistical tests, and no description of rollout collection or post-hoc tuning. The central robustness argument assumes the transport cost matrix and marginal vectors stay identical across nominal and perturbed trials; nothing in the provided summary confirms this or includes a sensitivity check. If those parameters were adjusted per condition, the performance gap could be explained by optimization rather than the OT structure itself. This is for researchers working on multi-modal imitation learning for assembly and maintenance, especially those who want an explicit spatial bias instead of learned attention. A reader already familiar with optimal transport and diffusion policies will get the most out of the formulation and the perturbation results. It deserves a serious referee because the real-robot evidence is concrete and the inductive-bias idea is worth testing, even though the paper will need tighter controls on hyperparameters and statistics before acceptance.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone for visuo-haptic imitation learning of contact-rich manipulation tasks. It replaces standard softmax attention with entropy-regularized optimal transport alignment between force-pose sub-queries and visual patches, using explicit marginal constraints as a structured inductive bias to encourage conditioning-aware spatial selection. SO-TA is combined with a diffusion-based sequence policy and evaluated on three real-robot tasks (tight peg-in-hole assembly, BCM wiring-connector insertion, curved-surface mark erasing), reporting 100% success on nominal tight peg-in-hole versus 93% for cross-attention at matched capacity, and 82.5% success under illumination/distractor/partial-occlusion perturbations versus 43.5% for a concatenation baseline, with ~200 rollouts per condition. Interpretable diagnostics via OT-derived patch heatmaps and modality-influence ratios are also provided.

Significance. If the central claims hold after addressing experimental verification gaps, the work introduces a principled inductive bias via entropy-regularized OT for multi-modal attention in robotics, potentially improving robustness to visual perturbations in contact-rich tasks without requiring per-condition retuning. The inclusion of phase-dependent interpretable diagnostics (OT heatmaps and leave-one-out ratios) strengthens the contribution by enabling analysis of modality influence across task phases. The approach addresses a genuine gap in fusing vision, force/torque, and proprioception for discontinuous contact dynamics.

major comments (2)

[Experimental evaluation] Experimental evaluation: The performance claims (100% vs. 93% nominal success; 82.5% vs. 43.5% under perturbations) rest on ~200 rollouts per condition but provide no error bars, statistical significance tests, details on rollout collection protocol, or confirmation that hyperparameters were not tuned post-hoc. This leaves open whether the reported gaps reflect the OT structure or experimental variability.
[Method description and robustness experiments] Method and robustness claims: The central assertion that entropy-regularized OT marginal constraints supply a stable, conditioning-aware spatial selection bias effective across illumination, distractor, and partial-occlusion changes without new instabilities or task-specific retuning lacks explicit verification. No sensitivity study or statement confirms that the cost matrix construction and marginal vectors were held identical between nominal peg-in-hole trials and the three perturbation conditions; if per-condition adjustment occurred, the performance advantage could be attributable to optimization rather than the OT inductive bias.

minor comments (2)

The abstract mentions 'three real-robot tasks' but does not explicitly list the success metrics or perturbation definitions in the opening summary; adding one sentence would improve clarity for readers.
Notation for the OT cost matrix and marginal vectors could be introduced earlier with a small equation block to aid readers unfamiliar with entropy-regularized transport in attention contexts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Where clarifications or additional analyses are warranted, we will revise the manuscript accordingly.

read point-by-point responses

Referee: [Experimental evaluation] Experimental evaluation: The performance claims (100% vs. 93% nominal success; 82.5% vs. 43.5% under perturbations) rest on ~200 rollouts per condition but provide no error bars, statistical significance tests, details on rollout collection protocol, or confirmation that hyperparameters were not tuned post-hoc. This leaves open whether the reported gaps reflect the OT structure or experimental variability.

Authors: We acknowledge that the manuscript would benefit from greater statistical transparency. In the revised version we will (i) report error bars (standard error of the mean) computed across the ~200 rollouts per condition, (ii) include a statistical significance test (binomial test for success rates and McNemar’s test for paired comparisons against baselines), (iii) add an explicit description of the rollout protocol (randomized trial ordering, independent environment resets, and fixed random seeds), and (iv) state that all hyperparameters were locked after validation-set tuning and were not adjusted after seeing test results. These additions will make clear that the reported performance differences are not attributable to post-hoc optimization. revision: yes
Referee: [Method description and robustness experiments] Method and robustness claims: The central assertion that entropy-regularized OT marginal constraints supply a stable, conditioning-aware spatial selection bias effective across illumination, distractor, and partial-occlusion changes without new instabilities or task-specific retuning lacks explicit verification. No sensitivity study or statement confirms that the cost matrix construction and marginal vectors were held identical between nominal peg-in-hole trials and the three perturbation conditions; if per-condition adjustment occurred, the performance advantage could be attributable to optimization rather than the OT inductive bias.

Authors: The OT parameters were in fact held fixed: the cost matrix is constructed from the same feature-distance metric between force-pose sub-queries and visual patches, and the marginal vectors are uniform (scaled only by sequence length) with no condition-specific re-weighting. This fixed configuration was used for both nominal and perturbed peg-in-hole trials. To make this explicit, we will insert a clarifying paragraph in the experimental setup section and add a sensitivity study in the appendix that varies the entropy-regularization coefficient and marginal scaling factors while keeping all other elements unchanged. The study will show that success rates remain stable without per-condition retuning, supporting that the robustness stems from the OT inductive bias rather than hidden optimization adjustments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; SO-TA is an independent architectural proposal with empirical validation

full rationale

The paper introduces Spacetime Optimal-Transport Attention (SO-TA) as a tri-modal fusion backbone that replaces softmax attention with entropy-regularized optimal transport between force-pose sub-queries and visual patches, using explicit marginal constraints as a structured inductive bias. No equations, derivations, or self-citations are supplied that reduce the claimed performance gains (100% vs 93% success, 82.5% vs 43.5% under perturbations) or robustness properties to fitted parameters, self-defined outputs, or prior author work. The OT mechanism is presented as an external mathematical tool applied to the visuo-haptic setting, and results are reported from real-robot experiments with ~200 rollouts per condition. The central claims rest on this architectural choice and empirical comparisons rather than any self-referential loop or renaming of known results, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields limited visibility into free parameters or axioms; the work appears to rest on standard imitation-learning assumptions plus the novel OT marginal constraints.

pith-pipeline@v0.9.0 · 5852 in / 1259 out tokens · 42617 ms · 2026-05-21T06:55:47.839734+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

entropy-regularized Optimal Transport (OT) alignment between force–pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

min Π≥0 ⟨Π,C⟩ −ε ot H(Π) s.t. Π1=γ, ΠT1=β

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

[1]

A survey of robot manipulation in contact,

M. Suomalainen, Y . Karayiannidis, and V . Kyrki, “A survey of robot manipulation in contact,”Robotics and Autonomous Systems, vol. 156, p. 104224, 2022

work page 2022
[2]

A survey on imitation learning for contact-rich tasks in robotics,

T. Tsuji, Y . Kato, G. Solak, H. Zhang, T. Petriˇc, F. Nori, and A. Ajoudani, “A survey on imitation learning for contact-rich tasks in robotics,”arXiv preprint arXiv:2506.13498, 2025

work page arXiv 2025
[3]

Survey of learning-based approaches for robotic in-hand manipulation,

A. I. Weinberg, A. Shirizly, O. Azulay, and A. Sintov, “Survey of learning-based approaches for robotic in-hand manipulation,”Frontiers in Robotics and AI, vol. 11, p. 1455431, 2024

work page 2024
[4]

Towards forceful robotic foundation models: A literature survey,

W. Xie and N. Correll, “Towards forceful robotic foundation models: A literature survey,”arXiv preprint arXiv:2504.11827, 2025

work page arXiv 2025
[5]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. C. M. Burchfiel, and S. Song, “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” inProceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[6]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inProceedings of Robotics: Science and Systems (RSS), 2023. 8

work page 2023
[7]

Behavior transformers: Cloningkmodes with one stone,

N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto, “Behavior transformers: Cloningkmodes with one stone,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[8]

Open X-embodiment: Robotic learning datasets and RT-X models,

Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shahet al., “Open X-embodiment: Robotic learning datasets and RT-X models,” inTowards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL, 2023

work page 2023
[9]

Towards generalist robot learning from internet video: A survey,

R. McCarthy, D. C. H. Tan, D. Schmidt, F. Acero, N. Herr, Y . Du, T. G. Thuruthel, and Z. Li, “Towards generalist robot learning from internet video: A survey,”Journal of Artificial Intelligence Research, vol. 83, 2025

work page 2025
[10]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot,

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot,”arXiv preprint arXiv:2307.00595, 2023

work page arXiv 2023
[11]

Multimodal machine learning: A survey and taxonomy,

T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019

work page 2019
[12]

Multimodal learning with transformers: A survey,

Y . Du, J. Hu, J. Chen, and Y . Wang, “Multimodal learning with transformers: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

work page 2022
[13]

Deep generative models in robotics: A survey on learning from multimodal demonstrations,

J. Urain, A. Mandlekar, Y . Du, M. Shafiullah, D. Xu, K. Fragkiadaki, G. Chalvatzaki, and J. Peters, “Deep generative models in robotics: A survey on learning from multimodal demonstrations,”arXiv preprint arXiv:2408.04380, 2024

work page arXiv 2024
[14]

A survey of robot learning from demonstration,

B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,”Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009

work page 2009
[15]

An algorithmic perspective on imitation learning,

T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,”Foundations and Trends in Robotics, vol. 7, no. 1–2, pp. 1–179, 2018

work page 2018
[16]

Recent advances in robot learning from demonstration,

H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Recent advances in robot learning from demonstration,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 297–330, 2020

work page 2020
[17]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998– 6008

work page 2017
[18]

Multimodal transformer for unaligned multimodal language sequences,

Y .-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 6558–6569

work page 2019
[19]

SoftGrasp: Adaptive grasping for dexterous hand based on multimodal imitation learning,

Y . Li, C. Guo, J. Ren, B. Chen, C. Cheng, H. Zhang, and H. Lu, “SoftGrasp: Adaptive grasping for dexterous hand based on multimodal imitation learning,”Biomimetic Intelligence and Robotics, vol. 5, no. 2, p. 100217, 2025

work page 2025
[20]

Learning diffusion policies from demonstrations for compliant contact- rich manipulation,

M. Aburub, C. C. Beltran-Hernandez, T. Kamijo, and M. Hamaya, “Learning diffusion policies from demonstrations for compliant contact- rich manipulation,”arXiv preprint, 2024

work page 2024
[21]

Robotic compliant object prying using diffusion policy guided by vision and force obser- vations,

J. H. Kang, S. Joshi, R. Huang, and S. K. Gupta, “Robotic compliant object prying using diffusion policy guided by vision and force obser- vations,”arXiv preprint, 2024

work page 2024
[22]

See, hear, and feel: Smart sensory fusion for robotic manipulation,

H. Li, Y . Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei- Fei, R. Gao, and J. Wu, “See, hear, and feel: Smart sensory fusion for robotic manipulation,” inProceedings of the 6th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205, 2023

work page 2023
[23]

Visuo-tactile transformers for manipulation,

Y . Chen, M. van der Merwe, A. Sipos, and N. Fazeli, “Visuo-tactile transformers for manipulation,” inProceedings of the 6th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205, 2023, pp. 2026–2040

work page 2023
[24]

Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,

M. A. Lee, Y . Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei- Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” inIEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 8943–8950

work page 2019
[25]

QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, and S. Levine, “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” inConference on Robot Learning (CoRL), 2018, pp. 651– 673

work page 2018
[26]

Robotic imitation of human assembly skills using hybrid trajectory and force learning,

Y . Wang, C. C. Beltran-Hernandez, W. Wan, and K. Harada, “Robotic imitation of human assembly skills using hybrid trajectory and force learning,”arXiv preprint, 2021

work page 2021
[27]

Robotic paper wrapping by learning force control,

H. Hanai, T. Kiyokawa, W. Wan, and K. Harada, “Robotic paper wrapping by learning force control,”arXiv preprint, 2023

work page 2023
[28]

MOMA-Force: Visual-force imitation for real-world mobile manipulation,

T. Yang, Y . Jing, H. Wu, J. Xu, K. Sima, G. Chen, Q. Sima, and T. Kong, “MOMA-Force: Visual-force imitation for real-world mobile manipulation,”arXiv preprint arXiv:2308.03624, 2024

work page arXiv 2024
[29]

ForceMimic: Force- centric imitation learning with force-motion capture system for contact- rich manipulation,

W. Liu, J. Wang, Y . Wang, W. Wang, and C. Lu, “ForceMimic: Force- centric imitation learning with force-motion capture system for contact- rich manipulation,”arXiv preprint, 2024

work page 2024
[30]

Multi-modal fusion in contact- rich precise tasks via hierarchical policy learning,

P. Jin, Y . Lin, Y . Tan, T. Li, and W. Yang, “Multi-modal fusion in contact- rich precise tasks via hierarchical policy learning,”arXiv preprint, 2022

work page 2022
[31]

Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,

P. Jin, Y . Lin, Y . Song, T. Li, and W. Yang, “Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,”Frontiers in Neurorobotics, vol. 17, p. 1280773, 2023

work page 2023
[32]

Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation,

R. Feng, D. Hu, W. Ma, and X. Li, “Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation,” inProceedings of the Conference on Robot Learning (CoRL), 2024

work page 2024
[33]

Sinkhorn distances: Lightspeed computation of optimal transport,

M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” inAdvances in Neural Information Processing Systems (NeurIPS), 2013

work page 2013
[34]

Peyr ´e and M

G. Peyr ´e and M. Cuturi,Computational Optimal Transport, ser. Founda- tions and Trends in Machine Learning. Now Publishers, 2019, vol. 11, no. 5–6

work page 2019
[35]

Sinkformers: Transformers with doubly stochastic attention,

M. E. Sander, P. Ablin, M. Blondel, and G. Peyr ´e, “Sinkformers: Transformers with doubly stochastic attention,” inProceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022, pp. 3515–3530

work page 2022
[36]

Unlocking slot attention by changing optimal transport costs,

Y . Zhang, D. W. Zhang, S. Lacoste-Julien, G. J. Burghouts, and C. G. M. Snoek, “Unlocking slot attention by changing optimal transport costs,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 41 931–41 951

work page 2023
[37]

OTSeg: Multi-prompt sinkhorn atten- tion for zero-shot semantic segmentation,

K. Kim, Y . Oh, and J. C. Ye, “OTSeg: Multi-prompt sinkhorn atten- tion for zero-shot semantic segmentation,” inEuropean Conference on Computer Vision (ECCV), 2024, pp. 200–217

work page 2024
[38]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 6840–6851

work page 2020
[39]

Score-based generative modeling through stochastic differ- ential equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,”International Conference on Learning Representations (ICLR), 2021

work page 2021
[40]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

work page 2016
[41]

Mixed precision training,

P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garc ´ıa, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” inInternational Conference on Learning Representations (ICLR), 2018

work page 2018
[42]

Delta6: A Low-Cost, 6-DOF Force-Sensing Flexible End-Effector

Y . Feng, W. Huang, C. Qiu, H. Dong, and I.-M. Chen, “Delta6: A low-cost, 6-dof force-sensing flexible end-effector,”arXiv preprint arXiv:2604.06150, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative Arms

Y . Feng, W. Huang, and I.-M. Chen, “One interface, many robots: Unified real-time low-level motion planning for collaborative arms,” arXiv preprint arXiv:2604.08787, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

Optimizing small-scale commer- cial automation: Introducing WOS, a low-code solution for robotic arms integration,

Y . Feng, W. Huang, and I.-M. Chen, “Optimizing small-scale commer- cial automation: Introducing WOS, a low-code solution for robotic arms integration,” inIEEE International Conference on Advanced Intelligent Mechatronics (AIM), 2024, pp. 272–277

work page 2024
[45]

On the continuity of rotation representations in neural networks,

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745– 5753

work page 2019

[1] [1]

A survey of robot manipulation in contact,

M. Suomalainen, Y . Karayiannidis, and V . Kyrki, “A survey of robot manipulation in contact,”Robotics and Autonomous Systems, vol. 156, p. 104224, 2022

work page 2022

[2] [2]

A survey on imitation learning for contact-rich tasks in robotics,

T. Tsuji, Y . Kato, G. Solak, H. Zhang, T. Petriˇc, F. Nori, and A. Ajoudani, “A survey on imitation learning for contact-rich tasks in robotics,”arXiv preprint arXiv:2506.13498, 2025

work page arXiv 2025

[3] [3]

Survey of learning-based approaches for robotic in-hand manipulation,

A. I. Weinberg, A. Shirizly, O. Azulay, and A. Sintov, “Survey of learning-based approaches for robotic in-hand manipulation,”Frontiers in Robotics and AI, vol. 11, p. 1455431, 2024

work page 2024

[4] [4]

Towards forceful robotic foundation models: A literature survey,

W. Xie and N. Correll, “Towards forceful robotic foundation models: A literature survey,”arXiv preprint arXiv:2504.11827, 2025

work page arXiv 2025

[5] [5]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. C. M. Burchfiel, and S. Song, “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” inProceedings of Robotics: Science and Systems (RSS), 2023

work page 2023

[6] [6]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inProceedings of Robotics: Science and Systems (RSS), 2023. 8

work page 2023

[7] [7]

Behavior transformers: Cloningkmodes with one stone,

N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto, “Behavior transformers: Cloningkmodes with one stone,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[8] [8]

Open X-embodiment: Robotic learning datasets and RT-X models,

Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shahet al., “Open X-embodiment: Robotic learning datasets and RT-X models,” inTowards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL, 2023

work page 2023

[9] [9]

Towards generalist robot learning from internet video: A survey,

R. McCarthy, D. C. H. Tan, D. Schmidt, F. Acero, N. Herr, Y . Du, T. G. Thuruthel, and Z. Li, “Towards generalist robot learning from internet video: A survey,”Journal of Artificial Intelligence Research, vol. 83, 2025

work page 2025

[10] [10]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot,

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot,”arXiv preprint arXiv:2307.00595, 2023

work page arXiv 2023

[11] [11]

Multimodal machine learning: A survey and taxonomy,

T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019

work page 2019

[12] [12]

Multimodal learning with transformers: A survey,

Y . Du, J. Hu, J. Chen, and Y . Wang, “Multimodal learning with transformers: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

work page 2022

[13] [13]

Deep generative models in robotics: A survey on learning from multimodal demonstrations,

J. Urain, A. Mandlekar, Y . Du, M. Shafiullah, D. Xu, K. Fragkiadaki, G. Chalvatzaki, and J. Peters, “Deep generative models in robotics: A survey on learning from multimodal demonstrations,”arXiv preprint arXiv:2408.04380, 2024

work page arXiv 2024

[14] [14]

A survey of robot learning from demonstration,

B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,”Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009

work page 2009

[15] [15]

An algorithmic perspective on imitation learning,

T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,”Foundations and Trends in Robotics, vol. 7, no. 1–2, pp. 1–179, 2018

work page 2018

[16] [16]

Recent advances in robot learning from demonstration,

H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Recent advances in robot learning from demonstration,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 297–330, 2020

work page 2020

[17] [17]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998– 6008

work page 2017

[18] [18]

Multimodal transformer for unaligned multimodal language sequences,

Y .-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 6558–6569

work page 2019

[19] [19]

SoftGrasp: Adaptive grasping for dexterous hand based on multimodal imitation learning,

Y . Li, C. Guo, J. Ren, B. Chen, C. Cheng, H. Zhang, and H. Lu, “SoftGrasp: Adaptive grasping for dexterous hand based on multimodal imitation learning,”Biomimetic Intelligence and Robotics, vol. 5, no. 2, p. 100217, 2025

work page 2025

[20] [20]

Learning diffusion policies from demonstrations for compliant contact- rich manipulation,

M. Aburub, C. C. Beltran-Hernandez, T. Kamijo, and M. Hamaya, “Learning diffusion policies from demonstrations for compliant contact- rich manipulation,”arXiv preprint, 2024

work page 2024

[21] [21]

Robotic compliant object prying using diffusion policy guided by vision and force obser- vations,

J. H. Kang, S. Joshi, R. Huang, and S. K. Gupta, “Robotic compliant object prying using diffusion policy guided by vision and force obser- vations,”arXiv preprint, 2024

work page 2024

[22] [22]

See, hear, and feel: Smart sensory fusion for robotic manipulation,

H. Li, Y . Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei- Fei, R. Gao, and J. Wu, “See, hear, and feel: Smart sensory fusion for robotic manipulation,” inProceedings of the 6th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205, 2023

work page 2023

[23] [23]

Visuo-tactile transformers for manipulation,

Y . Chen, M. van der Merwe, A. Sipos, and N. Fazeli, “Visuo-tactile transformers for manipulation,” inProceedings of the 6th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205, 2023, pp. 2026–2040

work page 2023

[24] [24]

Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,

M. A. Lee, Y . Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei- Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” inIEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 8943–8950

work page 2019

[25] [25]

QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, and S. Levine, “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” inConference on Robot Learning (CoRL), 2018, pp. 651– 673

work page 2018

[26] [26]

Robotic imitation of human assembly skills using hybrid trajectory and force learning,

Y . Wang, C. C. Beltran-Hernandez, W. Wan, and K. Harada, “Robotic imitation of human assembly skills using hybrid trajectory and force learning,”arXiv preprint, 2021

work page 2021

[27] [27]

Robotic paper wrapping by learning force control,

H. Hanai, T. Kiyokawa, W. Wan, and K. Harada, “Robotic paper wrapping by learning force control,”arXiv preprint, 2023

work page 2023

[28] [28]

MOMA-Force: Visual-force imitation for real-world mobile manipulation,

T. Yang, Y . Jing, H. Wu, J. Xu, K. Sima, G. Chen, Q. Sima, and T. Kong, “MOMA-Force: Visual-force imitation for real-world mobile manipulation,”arXiv preprint arXiv:2308.03624, 2024

work page arXiv 2024

[29] [29]

ForceMimic: Force- centric imitation learning with force-motion capture system for contact- rich manipulation,

W. Liu, J. Wang, Y . Wang, W. Wang, and C. Lu, “ForceMimic: Force- centric imitation learning with force-motion capture system for contact- rich manipulation,”arXiv preprint, 2024

work page 2024

[30] [30]

Multi-modal fusion in contact- rich precise tasks via hierarchical policy learning,

P. Jin, Y . Lin, Y . Tan, T. Li, and W. Yang, “Multi-modal fusion in contact- rich precise tasks via hierarchical policy learning,”arXiv preprint, 2022

work page 2022

[31] [31]

Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,

P. Jin, Y . Lin, Y . Song, T. Li, and W. Yang, “Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,”Frontiers in Neurorobotics, vol. 17, p. 1280773, 2023

work page 2023

[32] [32]

Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation,

R. Feng, D. Hu, W. Ma, and X. Li, “Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation,” inProceedings of the Conference on Robot Learning (CoRL), 2024

work page 2024

[33] [33]

Sinkhorn distances: Lightspeed computation of optimal transport,

M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” inAdvances in Neural Information Processing Systems (NeurIPS), 2013

work page 2013

[34] [34]

Peyr ´e and M

G. Peyr ´e and M. Cuturi,Computational Optimal Transport, ser. Founda- tions and Trends in Machine Learning. Now Publishers, 2019, vol. 11, no. 5–6

work page 2019

[35] [35]

Sinkformers: Transformers with doubly stochastic attention,

M. E. Sander, P. Ablin, M. Blondel, and G. Peyr ´e, “Sinkformers: Transformers with doubly stochastic attention,” inProceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022, pp. 3515–3530

work page 2022

[36] [36]

Unlocking slot attention by changing optimal transport costs,

Y . Zhang, D. W. Zhang, S. Lacoste-Julien, G. J. Burghouts, and C. G. M. Snoek, “Unlocking slot attention by changing optimal transport costs,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 41 931–41 951

work page 2023

[37] [37]

OTSeg: Multi-prompt sinkhorn atten- tion for zero-shot semantic segmentation,

K. Kim, Y . Oh, and J. C. Ye, “OTSeg: Multi-prompt sinkhorn atten- tion for zero-shot semantic segmentation,” inEuropean Conference on Computer Vision (ECCV), 2024, pp. 200–217

work page 2024

[38] [38]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 6840–6851

work page 2020

[39] [39]

Score-based generative modeling through stochastic differ- ential equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,”International Conference on Learning Representations (ICLR), 2021

work page 2021

[40] [40]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

work page 2016

[41] [41]

Mixed precision training,

P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garc ´ıa, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” inInternational Conference on Learning Representations (ICLR), 2018

work page 2018

[42] [42]

Delta6: A Low-Cost, 6-DOF Force-Sensing Flexible End-Effector

Y . Feng, W. Huang, C. Qiu, H. Dong, and I.-M. Chen, “Delta6: A low-cost, 6-dof force-sensing flexible end-effector,”arXiv preprint arXiv:2604.06150, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative Arms

Y . Feng, W. Huang, and I.-M. Chen, “One interface, many robots: Unified real-time low-level motion planning for collaborative arms,” arXiv preprint arXiv:2604.08787, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[44] [44]

Optimizing small-scale commer- cial automation: Introducing WOS, a low-code solution for robotic arms integration,

Y . Feng, W. Huang, and I.-M. Chen, “Optimizing small-scale commer- cial automation: Introducing WOS, a low-code solution for robotic arms integration,” inIEEE International Conference on Advanced Intelligent Mechatronics (AIM), 2024, pp. 272–277

work page 2024

[45] [45]

On the continuity of rotation representations in neural networks,

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745– 5753

work page 2019