FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

Haotian He; Ning Guo; Qipeng Liu; Wenzhao Lian; Zeyu Yan

arxiv: 2606.08555 · v1 · pith:SCQPGLLOnew · submitted 2026-06-07 · 💻 cs.RO

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

Haotian He , Zeyu Yan , Qipeng Liu , Ning Guo , Wenzhao Lian This is my paper

Pith reviewed 2026-06-27 18:19 UTC · model grok-4.3

classification 💻 cs.RO

keywords force-aware manipulationworld action modelscontact-rich tasksclosed-loop controlwrench predictionresidual correctionrobotic manipulationforce feedback

0 comments

The pith

FAWAM encodes force signals at perception, prediction, and closed-loop execution to raise success rates in contact-rich robotic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FAWAM, a world action model that brings force/torque data into three stages of robotic control. Historical 6-axis signals modulate generated actions, future actions and end-effector wrenches are predicted jointly to capture contact changes, and a residual correction module uses the predicted wrench trajectory as a live reference to adjust actions from real-time feedback. Experiments on several contact-rich tasks report average success-rate gains of 36.25 percent over vision-only baselines and 21.25 percent over prior force-aware methods. A reader would care because force supplies direct interaction information that vision alone often misses when objects touch or slide.

Core claim

FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution, and finally applies a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback.

What carries the argument

The residual correction module, which treats the predicted wrench trajectory as a real-time reference signal to adjust actions during execution.

If this is right

Joint prediction of actions and wrenches produces an explicit model of how contacts evolve over time.
Real-time force feedback drives online refinement of planned actions without retraining.
The same architecture delivers measurable gains on multiple distinct contact-rich manipulation tasks.
Performance exceeds both pure vision baselines and earlier force-augmented approaches by the reported margins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the wrench-prediction head remains accurate on unseen objects, the method could support rapid adaptation to new contact-rich tasks with minimal additional data.
The three-level force integration could be tested in simulation-to-real transfer settings to check whether predicted wrenches help close the reality gap.
Scaling the residual correction to multi-finger or dual-arm systems would reveal whether the same reference-signal approach remains stable at higher degrees of freedom.

Load-bearing premise

The residual correction module will generate stable online refinements without introducing instability or needing task-specific tuning that was not reported.

What would settle it

A controlled ablation that disables the residual correction module and measures whether success rates fall back to baseline levels or whether action execution becomes unstable on the same contact-rich tasks.

Figures

Figures reproduced from arXiv: 2606.08555 by Haotian He, Ning Guo, Qipeng Liu, Wenzhao Lian, Zeyu Yan.

**Figure 1.** Figure 1: Comparison between force-conditioned prediction and our method. (left) Adding force as additional observation alone fails to correct contact deviation while (right) systematic force integration in three levels restores proper contact for successful wiping. 2 Related Work Force-Aware Manipulation Policies. Recent works [6, 9, 15, 18, 19, 20] have explored various strategies to incorporate force signals into… view at source ↗

**Figure 2.** Figure 2: Overview of FAWAM. FAWAM incorporates force signals through force-conditioned action generation, future force prediction, and online residual correction. (a) The Force-Envisioned Action Model conditions on visual, language and force inputs to jointly predict action chunks and future force trajectories. (b) The Force-Guided Residual Correction module learns from human interventions and corrects the base act… view at source ↗

**Figure 3.** Figure 3: Contact-rich tasks. Each task is evaluated under diverse contact conditions induced by controlled changes in surface inclination, table height, or the position of the sand-filled box. 4 Experimental Results 4.1 Experimental Setup Hardware Platform. All real-world experiments are conducted on a Franka Research 3 (FR3), a 7-DoF robotic arm. An ATI Axia80-M8 6-axis force/torque sensor is mounted at the wrist … view at source ↗

**Figure 4.** Figure 4: Execution time comparison. Lower values indicate faster completion with less inefficient contact adjustment. 73.75% without additional correction data, showing that the proposed force-envisioned action model already provides a strong base policy. The residual corrector further improves execution robustness by compensating for contact deviations during deployment. 4.3 Ablation Study [PITH_FULL_IMAGE:figure… view at source ↗

**Figure 5.** Figure 5: Perturbation rollouts comparison. With perturbations, force-guided residual correction help FAWAM recover stable contact, while the model without correction fails to adapt [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Experimental environment. The setup consists of a Franka robot for task execution, a FACTR leader robot for force-feedback teleoperation, and multiple Intel RealSense cameras for multi-view visual observation. A.2 Residual-Only Ablation Details The Res-only ablation isolates the effect of residual correction without the predicted wrench guidance used by the full model in Sec. 3.3.1. Since this variant doe… view at source ↗

**Figure 7.** Figure 7: Comparative Correction Experiments for Erase Board. w/o Correction w/ Correction [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Comparative Correction Experiments for Peel Cucumber. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Comparative Correction Experiments for Pivot Box. w/o Correction w/Correction [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Comparative Correction Experiments for Wipe Vase. A.5 Additional Visualization Supplement Figures 11–14 complement the ablation study in [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation qualitative supplement for Erase Board [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Ablation qualitative supplement for Peel Cucumber. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Ablation qualitative supplement for Pivot Box [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Ablation qualitative supplement for Wipe Vase. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Success rate under different τ . In this analysis, we evaluate the effect of different residual activation thresholds τ . A smaller τ makes the residual gate easier to activate, so residual corrections are more frequently added to the base action chunk. We test τ ∈ {0, 0.2, 0.5, 0.8, 0.99} on the Peel Cucumber and Wipe Vase tasks. As shown in the [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

read the original abstract

Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FAWAM adds force at perception, joint action-wrench prediction, and a residual correction step, with abstract claiming 36% and 21% success lifts, but no trial counts or stability checks are given so the numbers stay hard to trust.

read the letter

The core claim is that feeding force signals into three places—modulating perception, predicting wrenches with actions, and using the predicted wrench trajectory for online residual fixes—produces noticeably better real-world success on contact-rich tasks than vision-only or prior force-aware baselines.

The architecture itself is the clearest new piece. Most prior work treats force as extra input; here the model explicitly forecasts future wrenches and then treats that forecast as a reference signal for correction during execution. That closed-loop use of the prediction is the part that could matter for tasks where contact evolves over time.

The reported gains are large enough to be worth checking. A 36% lift over vision baselines and 21% over other force methods would matter for assembly or insertion work if the tasks are representative and the numbers are stable.

The main weakness is the missing experimental detail. The abstract gives no trial counts, no variance numbers, no list of exact tasks, and no ablation or stability analysis on the residual module. The stress-test concern about whether the correction stays stable under force noise or needs hidden per-task tuning is not addressed in what is visible. Without those, it is impossible to know whether the gains generalize or depend on favorable conditions.

This is the sort of paper that would interest people working on practical contact-rich controllers. A reader looking for concrete ideas on how to close the loop with predicted force would find something usable. Someone who needs reproducible results or strong statistical backing would have to wait for the full methods section and data.

I would send it to review. The idea is concrete and the empirical direction is testable; the current write-up is just too light on evidence to stand on its own.

Referee Report

2 major / 1 minor

Summary. The paper proposes FAWAM, a force-aware world action model for closed-loop contact-rich robotic manipulation. Force information is incorporated at three levels: perception (encoding historical 6-axis force/torque signals to modulate action generation), prediction (jointly forecasting future actions and end-effector wrenches to model contact evolution), and execution (a residual correction module that uses the predicted wrench trajectory as an online reference to refine actions based on real-time force feedback). Real-world experiments across multiple tasks claim average success-rate improvements of 36.25% over vision-only baselines and 21.25% over existing force-aware baselines.

Significance. If the empirical claims hold under rigorous validation, the explicit modeling of wrench trajectories for closed-loop residual correction represents a meaningful step toward more reliable force-aware control in contact-rich tasks, where vision-only methods often fail due to unmodeled dynamics.

major comments (2)

[Abstract] Abstract: the headline success-rate gains (36.25% / 21.25%) are presented without any reported trial counts, statistical tests, variance measures, task definitions, or failure-mode analysis, rendering it impossible to assess whether the data support the central empirical claim.
[Method (residual correction)] Residual correction module: no stability analysis, ablation on feedback gains, or evidence of fixed-hyperparameter generalization across tasks is supplied, which is load-bearing for the claim that the module delivers the reported improvements without introducing instability.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly named the specific contact-rich tasks evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the headline success-rate gains (36.25% / 21.25%) are presented without any reported trial counts, statistical tests, variance measures, task definitions, or failure-mode analysis, rendering it impossible to assess whether the data support the central empirical claim.

Authors: We agree that the abstract, due to length constraints, omits these supporting details. The full manuscript reports trial counts, variance, task definitions, and failure modes in the Experiments section. To address the concern directly in the abstract, we will revise it to include trial counts and variance measures. revision: yes
Referee: [Method (residual correction)] Residual correction module: no stability analysis, ablation on feedback gains, or evidence of fixed-hyperparameter generalization across tasks is supplied, which is load-bearing for the claim that the module delivers the reported improvements without introducing instability.

Authors: The referee is correct that the current manuscript supplies no formal stability analysis, gain ablations, or explicit generalization evidence for the residual correction module. While experiments show consistent gains without instability, this constitutes a gap in the supporting analysis. We will add an ablation on feedback gains and a discussion of hyperparameter generalization in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivations

full rationale

The paper's central claims rest on real-world experimental success rates (36.25% and 21.25% improvements) across contact-rich tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or method description. The residual correction module is presented as a component of the architecture, but its performance is evaluated empirically rather than derived from prior self-referential steps. This is a standard non-finding for an applied robotics paper whose contributions are measured by hardware results rather than theoretical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training objectives, or modeling assumptions, so the ledger is empty.

pith-pipeline@v0.9.1-grok · 5711 in / 1005 out tokens · 17287 ms · 2026-06-27T18:19:39.355342+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 11 linked inside Pith

[1]

M. A. Lee, Y . Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. In2019 International conference on robotics and automation (ICRA), pages 8943–8950. IEEE, 2019

2019
[2]

Z. Zhao, S. Haldar, J. Cui, L. Pinto, and R. Bhirangi. Touch begins where vision ends: Gener- alizable policies for contact-rich manipulation, 2025

2025
[3]

Zheng, S

Y . Zheng, S. Gu, W. Li, Y . Zheng, Y . Zang, S. Tian, X. Li, C. Hao, C. Gao, S. Liu, et al. Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201, 2026

arXiv 2026
[4]

J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, C. Lu, and W. Zhang. ForceVLA: Enhancing vla models with a force-aware moe for contact-rich manipulation, 2025

2025
[5]

C. Chen, Z. Yu, H. Choi, M. Cutkosky, and J. Bohg. Dexforce: Extracting force-informed actions from kinesthetic demonstrations for dexterous manipulation.IEEE Robotics and Au- tomation Letters, 2025

2025
[6]

H. Fang, S. Tang, M. Mei, H. Qin, Z. He, J. Chen, Y . Feng, C. Wang, W. Liu, Z. He, C. Lu, and S. Wang. Force policy: Learning hybrid force-position control policy under interaction frame for contact-rich manipulation, 2026

2026
[7]

Z. He, H. Fang, J. Chen, H.-S. Fang, and C. Lu. FoAR: Force-aware reactive policy for contact- rich robotic manipulation, 2024

2024
[8]

Zhang, H

Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao. TA-VLA: Elucidating the design space of torque-aware vision-language-action models. In9th Annual Conference on Robot Learning, 2025

2025
[9]

Z. Sun, Y . Wang, D. Held, and Z. Erickson. Force-constrained visual policy: Safe robot- assisted dressing via multi-modal sensing.IEEE Robotics and Automation Letters, 9(5):4178– 4185, 2024

2024
[10]

Y . Hou, Z. Liu, C. Chi, E. Cousineau, N. Kuppuswamy, S. Feng, B. Burchfiel, and S. Song. Adaptive compliance policy: Learning approximate compliance for diffusion guided control, 2024

2024
[11]

H. Xue, J. Ren, W. Chen, G. Zhang, Y . Fang, G. Gu, H. Xu, and C. Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation, 2025

2025
[12]

Y . Li, P. Tang, W. Zhang, C. Zhu, Y . Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y . Zhang. FA VLA: A force-adaptive fast-slow vla model for contact-rich robotic manipulation, 2026

2026
[13]

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

Pith/arXiv arXiv 2025
[14]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y . Shen, and Y . Xu. Causal world modeling for robot control, 2026

2026
[15]

J. Pai, L. Achenbach, V . Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video- action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025. 9

Pith/arXiv arXiv 2025
[16]

H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025
[17]

T. Yuan, Z. Dong, Y . Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026
[18]

Buamanee, M

T. Buamanee, M. Kobayashi, Y . Uranishi, and H. Takemura. Bi-act: Bilateral control-based imitation learning via action chunking with transformer. In2024 IEEE International Confer- ence on Advanced Intelligent Mechatronics (AIM), pages 410–415. IEEE, 2024

2024
[19]

J. Seo, A. Kruthiventy, S. Lee, M. Teng, S. Choi, X. Zhang, J. Choi, and R. Horowitz. Equicon- tact: A hierarchical se (3) vision-to-force equivariant policy for spatially generalizable contact- rich tasks.arXiv preprint arXiv:2507.10961, 2025

arXiv 2025
[20]

B. Zhou, R. Jiao, Y . Li, X. Yuan, F. Fang, and S. Li. Admittance visuomotor policy learning for general-purpose contact-rich manipulations.IEEE Transactions on Industrial Electronics, 2025

2025
[21]

J. J. Liu, Y . Li, K. Shaw, T. Tao, R. Salakhutdinov, and D. Pathak. FACTR: Force-attending curriculum training for contact-rich policy learning, 2025

2025
[22]

Hafner, J

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models, 2023

2023
[23]

S. Wang, J. Shi, Z. Fu, X. He, F. Liu, C. Yang, Y . Zhou, Z. Fei, J. Gong, J. Fu, M. Z. Shou, X. Huang, X. Qiu, and Y .-G. Jiang. World action models: The next frontier in embodied ai, 2026

2026
[24]

Y . Shen, F. Wei, Z. Du, Y . Liang, Y . Lu, J. Yang, N. Zheng, and B. Guo. Videovla: Video generators can be generalizable robot manipulators.Advances in neural information processing systems, 38:95597–95621, 2026

2026
[25]

B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y . Ze, T. Harada, P. Torr, et al. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080, 2026

Pith/arXiv arXiv 2026
[26]

J. Lyu, Z. Li, X. Shi, C. Xu, Y . Wang, and H. Wang. Dywa: Dynamics-adaptive world ac- tion model for generalizable non-prehensile manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11058–11068, 2025

2025
[27]

X. Liu, Z. Bai, H. Ci, K. Y . Ma, and M. Z. Shou. World-vla-loop: Closed-loop learning of video world model and vla policy.arXiv preprint arXiv:2602.06508, 2026

Pith/arXiv arXiv 2026
[28]

A. L. Chandra, I. Nematollahi, C. Huang, T. Welschehold, W. Burgard, and A. Valada. Diwa: Diffusion policy adaptation with world models. InConference on Robot Learning, pages 3378–
[29]

Zheng, J

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, et al. Flare: Robot learning with implicit world modeling. InConference on Robot Learning, pages 3952–3971. PMLR, 2025

2025
[30]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, H. Li, J. Li, J. Lv, J. Liu, et al. Gigaworld- policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

arXiv 2026
[31]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024. 10

Pith/arXiv arXiv 2024
[32]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025
[33]

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026

2026
[34]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xi- ang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. Fan, and J. Jang. World action mode...

2026
[35]

S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Pith/arXiv arXiv 2025
[36]

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Cou- pling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025

Pith/arXiv arXiv 2025
[37]

Seo, B.-J

S. Seo, B.-J. Lee, J. Lee, H. Hwang, H. Yang, and K.-E. Kim. Mitigating covariate shift in behavioral cloning via robust stationary distribution correction. InAdvances in Neural Infor- mation Processing Systems, volume 37, 2024

2024
[38]

S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the Fourteenth International Confer- ence on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 627–635, 2011

2011
[39]

Kelly, C

M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

2019
[40]

Spencer, S

J. Spencer, S. Choudhury, M. Barnes, M. Schmittle, M. Chiang, P. Ramadge, and S. Srinivasa. Learning from interventions: Human-robot interaction as both explicit and implicit feedback. InRobotics: Science and Systems, 2020

2020
[41]

Mandlekar, D

A. Mandlekar, D. Xu, R. Mart ´ın-Mart´ın, Y . Zhu, L. Fei-Fei, and S. Savarese. Human-in-the- loop imitation learning using remote teleoperation, 2020

2020
[42]

H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y . Zhu. Robot learning on the job: Human-in- the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 2022

2022
[43]

Hoque, A

R. Hoque, A. Balakrishna, C. Putterman, M. Luo, D. S. Brown, D. Seita, B. Thananjeyan, E. Novoseller, and K. Goldberg. LazyDAgger: Reducing context switching in interactive imitation learning. InIEEE International Conference on Automation Science and Engineering, pages 502–509, 2021

2021
[44]

Hoque, A

R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg. ThriftyDAgger: Budget-aware novelty and risk gating for interactive imitation learning, 2021

2021
[45]

P. Wu, Y . Shentu, Q. Liao, D. Jin, M. Guo, K. Sreenath, X. Lin, and P. Abbeel. Robocopi- lot: Human-in-the-loop interactive imitation learning for robot manipulation.arXiv preprint arXiv:2503.07771, 2025

arXiv 2025
[46]

X. Xu, Y . Hou, C. Xin, Z. Liu, and S. Song. Compliant residual DAgger: Improving real- world contact-rich manipulation with human corrections. InAdvances in Neural Information Processing Systems, 2025. 11

2025
[47]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. InIEEE International Conference on Robotics and Automation, pages 6023–6029, 2019

2019
[48]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement: Residual RL for precise assembly, 2024

2024
[49]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model, 2024

2024
[50]

T. He, J. Gao, W. Xiao, Y . Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole- body skills.arXiv preprint arXiv:2502.01143, 2025

arXiv 2025
[51]

Haldar, J

S. Haldar, J. Pari, A. Rai, and L. Pinto. Teach a robot to fish: Versatile imitation from one minute of demonstrations.arXiv preprint arXiv:2303.01497, 2023

arXiv 2023
[52]

Guzey, Y

I. Guzey, Y . Dai, B. Evans, S. Chintala, and L. Pinto. See to touch: Learning tactile dexterity through visual incentives. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13825–13832. IEEE, 2024

2024
[53]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 12 A Appendix A.1 Hardware and Training Details Hardware.Figure 6 shows the physical layout of the follower robot, FACTR leader ...

Pith/arXiv arXiv 2025

[1] [1]

M. A. Lee, Y . Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. In2019 International conference on robotics and automation (ICRA), pages 8943–8950. IEEE, 2019

2019

[2] [2]

Z. Zhao, S. Haldar, J. Cui, L. Pinto, and R. Bhirangi. Touch begins where vision ends: Gener- alizable policies for contact-rich manipulation, 2025

2025

[3] [3]

Zheng, S

Y . Zheng, S. Gu, W. Li, Y . Zheng, Y . Zang, S. Tian, X. Li, C. Hao, C. Gao, S. Liu, et al. Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201, 2026

arXiv 2026

[4] [4]

J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, C. Lu, and W. Zhang. ForceVLA: Enhancing vla models with a force-aware moe for contact-rich manipulation, 2025

2025

[5] [5]

C. Chen, Z. Yu, H. Choi, M. Cutkosky, and J. Bohg. Dexforce: Extracting force-informed actions from kinesthetic demonstrations for dexterous manipulation.IEEE Robotics and Au- tomation Letters, 2025

2025

[6] [6]

H. Fang, S. Tang, M. Mei, H. Qin, Z. He, J. Chen, Y . Feng, C. Wang, W. Liu, Z. He, C. Lu, and S. Wang. Force policy: Learning hybrid force-position control policy under interaction frame for contact-rich manipulation, 2026

2026

[7] [7]

Z. He, H. Fang, J. Chen, H.-S. Fang, and C. Lu. FoAR: Force-aware reactive policy for contact- rich robotic manipulation, 2024

2024

[8] [8]

Zhang, H

Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao. TA-VLA: Elucidating the design space of torque-aware vision-language-action models. In9th Annual Conference on Robot Learning, 2025

2025

[9] [9]

Z. Sun, Y . Wang, D. Held, and Z. Erickson. Force-constrained visual policy: Safe robot- assisted dressing via multi-modal sensing.IEEE Robotics and Automation Letters, 9(5):4178– 4185, 2024

2024

[10] [10]

Y . Hou, Z. Liu, C. Chi, E. Cousineau, N. Kuppuswamy, S. Feng, B. Burchfiel, and S. Song. Adaptive compliance policy: Learning approximate compliance for diffusion guided control, 2024

2024

[11] [11]

H. Xue, J. Ren, W. Chen, G. Zhang, Y . Fang, G. Gu, H. Xu, and C. Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation, 2025

2025

[12] [12]

Y . Li, P. Tang, W. Zhang, C. Zhu, Y . Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y . Zhang. FA VLA: A force-adaptive fast-slow vla model for contact-rich robotic manipulation, 2026

2026

[13] [13]

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

Pith/arXiv arXiv 2025

[14] [14]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y . Shen, and Y . Xu. Causal world modeling for robot control, 2026

2026

[15] [15]

J. Pai, L. Achenbach, V . Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video- action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025. 9

Pith/arXiv arXiv 2025

[16] [16]

H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025

[17] [17]

T. Yuan, Z. Dong, Y . Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026

[18] [18]

Buamanee, M

T. Buamanee, M. Kobayashi, Y . Uranishi, and H. Takemura. Bi-act: Bilateral control-based imitation learning via action chunking with transformer. In2024 IEEE International Confer- ence on Advanced Intelligent Mechatronics (AIM), pages 410–415. IEEE, 2024

2024

[19] [19]

J. Seo, A. Kruthiventy, S. Lee, M. Teng, S. Choi, X. Zhang, J. Choi, and R. Horowitz. Equicon- tact: A hierarchical se (3) vision-to-force equivariant policy for spatially generalizable contact- rich tasks.arXiv preprint arXiv:2507.10961, 2025

arXiv 2025

[20] [20]

B. Zhou, R. Jiao, Y . Li, X. Yuan, F. Fang, and S. Li. Admittance visuomotor policy learning for general-purpose contact-rich manipulations.IEEE Transactions on Industrial Electronics, 2025

2025

[21] [21]

J. J. Liu, Y . Li, K. Shaw, T. Tao, R. Salakhutdinov, and D. Pathak. FACTR: Force-attending curriculum training for contact-rich policy learning, 2025

2025

[22] [22]

Hafner, J

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models, 2023

2023

[23] [23]

S. Wang, J. Shi, Z. Fu, X. He, F. Liu, C. Yang, Y . Zhou, Z. Fei, J. Gong, J. Fu, M. Z. Shou, X. Huang, X. Qiu, and Y .-G. Jiang. World action models: The next frontier in embodied ai, 2026

2026

[24] [24]

Y . Shen, F. Wei, Z. Du, Y . Liang, Y . Lu, J. Yang, N. Zheng, and B. Guo. Videovla: Video generators can be generalizable robot manipulators.Advances in neural information processing systems, 38:95597–95621, 2026

2026

[25] [25]

B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y . Ze, T. Harada, P. Torr, et al. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080, 2026

Pith/arXiv arXiv 2026

[26] [26]

J. Lyu, Z. Li, X. Shi, C. Xu, Y . Wang, and H. Wang. Dywa: Dynamics-adaptive world ac- tion model for generalizable non-prehensile manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11058–11068, 2025

2025

[27] [27]

X. Liu, Z. Bai, H. Ci, K. Y . Ma, and M. Z. Shou. World-vla-loop: Closed-loop learning of video world model and vla policy.arXiv preprint arXiv:2602.06508, 2026

Pith/arXiv arXiv 2026

[28] [28]

A. L. Chandra, I. Nematollahi, C. Huang, T. Welschehold, W. Burgard, and A. Valada. Diwa: Diffusion policy adaptation with world models. InConference on Robot Learning, pages 3378–

[29] [29]

Zheng, J

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, et al. Flare: Robot learning with implicit world modeling. InConference on Robot Learning, pages 3952–3971. PMLR, 2025

2025

[30] [30]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, H. Li, J. Li, J. Lv, J. Liu, et al. Gigaworld- policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

arXiv 2026

[31] [31]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024. 10

Pith/arXiv arXiv 2024

[32] [32]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025

[33] [33]

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026

2026

[34] [34]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xi- ang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. Fan, and J. Jang. World action mode...

2026

[35] [35]

S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Pith/arXiv arXiv 2025

[36] [36]

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Cou- pling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025

Pith/arXiv arXiv 2025

[37] [37]

Seo, B.-J

S. Seo, B.-J. Lee, J. Lee, H. Hwang, H. Yang, and K.-E. Kim. Mitigating covariate shift in behavioral cloning via robust stationary distribution correction. InAdvances in Neural Infor- mation Processing Systems, volume 37, 2024

2024

[38] [38]

S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the Fourteenth International Confer- ence on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 627–635, 2011

2011

[39] [39]

Kelly, C

M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

2019

[40] [40]

Spencer, S

J. Spencer, S. Choudhury, M. Barnes, M. Schmittle, M. Chiang, P. Ramadge, and S. Srinivasa. Learning from interventions: Human-robot interaction as both explicit and implicit feedback. InRobotics: Science and Systems, 2020

2020

[41] [41]

Mandlekar, D

A. Mandlekar, D. Xu, R. Mart ´ın-Mart´ın, Y . Zhu, L. Fei-Fei, and S. Savarese. Human-in-the- loop imitation learning using remote teleoperation, 2020

2020

[42] [42]

H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y . Zhu. Robot learning on the job: Human-in- the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 2022

2022

[43] [43]

Hoque, A

R. Hoque, A. Balakrishna, C. Putterman, M. Luo, D. S. Brown, D. Seita, B. Thananjeyan, E. Novoseller, and K. Goldberg. LazyDAgger: Reducing context switching in interactive imitation learning. InIEEE International Conference on Automation Science and Engineering, pages 502–509, 2021

2021

[44] [44]

Hoque, A

R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg. ThriftyDAgger: Budget-aware novelty and risk gating for interactive imitation learning, 2021

2021

[45] [45]

P. Wu, Y . Shentu, Q. Liao, D. Jin, M. Guo, K. Sreenath, X. Lin, and P. Abbeel. Robocopi- lot: Human-in-the-loop interactive imitation learning for robot manipulation.arXiv preprint arXiv:2503.07771, 2025

arXiv 2025

[46] [46]

X. Xu, Y . Hou, C. Xin, Z. Liu, and S. Song. Compliant residual DAgger: Improving real- world contact-rich manipulation with human corrections. InAdvances in Neural Information Processing Systems, 2025. 11

2025

[47] [47]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. InIEEE International Conference on Robotics and Automation, pages 6023–6029, 2019

2019

[48] [48]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement: Residual RL for precise assembly, 2024

2024

[49] [49]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model, 2024

2024

[50] [50]

T. He, J. Gao, W. Xiao, Y . Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole- body skills.arXiv preprint arXiv:2502.01143, 2025

arXiv 2025

[51] [51]

Haldar, J

S. Haldar, J. Pari, A. Rai, and L. Pinto. Teach a robot to fish: Versatile imitation from one minute of demonstrations.arXiv preprint arXiv:2303.01497, 2023

arXiv 2023

[52] [52]

Guzey, Y

I. Guzey, Y . Dai, B. Evans, S. Chintala, and L. Pinto. See to touch: Learning tactile dexterity through visual incentives. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13825–13832. IEEE, 2024

2024

[53] [53]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 12 A Appendix A.1 Hardware and Training Details Hardware.Figure 6 shows the physical layout of the follower robot, FACTR leader ...

Pith/arXiv arXiv 2025