arxiv: 2606.27325 · v1 · pith:3RNHGQC6new · submitted 2026-06-25 · 💻 cs.CV

Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model

Pith reviewed 2026-06-26 05:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords action conditioning · dexterous manipulation · world models · video prediction · high-DoF control · semantic priors · action tokenization

The pith

High-DoF dexterous actions need structured conditioning with tokenization and semantic priors instead of uniform compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that high-DoF dexterous actions vary across orders of magnitude, so compressing an entire sequence into one vector creates optimization imbalances that obscure subtle motion signals. DexAC counters this by tokenizing actions to retain per-dimension semantics, then applying local refinement and global modulation to match visual dynamics, while a separate semantic branch supplies object-scene context. These changes produce measurable gains in FID, FVD, and PCK on EgoDex and EgoVerse, indicating tighter visual-temporal realism and action consistency. A reader would care because the approach offers a route to scale world models toward complex robotic control without simply enlarging model capacity.

Core claim

DexAC-WM models action conditioning as a structured process rather than global compression: actions are tokenized to preserve dimension-level semantics, then aligned with visual dynamics via local refinement and global modulation; a semantic branch further supplies object-scene priors that support high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that the combination improves FID, FVD, and PCK, and DexAC transfers to other backbones.

What carries the argument

DexAC, the structured action-conditioning mechanism that tokenizes high-DoF actions and applies local refinement plus global modulation to align heterogeneous signals with visual dynamics.
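
As a rough illustration of the contrast drawn here, the sketch below sets a flattened action sequence beside per-dimension tokens. Everything in it is an assumption made for illustration: `global_compress`, `tokenize_per_dimension`, the per-dimension standardization, and the toy values are hypothetical, and the abstract does not specify DexAC's actual tokenizer, local refinement, or global modulation.

```python
from statistics import mean, pstdev


def global_compress(actions):
    """Uniform compression stand-in: flatten the (T x D) sequence into one list."""
    return [value for step in actions for value in step]


def tokenize_per_dimension(actions):
    """Hypothetical tokenizer: one token per action dimension.

    Each token keeps its dimension index, its raw scale, and a standardized
    trajectory, so a small-magnitude dimension is not drowned out by a large one.
    """
    num_dims = len(actions[0])
    tokens = []
    for d in range(num_dims):
        trajectory = [step[d] for step in actions]
        mu = mean(trajectory)
        sigma = pstdev(trajectory) or 1.0  # guard against constant dimensions
        tokens.append({
            "dim": d,
            "scale": sigma,
            "values": [(v - mu) / sigma for v in trajectory],
        })
    return tokens


# Toy sequence (assumed values): dim 0 is a large wrist translation,
# dim 1 is a tiny fingertip flexion.
actions = [[0.0, 0.000], [10.0, 0.001], [20.0, 0.002]]
flat = global_compress(actions)
tokens = tokenize_per_dimension(actions)
```

In this toy case both tokens carry the same standardized trajectory even though their raw scales differ by four orders of magnitude; the raw scale is kept alongside the token rather than discarded.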

If this is right

  • Structured tokenization and modulation reduce imbalance in fine-grained action effects.
  • The semantic branch supplies priors that improve capture of dynamic visual details under high-DoF control.
  • DexAC extends to other backbone architectures, which suggests the conditioning design is not tied to a single model.
  • Gains appear in both visual realism metrics and action-following consistency metrics.
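
For readers unfamiliar with PCK (percentage of correct keypoints), the sketch below shows the usual idea: a predicted keypoint counts as correct when it lies within a distance threshold of its ground-truth location. The threshold, its normalization, and how keypoints are extracted from generated frames are not given in the abstract, so the values here are placeholders.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Fraction of predicted keypoints within `threshold` of the ground truth."""
    correct = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if math.dist(p, g) <= threshold
    )
    return correct / len(ground_truth)


# Toy 2D keypoints in pixels (assumed values); distances are 1, 2, and 10.
pred = [(10.0, 10.0), (20.0, 22.0), (30.0, 40.0)]
gt = [(10.0, 11.0), (20.0, 20.0), (30.0, 30.0)]
score = pck(pred, gt, threshold=2.5)  # two of three keypoints fall inside
```

Unlike FID and FVD, where lower is better, a higher PCK indicates better agreement.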

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization-plus-modulation pattern could be tested on other high-dimensional control signals such as full-body pose sequences.
  • Semantic priors might allow smaller visual encoders to reach comparable fidelity when actions are already well-structured.
  • Deployment on physical hardware would reveal whether the gains observed on egocentric video benchmarks survive sensor noise and actuation delays.

Load-bearing premise

High-DoF dexterous actions contain components of widely different magnitudes whose uniform aggregation produces optimization imbalance across those components.
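
One way to see how such an imbalance could arise is a single linear layer trained with squared error, where the gradient on each weight is proportional to the magnitude of its input. The sketch below is a toy illustration under that assumption, with hypothetical names and values; it is not a diagnostic taken from the paper.

```python
def mse_weight_gradients(weights, action, target):
    """Gradient of (w . a - target)^2 with respect to each weight."""
    prediction = sum(w * a for w, a in zip(weights, action))
    residual = prediction - target
    # d/dw_i of residual^2 is 2 * residual * a_i, so it scales with a_i.
    return [2.0 * residual * a for a in action]


# Assumed toy action: a large wrist motion next to a tiny finger motion.
action = [10.0, 0.001]
weights = [0.0, 0.0]
grads = mse_weight_gradients(weights, action, target=1.0)
ratio = abs(grads[0]) / abs(grads[1])
```

Here the weight attached to the large component receives a gradient about 10,000 times larger than the weight attached to the small one, matching the ratio of the input magnitudes.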

What would settle it

If a baseline that compresses the full action sequence into a single vector matches or outperforms DexAC plus the semantic branch on FID, FVD, and PCK over the same EgoDex and EgoVerse test sets, the claimed benefit of structured conditioning would be refuted.

Original abstract

Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that uniform compression of high-DoF dexterous action sequences into a single representation causes optimization imbalance across heterogeneous action components (large-scale motions vs. subtle signals), hindering fine-grained modeling. It proposes DexAC-WM, which uses action tokenization to preserve dimension-level semantics, combined with local refinement and global modulation for alignment with visual dynamics, plus a semantic branch providing object-scene priors. Experiments on EgoDex and EgoVerse report gains in FID, FVD, and PCK when combining the semantic branch with DexAC, with additional verification of scalability to other backbones.

Significance. If the gains prove robust and the structured conditioning is shown to specifically alleviate the hypothesized imbalance, the work would highlight an important but underexplored aspect of action conditioning in world models for dexterous control, potentially improving visual-temporal consistency and action fidelity in high-DoF settings. The extension to other backbones is a positive indicator of generality.

major comments (1)
  1. [Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to incorporate additional experimental support.

  1. Referee: [Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.

    Authors: We agree that the manuscript lacks direct quantitative diagnostics (e.g., per-dimension losses or gradient magnitudes) for the hypothesized optimization imbalance. Our motivation derives from development-stage observations that uniform compression degraded fidelity on low-magnitude action dimensions, which is indirectly supported by the reported PCK gains. To directly address the concern, the revised version will add (i) an ablation isolating uniform compression against tokenized conditioning and (ii) per-dimension loss breakdowns on EgoDex to quantify the imbalance. These additions will be placed in the experiments section and will not change the core claims or results. revision: yes
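
The per-dimension loss breakdown promised in the response could take a form like the sketch below. It assumes, hypothetically, that an error can be attributed to individual action dimensions (for example, an action-reconstruction error); the function names and toy numbers are illustrative and do not come from the paper.

```python
def per_dimension_mse(predicted, target):
    """Mean squared error over time, reported separately for each dimension."""
    num_steps = len(predicted)
    num_dims = len(predicted[0])
    return [
        sum((predicted[t][d] - target[t][d]) ** 2 for t in range(num_steps))
        / num_steps
        for d in range(num_dims)
    ]


def loss_shares(per_dim_losses):
    """Share of the summed loss contributed by each dimension."""
    total = sum(per_dim_losses)
    if total == 0:
        return [0.0 for _ in per_dim_losses]
    return [loss / total for loss in per_dim_losses]


# Assumed toy data: dim 0 has large errors, dim 1 has small but consistent ones.
predicted = [[1.0, 0.0], [3.0, 0.0]]
target = [[0.0, 0.1], [0.0, 0.1]]
losses = per_dimension_mse(predicted, target)
shares = loss_shares(losses)
```

In this toy case the large-magnitude dimension accounts for more than 99% of the summed loss, which is the kind of skew such a breakdown would be meant to expose.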

Circularity Check

0 steps flagged

No significant circularity; empirical proposal without load-bearing derivations or self-referential reductions

The paper presents DexAC-WM as an empirical architecture motivated by the stated observation that high-DoF actions are heterogeneous and that uniform aggregation produces optimization imbalance. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to the input by construction. The central claims rest on experimental metrics (FID, FVD, PCK) on EgoDex and EgoVerse rather than on any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The method is therefore self-contained against external benchmarks with no circular steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review based solely on abstract; full paper not available so ledger is minimal and provisional.

axioms (1)
  • domain assumption: High-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, which leads to optimization imbalance under uniform aggregation.
    Stated directly in abstract as the core observation motivating the method.
invented entities (2)
  • DexAC (no independent evidence)
    purpose: Structured action conditioning via tokenization, local refinement and global modulation
    New proposed component introduced to replace global compression
  • semantic branch (no independent evidence)
    purpose: Provide rich object-scene priors for dynamic visual details
    New module added to address limited high-level semantic grounding

pith-pipeline@v0.9.1-grok · 5841 in / 1300 out tokens · 25212 ms · 2026-06-26T05:20:43.009673+00:00 · methodology
