Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model

Lusong Li; Qiwei Liang; Renjing Xu; Taowen Wang; Yichi Wang; Yuetong Fang; Yunheng Wang; Zecui Zeng; Zhengtu Liang; Zizhao Yuan

arxiv: 2606.27325 · v1 · pith:3RNHGQC6new · submitted 2026-06-25 · 💻 cs.CV


Pith reviewed 2026-06-26 05:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords: action conditioning · dexterous manipulation · world models · video prediction · high-DoF control · semantic priors · action tokenization


The pith

High-DoF dexterous actions need structured conditioning with tokenization and semantic priors instead of uniform compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that high-DoF dexterous actions vary across orders of magnitude, so compressing an entire sequence into one vector creates optimization imbalances that obscure subtle motion signals. DexAC counters this by tokenizing actions to retain per-dimension semantics, then applying local refinement and global modulation to match visual dynamics, while a separate semantic branch supplies object-scene context. These changes produce measurable gains in FID, FVD, and PCK on EgoDex and EgoVerse, indicating tighter visual-temporal realism and action consistency. A reader would care because the approach offers a route to scale world models toward complex robotic control without simply enlarging model capacity.

Core claim

DexAC-WM models action conditioning as a structured process rather than global compression: actions are tokenized to preserve dimension-level semantics, then aligned with visual dynamics via local refinement and global modulation; a semantic branch further supplies object-scene priors that support high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that the combination improves FID, FVD, and PCK, and DexAC transfers to other backbones.

What carries the argument

DexAC, the structured action-conditioning mechanism that tokenizes high-DoF actions and applies local refinement plus global modulation to align heterogeneous signals with visual dynamics.
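The contrast between global compression and tokenization can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the function names, the per-dimension standardization, and the token layout (one token per timestep and dimension) are all assumptions made here for clarity.

```python
import math

def per_dim_stats(actions):
    """Mean and population std of each action dimension over the sequence."""
    steps = len(actions)
    dims = len(actions[0])
    stats = []
    for d in range(dims):
        column = [step[d] for step in actions]
        mean = sum(column) / steps
        var = sum((v - mean) ** 2 for v in column) / steps
        stats.append((mean, math.sqrt(var) or 1.0))  # fall back to 1.0 if std is zero
    return stats

def global_compress(actions, weights):
    """Baseline: flatten the (T x D) sequence and project it to one vector."""
    flat = [v for step in actions for v in step]
    return [sum(w * v for w, v in zip(row, flat)) for row in weights]

def tokenize(actions):
    """Hypothetical tokenizer: one token per (timestep, dimension) pair.

    Each token keeps its dimension index, so a downstream module can treat
    large-scale and subtle components separately (an assumption of this sketch).
    """
    stats = per_dim_stats(actions)
    tokens = []
    for t, step in enumerate(actions):
        for d, value in enumerate(step):
            mean, std = stats[d]
            tokens.append({"t": t, "dim": d, "value": (value - mean) / std})
    return tokens

# Toy sequence: dimension 0 is a large motion, dimension 1 a subtle signal.
actions = [[0.5, 0.002], [0.7, 0.004]]
compressed = global_compress(actions, weights=[[1.0, 1.0, 1.0, 1.0]])
tokens = tokenize(actions)
```

In the compressed vector the subtle dimension contributes only a few thousandths to a sum dominated by the large motion, whereas each token carries its own dimension index and a value on a comparable scale.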

If this is right

Structured tokenization and modulation reduce imbalance in fine-grained action effects.
The semantic branch supplies priors that improve capture of dynamic visual details under high-DoF control.
DexAC extends to other backbone architectures.
Gains appear in both visual realism metrics and action-following consistency metrics.
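For readers unfamiliar with PCK (Percentage of Correct Keypoints), the metric the abstract pairs with action-following consistency, a minimal sketch follows. The fixed distance threshold and the 2D point format are assumptions of this example; the summary does not say which threshold or normalization the paper uses.

```python
import math

def pck(pred, target, threshold):
    """Fraction of keypoints whose prediction lies within `threshold` of the ground truth."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    correct = sum(1 for p, g in zip(pred, target) if math.dist(p, g) <= threshold)
    return correct / len(pred)

# Three hypothetical keypoints: errors of 1, 5, and 0 units.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
target = [(0.0, 1.0), (0.0, 0.0), (10.0, 10.0)]
score = pck(pred, target, threshold=2.0)  # two of the three are within 2 units
```

Higher PCK is better, in contrast to FID and FVD, where lower values indicate closer agreement with the reference distribution.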

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenization-plus-modulation pattern could be tested on other high-dimensional control signals such as full-body pose sequences.
Semantic priors might allow smaller visual encoders to reach comparable fidelity when actions are already well-structured.
Deployment on physical hardware would reveal whether the gains observed on egocentric video benchmarks survive sensor noise and actuation delays.

Load-bearing premise

High-DoF dexterous actions contain components of widely different magnitudes whose uniform aggregation produces optimization imbalance across those components.
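One way to see why the premise is plausible is a toy linear compressor, which is an assumption of this sketch and not the paper's architecture. With prediction w · a and squared-error loss, the gradient for weight i is (w · a - target) * a_i, so each weight's gradient scales directly with the magnitude of its action component.

```python
def weight_gradients(action, weights, target):
    """Gradient of 0.5 * (w . a - target)^2 with respect to each weight."""
    prediction = sum(w * a for w, a in zip(weights, action))
    error = prediction - target
    return [error * a for a in action]

# Hypothetical action: one large-scale component and one subtle component.
action = [1.0, 0.001]
grads = weight_gradients(action, weights=[0.5, 0.5], target=0.0)
ratio = abs(grads[0]) / abs(grads[1])  # roughly 1000: the large component dominates
```

Whether the same effect appears in a deep video model is exactly what the premise asserts and what direct diagnostics would need to confirm.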

What would settle it

If a baseline that compresses the full action sequence into a single vector matches or beats the FID, FVD, and PCK scores of DexAC plus the semantic branch on the same EgoDex and EgoVerse test sets, the claimed benefit of structured conditioning would be refuted.
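The test above depends on metric direction: FID and FVD improve as they fall, PCK improves as it rises. A small helper makes the comparison explicit. The function name is hypothetical and the numbers below are made-up placeholders, not results from the paper.

```python
# True means lower is better for that metric.
LOWER_IS_BETTER = {"FID": True, "FVD": True, "PCK": False}

def baseline_matches_or_beats(baseline, structured):
    """Return True only if the baseline is at least as good on every metric."""
    for name, lower in LOWER_IS_BETTER.items():
        b, s = baseline[name], structured[name]
        if lower and b > s:
            return False
        if not lower and b < s:
            return False
    return True

# Placeholder scores for illustration only.
baseline = {"FID": 30.0, "FVD": 200.0, "PCK": 0.60}
structured = {"FID": 25.0, "FVD": 180.0, "PCK": 0.70}
refuted = baseline_matches_or_beats(baseline, structured)
```

With these placeholder values the baseline is worse on all three metrics, so `refuted` is `False`.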

Original abstract

Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DexAC tokenizes high-DoF actions with local-global modulation plus a semantic branch and reports FID/FVD/PCK gains on EgoDex and EgoVerse, but the imbalance motivation lacks direct component-wise evidence.


The paper's core move is to stop treating high-DoF actions as one compressed vector and instead tokenize them, apply local refinement plus global modulation, and add a semantic branch for object-scene context. Experiments on EgoDex and EgoVerse show the full combination lifts the three metrics, and the design reportedly ports to other backbones.

That structured conditioning is the actual novelty. It directly targets the scale mismatch in dexterous actions, where big motions and tiny signals sit in the same sequence. Keeping dimension-level semantics and tying them to visual dynamics is a reasonable engineering response to a known pain point in robotics world models.

The results are consistent with the claim that the combination helps visual-temporal quality and action fidelity. Extending the method beyond the main backbone is also useful to see.

The soft spot is the central motivation. The abstract states that uniform aggregation creates optimization imbalance across action components, yet the reported experiments only give overall metric improvements. No per-dimension loss breakdowns, gradient magnitude checks, or isolated ablation of tokenized versus uniform conditioning appear in the provided text. Without that, the imbalance remains an assumption rather than a measured effect.

This work is for researchers already building or using action-conditioned world models in high-DoF settings. Someone looking for a concrete design pattern to try on dexterous data will find the tokenization and semantic branch worth examining.

It deserves peer review. The problem is practical, the proposal is specific, and the datasets are relevant. A referee can ask for the missing ablations and check whether the gains hold under tighter controls.

Referee Report

1 major / 0 minor

Summary. The paper claims that uniform compression of high-DoF dexterous action sequences into a single representation causes optimization imbalance across heterogeneous action components (large-scale motions vs. subtle signals), hindering fine-grained modeling. It proposes DexAC-WM, which uses action tokenization to preserve dimension-level semantics, combined with local refinement and global modulation for alignment with visual dynamics, plus a semantic branch providing object-scene priors. Experiments on EgoDex and EgoVerse report gains in FID, FVD, and PCK when combining the semantic branch with DexAC, with additional verification of scalability to other backbones.

Significance. If the gains prove robust and the structured conditioning is shown to specifically alleviate the hypothesized imbalance, the work would highlight an important but underexplored aspect of action conditioning in world models for dexterous control, potentially improving visual-temporal consistency and action fidelity in high-DoF settings. The extension to other backbones is a positive indicator of generality.

major comments (1)

[Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.
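A per-dimension breakdown of the kind requested could look like the sketch below. It assumes a per-dimension error signal is available, for example from actions recovered from predicted video; that setup, the function name, and the toy data are assumptions of this example. The report gives each dimension's share of the aggregate error next to its error relative to that dimension's own variance.

```python
def per_dim_breakdown(pred, target):
    """Per-dimension MSE, share of total MSE, and MSE relative to target variance."""
    steps, dims = len(target), len(target[0])
    report = []
    for d in range(dims):
        truth = [target[t][d] for t in range(steps)]
        guess = [pred[t][d] for t in range(steps)]
        mse = sum((g - y) ** 2 for g, y in zip(guess, truth)) / steps
        mean = sum(truth) / steps
        var = sum((y - mean) ** 2 for y in truth) / steps
        relative = mse / var if var else float("nan")
        report.append({"dim": d, "mse": mse, "relative_mse": relative})
    total = sum(row["mse"] for row in report)
    for row in report:
        row["share"] = row["mse"] / total if total else 0.0
    return report

# Toy data: dimension 0 is large and nearly right, dimension 1 is tiny and ignored.
target = [[1.0, 0.001], [-1.0, -0.001]]
pred = [[0.9, 0.0], [-0.9, 0.0]]
report = per_dim_breakdown(pred, target)
```

In this toy case dimension 1 is missed entirely (relative error of 1.0) yet accounts for about 0.01% of the aggregate error, which is the kind of imbalance an overall metric would hide.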

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to incorporate additional experimental support.


Referee: [Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.

Authors: We agree that the manuscript lacks direct quantitative diagnostics (e.g., per-dimension losses or gradient magnitudes) for the hypothesized optimization imbalance. Our motivation derives from development-stage observations that uniform compression degraded fidelity on low-magnitude action dimensions, which is indirectly supported by the reported PCK gains. To directly address the concern, the revised version will add (i) an ablation isolating uniform compression against tokenized conditioning and (ii) per-dimension loss breakdowns on EgoDex to quantify the imbalance. These additions will be placed in the experiments section and will not change the core claims or results.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical proposal without load-bearing derivations or self-referential reductions


The paper presents DexAC-WM as an empirical architecture motivated by the stated observation that high-DoF actions are heterogeneous and that uniform aggregation produces optimization imbalance. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to the input by construction. The central claims rest on experimental metrics (FID, FVD, PCK) on EgoDex and EgoVerse rather than on any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The method is therefore self-contained against external benchmarks with no circular steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review based solely on abstract; full paper not available so ledger is minimal and provisional.

axioms (1)

Domain assumption: high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, which leads to optimization imbalance under uniform aggregation.
Stated directly in the abstract as the core observation motivating the method.

invented entities (2)

DexAC (no independent evidence)
Purpose: structured action conditioning via tokenization, local refinement, and global modulation.
New proposed component introduced to replace global compression.

Semantic branch (no independent evidence)
Purpose: provide rich object-scene priors for dynamic visual details.
New module added to address limited high-level semantic grounding.

pith-pipeline@v0.9.1-grok · 5841 in / 1300 out tokens · 25212 ms · 2026-06-26T05:20:43.009673+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 23 linked inside Pith

[1]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

2022
[2]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InCVPR, 2024

2024
[3]

Egosim: Egocentric exploration in virtual worlds with multi-modal conditioning

Wei Yu, Songheng Yin, Steve Easterbrook, and Animesh Garg. Egosim: Egocentric exploration in virtual worlds with multi-modal conditioning. InICLR, 2025

2025
[4]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InICML, 2024. 15

2024
[5]

Daydreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InCoRL, 2023

2023
[6]

Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023
[7]

Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023

Pith/arXiv arXiv 2023
[8]

Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

Pith/arXiv arXiv 2023
[9]

Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

Pith/arXiv arXiv 2025
[10]

The neural mechanisms of manual dexterity.Nature Reviews Neuroscience, 2021

Anton R Sobinov and Sliman J Bensmaia. The neural mechanisms of manual dexterity.Nature Reviews Neuroscience, 2021

2021
[11]

World models for learning dexterous hand-object interactions from human videos, 2026

Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models for learning dexterous hand-object interactions from human videos, 2026

2026
[12]

Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

arXiv 2025
[13]

A survey of embodied learning for object-centric robotic manipulation.Machine Intelligence Research, 2025

Ying Zheng, Lei Yao, Yuejiao Su, Yi Zhang, Yi Wang, Sicheng Zhao, Yiyi Zhang, and Lap-Pui Chau. A survey of embodied learning for object-centric robotic manipulation.Machine Intelligence Research, 2025

2025
[14]

Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang, et al. Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

arXiv 2025
[15]

A survey of multifingered robotic manipulation: Biological results, structural evolvements, and learning methods.Frontiers in Neurorobotics, 2022

Yinlin Li, Peng Wang, Rui Li, Mo Tao, Zhiyong Liu, and Hong Qiao. A survey of multifingered robotic manipulation: Biological results, structural evolvements, and learning methods.Frontiers in Neurorobotics, 2022

2022
[16]

Dexcap: Scalable and portable mocap data collection system for dexterous manipulation.arXiv preprint arXiv:2403.07788, 2024

Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and C Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation.arXiv preprint arXiv:2403.07788, 2024

arXiv 2024
[17]

Dexmv: Imitation learning for dexterous manipulation from human videos

Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. InECCV, 2022

2022
[18]

Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

arXiv 2023
[19]

Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation

Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation. InCoRL, 2023

2023
[20]

Dextreme: Transfer of agile in-hand manipulation from simulation to reality

Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, et al. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. InICRA, 2023

2023
[21]

Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023

Pith/arXiv arXiv 2023
[22]

Affordance diffusion: Synthesizing hand-object interactions

Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, and Sifei Liu. Affordance diffusion: Synthesizing hand-object interactions. InCVPR, 2023. 16

2023
[23]

Follow your pose: Pose-guided text-to-video generation using pose-free videos.AAAI, 2024

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos.AAAI, 2024

2024
[24]

Vggt- world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, and Yadan Luo. Vggt- world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

arXiv 2026
[25]

Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983, 2024

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning.arXiv preprint arXiv:2411.04983, 2024

Pith/arXiv arXiv 2024
[26]

Omniworld: A multi-domain and multi-modal dataset for 4d world modeling.arXiv preprint arXiv:2509.12201, 2025

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling.arXiv preprint arXiv:2509.12201, 2025

arXiv 2025
[27]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[28]

Dinov3.arXiv preprint arXiv:2508.10104, 2025

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khali- dov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025
[29]

Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024

Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024

Pith/arXiv arXiv 2024
[30]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025

Pith/arXiv arXiv 2025
[31]

Irasim: A fine-grained world model for robot manipulation

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation. InICCV, 2025

2025
[32]

Eva: Aligning video world models with executable robot actions via inverse dynamics rewards.arXiv preprint arXiv:2603.17808, 2026

Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. Eva: Aligning video world models with executable robot actions via inverse dynamics rewards.arXiv preprint arXiv:2603.17808, 2026

arXiv 2026
[33]

Flare: Robot learning with implicit world modeling.arXiv preprint arXiv:2505.15659, 2025

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling.arXiv preprint arXiv:2505.15659, 2025

Pith/arXiv arXiv 2025
[34]

Lumos: Language-conditioned imitation learning with world models

Iman Nematollahi, Branton DeMoss, Akshay L Chandra, Nick Hawes, Wolfram Burgard, and Ingmar Posner. Lumos: Language-conditioned imitation learning with world models. InICRA, 2025

2025
[35]

Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning.arXiv preprint arXiv:2503.01837, 2025

Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, and Hao Su. Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning.arXiv preprint arXiv:2503.01837, 2025

arXiv 2025
[36]

Prompting with the future: Open-world model predictive control with interactive digital twins.arXiv preprint arXiv:2506.13761, 2025

Chuanruo Ning, Kuan Fang, and Wei-Chiu Ma. Prompting with the future: Open-world model predictive control with interactive digital twins.arXiv preprint arXiv:2506.13761, 2025

arXiv 2025
[37]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025
[38]

Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023
[39]

Mastering atari, go, chess and shogi by planning with a learned model.Nature, 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 2020. 17

2020
[40]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023
[41]

Decision transformer: Reinforcement learning via sequence modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. NeurIPS, 2021

2021
[42]

Masked visual pre-training for motor control.arXiv preprint arXiv:2203.06173, 2022

Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control.arXiv preprint arXiv:2203.06173, 2022

arXiv 2022
[43]

R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022
[44]

Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

Pith/arXiv arXiv 2022
[45]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InECCV, 2018

2018
[46]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction

Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In CVPR, 2022

2022
[47]

Videodex: Learning dexterity from internet videos

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In CoRL, 2023

2023
[48]

Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InCVPR, 2025

2025
[49]

Egomimic: Scaling imitation learning via egocentric video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. InICRA, 2025

2025
[50]

Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Pith/arXiv arXiv 2025
[51]

Openego: A large-scale multimodal egocentric dataset for dexterous manipulation.arXiv preprint arXiv:2509.05513, 2025

Ahad Jawaid and Yu Xiang. Openego: A large-scale multimodal egocentric dataset for dexterous manipulation.arXiv preprint arXiv:2509.05513, 2025

arXiv 2025
[52]

Egoverse: An egocentric human dataset for robot learning from around the world.arXiv preprint arXiv:2604.07607, 2026

Ryan Punamiya, Simar Kareer, Zeyi Liu, Josh Citron, Ri-Zhao Qiu, Xiongyi Cai, Alexey Gavryushin, Jiaqi Chen, Davide Liconti, Lawrence Y Zhu, et al. Egoverse: An egocentric human dataset for robot learning from around the world.arXiv preprint arXiv:2604.07607, 2026

Pith/arXiv arXiv 2026
[53]

Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

arXiv 2026
[54]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025

2025
[55]

Dexterous world models.arXiv preprint arXiv:2512.17907, 2025

Byungjun Kim, Taeksoo Kim, Junyoung Lee, and Hanbyul Joo. Dexterous world models.arXiv preprint arXiv:2512.17907, 2025

arXiv 2025
[56]

World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025
[57]

Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 18

Pith/arXiv arXiv 2025
[58]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[59]

Vima: General robot manipulation with multimodal prompts

Y Zhu et al. Vima: General robot manipulation with multimodal prompts. InICLR, 2023

2023
[60]

Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, et al. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

Pith/arXiv arXiv 2025
[61]

Image quality metrics: Psnr vs

Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. InICPR, 2010

2010
[62]

Image quality assessment: from error visibility to structural similarity.IEEE TIP, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE TIP, 2004

2004
[63]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[64]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.

[65]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

[66]

Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE TPAMI, 2012.

[67]

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.

[68]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[69]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.

A Overview of the Appendix

This appendix contains additional analysis, experimental details, and discussions, organized as follows:

• Sec. B outlines the additional implementation details in experi...
