Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model
Pith reviewed 2026-06-26 05:20 UTC · model grok-4.3
The pith
High-DoF dexterous actions need structured conditioning with tokenization and semantic priors instead of uniform compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DexAC-WM models action conditioning as a structured process rather than global compression: actions are tokenized to preserve dimension-level semantics, then aligned with visual dynamics via local refinement and global modulation; a semantic branch further supplies object-scene priors that support high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that the combination improves FID, FVD, and PCK, and DexAC transfers to other backbones.
What carries the argument
DexAC, the structured action-conditioning mechanism that tokenizes high-DoF actions and applies local refinement plus global modulation to align heterogeneous signals with visual dynamics.
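To make the contrast concrete, here is a minimal sketch of the two conditioning styles this summary describes: compressing the whole action vector into one representation versus emitting one token per action dimension. It is not the authors' implementation. The per-dimension standardization, the dimension embeddings, the toy sizes, and every name (`global_compress`, `tokenize`, `value_proj`, `dim_embed`) are assumptions made for illustration, and the local refinement and global modulation stages are left out because this page does not specify them.

```python
import random

rng = random.Random(0)
ACTION_DIM, EMBED_DIM = 4, 3  # toy sizes, for illustration only


def rand_matrix(rows, cols):
    """Random stand-in for learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def global_compress(action, weight):
    """Uniform baseline: mix all action dimensions into one vector."""
    return [sum(w * a for w, a in zip(row, action)) for row in weight]


def tokenize(action, mean, std, value_proj, dim_embed):
    """Structured variant: one token per action dimension."""
    tokens = []
    for j, a in enumerate(action):
        a_norm = (a - mean[j]) / std[j]  # assumed per-dimension standardization
        tokens.append([a_norm * value_proj[j][d] + dim_embed[j][d]
                       for d in range(EMBED_DIM)])
    return tokens


# Hypothetical action: two large-motion dimensions, two subtle ones.
action = [0.5, -0.2, 0.003, 0.0004]
mean = [0.0, 0.0, 0.0, 0.0]
std = [0.5, 0.4, 0.002, 0.0005]  # assumed dataset statistics

single = global_compress(action, rand_matrix(EMBED_DIM, ACTION_DIM))
tokens = tokenize(action, mean, std,
                  rand_matrix(ACTION_DIM, EMBED_DIM),
                  rand_matrix(ACTION_DIM, EMBED_DIM))

print(len(single))                  # one vector of EMBED_DIM values
print(len(tokens), len(tokens[0]))  # ACTION_DIM tokens of EMBED_DIM values
```

In the tokenized form each dimension keeps its own slot and identity, so a downstream attention layer can weigh a subtle finger signal separately from a large wrist motion instead of receiving them already summed.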
If this is right
- Structured tokenization and modulation reduce imbalance in fine-grained action effects.
- The semantic branch supplies priors that improve capture of dynamic visual details under high-DoF control.
- DexAC extends to other backbone architectures, so the conditioning design is not tied to a single model.
- Gains appear in both visual realism metrics and action-following consistency metrics.
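For readers unfamiliar with the action-following metric, PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lies within a fraction `alpha` of a reference length from the ground truth. The sketch below shows that standard definition only. How the paper obtains keypoints from generated frames, and which reference length and `alpha` it uses, are not stated on this page, so `norm_length=20` and `alpha=0.2` are placeholder assumptions.

```python
import math


def pck(pred, gt, norm_length, alpha=0.2):
    """Fraction of keypoints predicted within alpha * norm_length of ground truth."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    threshold = alpha * norm_length
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)


# Hypothetical 2D keypoints in pixels; threshold = 0.2 * 20 = 4 pixels.
gt = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred = [(11, 10), (20, 25), (30, 31), (60, 40)]  # errors: 1, 5, 1, 20 pixels
score = pck(pred, gt, norm_length=20)
print(score)  # 0.5
```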
Where Pith is reading between the lines
- The same tokenization-plus-modulation pattern could be tested on other high-dimensional control signals such as full-body pose sequences.
- Semantic priors might allow smaller visual encoders to reach comparable fidelity when actions are already well-structured.
- Deployment on physical hardware would reveal whether the gains observed on the EgoDex and EgoVerse benchmarks survive sensor noise and actuation delays.
Load-bearing premise
High-DoF dexterous actions contain components of widely different magnitudes whose uniform aggregation produces optimization imbalance across those components.
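One way to see why this premise is plausible is a toy calculation, not an analysis from the paper. If a single linear layer computes z = W a and the loss is a squared error on z, the gradient with respect to W is the residual times the input, so the weights that read action dimension j receive updates proportional to the magnitude of that dimension. The scales, sizes, and random targets below are all assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)

# Hypothetical per-dimension action scales (not taken from the paper):
# two large-motion dimensions and three progressively subtler ones.
scales = [1.0, 1.0, 1e-1, 1e-2, 1e-3]
action_dim, latent_dim, batch = len(scales), 8, 256

actions = [[rng.gauss(0.0, 1.0) * s for s in scales] for _ in range(batch)]
targets = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(batch)]
# One linear layer that compresses the whole action vector: z = W a
W = [[rng.gauss(0.0, 0.1) for _ in range(action_dim)] for _ in range(latent_dim)]

# Gradient of the batch mean of 0.5 * ||W a - target||^2 with respect to W.
grad_W = [[0.0] * action_dim for _ in range(latent_dim)]
for a, t in zip(actions, targets):
    z = [sum(w * x for w, x in zip(row, a)) for row in W]
    residual = [zi - ti for zi, ti in zip(z, t)]
    for k in range(latent_dim):
        for j in range(action_dim):
            grad_W[k][j] += residual[k] * a[j] / batch

# Column j of grad_W is the update signal for the weights reading dimension j.
col_norms = [math.sqrt(sum(grad_W[k][j] ** 2 for k in range(latent_dim)))
             for j in range(action_dim)]
for s, n in zip(scales, col_norms):
    print(f"scale={s:g}  grad_norm={n:.2e}")
```

The printed gradient norms fall roughly in step with the input scales, which is the kind of imbalance the premise refers to. Whether the same effect dominates inside a full world model is exactly what the referee asks the authors to measure.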
What would settle it
If a baseline that compresses the full action sequence into a single vector matches or exceeds the FID, FVD, and PCK scores of DexAC plus the semantic branch on the same EgoDex and EgoVerse test sets, the claimed benefit of structured conditioning would be refuted.
Original abstract
Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that uniform compression of high-DoF dexterous action sequences into a single representation causes optimization imbalance across heterogeneous action components (large-scale motions vs. subtle signals), hindering fine-grained modeling. It proposes DexAC-WM, which uses action tokenization to preserve dimension-level semantics, combined with local refinement and global modulation for alignment with visual dynamics, plus a semantic branch providing object-scene priors. Experiments on EgoDex and EgoVerse report gains in FID, FVD, and PCK when combining the semantic branch with DexAC, with additional verification of scalability to other backbones.
Significance. If the gains prove robust and the structured conditioning is shown to specifically alleviate the hypothesized imbalance, the work would highlight an important but underexplored aspect of action conditioning in world models for dexterous control, potentially improving visual-temporal consistency and action fidelity in high-DoF settings. The extension to other backbones is a positive indicator of generality.
major comments (1)
- [Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to incorporate additional experimental support.
Referee: [Experiments (EgoDex/EgoVerse results)] The central motivation—that uniform aggregation of heterogeneous high-DoF actions produces optimization imbalance—is load-bearing for the claim that DexAC's tokenization + modulation is the appropriate remedy, yet the experiments section provides no direct supporting evidence such as component-wise gradient magnitudes, per-dimension loss contributions, or an ablation isolating uniform compression versus tokenized conditioning.
Authors: We agree that the manuscript lacks direct quantitative diagnostics (e.g., per-dimension losses or gradient magnitudes) for the hypothesized optimization imbalance. Our motivation derives from development-stage observations that uniform compression degraded fidelity on low-magnitude action dimensions, which is indirectly supported by the reported PCK gains. To directly address the concern, the revised version will add (i) an ablation isolating uniform compression against tokenized conditioning and (ii) per-dimension loss breakdowns on EgoDex to quantify the imbalance. These additions will be placed in the experiments section and will not change the core claims or results.
Revision: yes
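As a rough idea of what a per-dimension breakdown can reveal, the sketch below compares raw per-dimension mean squared error with the same error divided by each dimension's variance. It is a generic diagnostic under stated assumptions, not the analysis the authors promise: the page does not say which quantities they will break down, and the data, the function names, and the choice of variance normalization are all hypothetical.

```python
def per_dimension_mse(pred, target):
    """Mean squared error for each dimension separately."""
    n, d = len(target), len(target[0])
    return [sum((p[j] - t[j]) ** 2 for p, t in zip(pred, target)) / n
            for j in range(d)]


def relative_mse(pred, target):
    """Per-dimension MSE divided by that dimension's target variance."""
    n, d = len(target), len(target[0])
    mse = per_dimension_mse(pred, target)
    out = []
    for j in range(d):
        mean_j = sum(t[j] for t in target) / n
        var_j = sum((t[j] - mean_j) ** 2 for t in target) / n
        if var_j == 0:
            raise ValueError(f"dimension {j} has zero variance")
        out.append(mse[j] / var_j)
    return out


# Hypothetical data: dimension 0 is large-scale, dimension 1 is subtle.
target = [[1.0, 0.001], [-1.0, -0.001], [2.0, 0.002], [-2.0, -0.002]]
# A model that tracks dimension 0 within 10% but outputs zero for dimension 1.
pred = [[0.9, 0.0], [-0.9, 0.0], [1.8, 0.0], [-1.8, 0.0]]

mse = per_dimension_mse(pred, target)
rel = relative_mse(pred, target)
print(mse)  # raw error is far larger on dimension 0
print(rel)  # relative error is far larger on dimension 1
```

The raw numbers suggest dimension 0 is the problem, while the normalized numbers show that dimension 1 is not modeled at all. An aggregate loss dominated by the raw terms would hide that failure, which is why the requested breakdown matters for the paper's central claim.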
Circularity Check
No significant circularity; empirical proposal without load-bearing derivations or self-referential reductions
The paper presents DexAC-WM as an empirical architecture motivated by the stated observation that high-DoF actions are heterogeneous and that uniform aggregation produces optimization imbalance. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to the input by construction. The central claims rest on experimental metrics (FID, FVD, PCK) on EgoDex and EgoVerse rather than on any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The method is therefore self-contained against external benchmarks with no circular steps identified.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, which leads to optimization imbalance under uniform aggregation.
invented entities (2)
- DexAC: no independent evidence
- semantic branch: no independent evidence