World Value Models for Robotic Manipulation
Pith reviewed 2026-06-25 23:32 UTC · model grok-4.3
The pith
Combining world models with value estimation yields more accurate robotic task progress assessments than vision-language model backbones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct the World Value Model by marrying a world model backbone to value estimation so that the model can both condition on historical context and anticipate future outcomes when judging task progress. This architecture is evaluated on standard benchmarks where it records state-of-the-art Value-Order Correlation scores and on the newly introduced Suboptimal-Value-Bench of 800 human-annotated suboptimal trajectories across multiple embodiments. When the resulting values are used to guide policy extraction, manipulation success rates rise across several methods in both simulated and real-world settings.
What carries the argument
World Value Model (WVM): a value estimation network whose backbone is a world model pretrained for sequential state prediction.
If this is right
- WVM records state-of-the-art Value-Order Correlation on existing benchmarks.
- The same leading performance holds on the new Suboptimal-Value-Bench that contains 800 human-labeled suboptimal trajectories.
- Value signals from WVM improve final manipulation performance when used with multiple policy extraction methods.
- Gains appear in both simulated environments and real-robot deployments.
Where Pith is reading between the lines
- The temporal advantage could extend to value estimation in other sequential embodied tasks such as navigation or long-horizon assembly.
- Scaling the underlying world model size might produce finer-grained value signals for extended task horizons.
- Value functions built this way could serve as an interface between predictive world models and reward-driven policy optimization.
- Further tests on additional robot embodiments would test how broadly the temporal benefit applies.
- keywords:[
- world models
- value models
- robotic manipulation
Load-bearing premise
That the temporal prediction strengths built into world models translate directly into more accurate value estimates for robotic manipulation trajectories.
What would settle it
Training a vision-language model backbone on the same robotic datasets with the same procedure and obtaining equal or higher Value-Order Correlation scores than the world-model version.
Figures
read the original abstract
Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding, requiring models to both ground the current belief using historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, lacking the requisite temporal modeling capabilities for value estimation. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that offers accurate task progressions to assess data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. Complementing standard evaluation suites that contains only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the World Value Model (WVM) by integrating world models with value estimation for robotic manipulation tasks. It argues that existing VLM-based value models lack temporal modeling capabilities, while world models excel at this, enabling accurate task progression assessment for data quality. The paper claims SOTA results on Value-Order Correlation (VOC) benchmarks, introduces the Suboptimal-Value-Bench with 800 suboptimal multi-embodiment trajectories and human-labeled annotations, and reports improved manipulation performance when WVM is used for policy learning in both simulation and real-world settings.
Significance. If the empirical claims hold after verification, the work would be significant for advancing generalist value models in robotics by leveraging world models' temporal strengths for learning from mixed-quality data. The introduction of Suboptimal-Value-Bench is a concrete contribution that addresses limitations of expert-only benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the empirical focus on suboptimal data robustness is a positive aspect if properly ablated.
major comments (2)
- [Abstract] Abstract: the claim of SOTA VOC results and downstream policy gains is presented without equations, training details, ablation studies, or statistical tests, preventing verification of the data against the claim and undermining assessment of the central empirical result.
- [Introduction] Introduction and motivation sections: the load-bearing argument that observed gains arise specifically from world-model temporal modeling (rather than pretraining data volume, architecture scale, or fine-tuning) is not isolated; no comparison to a temporally-augmented VLM baseline (e.g., video-frame input or future-prediction auxiliary loss) is described, so the attribution to marrying world models with value estimation cannot be confirmed.
minor comments (1)
- [Benchmark description] The new benchmark is described only at a high level; details on embodiment diversity, annotation protocol, and exact VOC computation should be expanded for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the two major comments point-by-point below and commit to revisions that improve clarity and strengthen the empirical attribution.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of SOTA VOC results and downstream policy gains is presented without equations, training details, ablation studies, or statistical tests, preventing verification of the data against the claim and undermining assessment of the central empirical result.
Authors: We acknowledge that the abstract's brevity precludes inclusion of equations, full training hyperparameters, or statistical tests. All such details appear in the main text: model formulation and training in Section 3, ablation studies in Section 4.3, and statistical significance reporting in Section 4.2. To improve verifiability, we will revise the abstract to reference these sections explicitly and add one sentence summarizing the evaluation protocol and significance testing. revision: yes
-
Referee: [Introduction] Introduction and motivation sections: the load-bearing argument that observed gains arise specifically from world-model temporal modeling (rather than pretraining data volume, architecture scale, or fine-tuning) is not isolated; no comparison to a temporally-augmented VLM baseline (e.g., video-frame input or future-prediction auxiliary loss) is described, so the attribution to marrying world models with value estimation cannot be confirmed.
Authors: We agree that isolating the contribution of world-model temporal modeling requires a controlled comparison against a temporally-augmented VLM. Our existing baselines compare against standard VLM value models, but we did not include an explicit video-input or auxiliary-prediction VLM variant. We will add this baseline experiment in the revised manuscript (new subsection in Section 4) to directly test whether the observed gains are attributable to the world-model integration rather than scale or data volume. revision: yes
Circularity Check
No circularity; empirical SOTA claims rest on external benchmarks and new labeled data
full rationale
The manuscript presents an empirical contribution: a new model (WVM) evaluated on standard benchmarks plus a newly introduced Suboptimal-Value-Bench with human-labeled annotations. No mathematical derivation chain, fitted-parameter-as-prediction step, or self-citation load-bearing premise appears in the abstract or described structure. The motivating claim that world models supply superior temporal modeling is tested by direct performance comparison rather than assumed by definition or prior self-work. The result is therefore self-contained against external data and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
Pith/arXiv arXiv 2025
-
[2]
Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001
Bram Bakker. Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001
2001
-
[3]
A distributional perspective on reinforcement learning
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. Pmlr, 2017
2017
-
[4]
Learning long-term dependencies with gradient descent is difficult
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994
1994
-
[5]
Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025
arXiv 2025
-
[6]
Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J Ratliff, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026
arXiv 2026
-
[7]
Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, and Roberto Martín-Martín. Datamil: Selecting data for robot imitation learning with datamodels.arXiv preprint arXiv:2505.09603, 2025
Pith/arXiv arXiv 2025
-
[8]
Understanding world or predicting future? a comprehensive survey of world models
Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025
2025
-
[9]
Value flows.arXiv preprint arXiv:2510.07650, 2025
Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows.arXiv preprint arXiv:2510.07650, 2025
Pith/arXiv arXiv 2025
-
[10]
Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024
arXiv 2024
-
[11]
Studying the interplay between the actor and critic representations in reinforcement learning
Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Christopher Lucas, David Abel, Prakash Panangaden, and Stefano V Albrecht. Studying the interplay between the actor and critic representations in reinforcement learning. In International Conference on Learning Representations, volume 2025, pages 35590–35616, 2025
2025
-
[12]
World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018
Pith/arXiv arXiv 2018
-
[13]
Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019
Pith/arXiv arXiv 1912
-
[14]
Deep recurrent q-learning for partially observable mdps
Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. InAAAI fall symposia, volume 45, page 141, 2015
2015
-
[15]
Joey Hejna, Chethan Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. Re-mix: Optimizing data mixtures for large scale imitation learning.arXiv preprint arXiv:2408.14037, 2024
arXiv 2024
-
[16]
Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Pith/arXiv arXiv 2022
-
[17]
Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025
Pith/arXiv arXiv 2025
-
[18]
Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019
Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019
2019
-
[19]
Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025
Pith/arXiv arXiv 2025
-
[20]
arXiv preprint arXiv:2504.16054, 2025
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 19
Pith/arXiv arXiv 2025
-
[21]
Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.π∗ 0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026
Pith/arXiv arXiv 2026
-
[22]
Planning and acting in partially observable stochastic domains
Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998
1998
-
[23]
Recurrent experience replay in distributed reinforcement learning
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. InInternational conference on learning representations, 2018
2018
-
[24]
Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Pith/arXiv arXiv 2024
-
[25]
Cosmos policy: Fine-tuning video models for visuomotor control and planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026
Pith/arXiv arXiv 2026
-
[26]
Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026
arXiv 2026
-
[27]
Decisionnce: Embodied multimodal representations via implicit preference learning
Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning. In Forty-firstInternational Conference on Machine Learning, 2024
2024
-
[28]
Robo-mutual: Robotic multimodal task specification via unimodal learning
Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, et al. Robo-mutual: Robotic multimodal task specification via unimodal learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4182–4189. IEEE, 2025
2025
-
[29]
Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026
Pith/arXiv arXiv 2026
-
[30]
Gr-rl: Going dexterous and precise for long-horizon robotic manipulation
Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801, 2025
arXiv 2025
-
[31]
Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. arXiv preprint arXiv:2603.02115, 2026
Pith/arXiv arXiv 2026
-
[32]
Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026
Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, and Xianyuan Zhan. Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026
arXiv 2026
-
[33]
Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025
Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=Nu6N69i8SB
2025
-
[34]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
Pith/arXiv arXiv 2022
-
[35]
Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024
Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024
Pith/arXiv arXiv 2024
-
[36]
Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025
arXiv 2025
-
[37]
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 20
Pith/arXiv arXiv 2022
-
[38]
Sora: A review on background, technology, limitations, and opportunities of large vision models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024
Pith/arXiv arXiv 2024
-
[39]
Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, et al. Viva: A video-generative value model for robot reinforcement learning.arXiv preprint arXiv:2604.08168, 2026
Pith/arXiv arXiv 2026
-
[40]
Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026
arXiv 2026
-
[41]
Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training. arXivpreprintarXiv:2210.00030, 2022
Pith/arXiv arXiv 2022
-
[42]
Liv: Language- image representations and rewards for robotic control
Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language- image representations and rewards for robotic control. InInternational Conference on Machine Learning, pages 23301–23320. PMLR, 2023
2023
-
[43]
Vision language models are in-context value learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, et al. Vision language models are in-context value learners. InThe Thirteenth International Conference on Learning Representations, 2024
2024
-
[44]
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026
Pith/arXiv arXiv 2026
-
[45]
R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022
Pith/arXiv arXiv 2022
-
[46]
Value prediction network.Advances in neural information processing systems, 30, 2017
Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network.Advances in neural information processing systems, 30, 2017
2017
-
[47]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[48]
Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025
Pith/arXiv arXiv 2025
-
[49]
Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, et al. Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026
arXiv 2026
-
[50]
Stabilizing transformers for reinforcement learning
Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487–7498. PMLR, 2020
2020
-
[51]
Karl Pearson. Vii. note on regression and inheritance in the case of two parents.proceedings of the royal society of London, 58(347-352):240–242, 1895
-
[52]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[53]
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019
Pith/arXiv arXiv 1910
-
[54]
Reinforcement learning by reward-weighted regression for operational space control
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007
2007
-
[55]
Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Olivier Pietquin, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning.arXiv preprint arXiv:2312.01072, 2023
arXiv 2023
-
[56]
Universal value function approximators
Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 21 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v3...
2015
-
[57]
Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020
2020
-
[58]
Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026
Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026
Pith/arXiv arXiv 2026
-
[59]
Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
2016
-
[60]
MIT press Cambridge, 1998
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
1998
-
[61]
Robo-dopamine: General process reward modeling for high-precision robotic manipulation
Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025
arXiv 2025
-
[62]
Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026
MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026
Pith/arXiv arXiv 2026
-
[63]
Mem: Multi-scale embodied memory for vision language action models
Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026
arXiv 2026
-
[64]
Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Pith/arXiv arXiv 2025
-
[65]
World action models: The next frontier in embodied ai, 2026
Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, and Yu-Gang Jiang. World action models: The next frontier in embodied ai, 2026. URLhttps://arxiv.org/abs/2605.12090
Pith/arXiv arXiv 2026
-
[66]
Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025
Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, and Xianyuan Zhan. Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025
arXiv 2025
-
[67]
Solving deep memory pomdps with recurrent policy gradients
Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In International conference on artificial neural networks, pages 697–706. Springer, 2007
2007
-
[68]
Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation
Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025
Pith/arXiv arXiv 2025
-
[69]
Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026
Pith/arXiv arXiv 2026
-
[70]
Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026
arXiv 2026
-
[71]
World action models are zero-shot policies, 2026
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...
Pith/arXiv arXiv 2026
-
[72]
Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026. URLhttps://arxiv.org/abs/2603.16666. 22
Pith/arXiv arXiv 2026
-
[73]
A vision-language-action-critic model for robotic real-world reinforcement learning
Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025
arXiv 2025
-
[74]
RewiND: Language-guided rewards teach robot policies without new demonstrations
Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. RewiND: Language-guided rewards teach robot policies without new demonstrations. In9th Annual Conference on Robot Learning, 2025. URLhttps://openreview.net/forum?id=XjjXLxfPou
2025
-
[75]
Universal actions for enhanced embodied foundation models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025
2025
-
[76]
X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 23
Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.