World Value Models for Robotic Manipulation

Jianxiong Li; Junzhi Yu; Xianyuan Zhan; Xiao Ma; Yuan Gao; Yu Cui; Zhihao Wang

arxiv: 2606.24742 · v1 · pith:OOABD2SVnew · submitted 2026-06-23 · 💻 cs.RO

World Value Models for Robotic Manipulation

Zhihao Wang , Jianxiong Li , Yu Cui , Yuan Gao , Xianyuan Zhan , Junzhi Yu , Xiao Ma This is my paper

Pith reviewed 2026-06-25 23:32 UTC · model grok-4.3

classification 💻 cs.RO

keywords valuemodelsdatalearningroboticworldestimationmodel

0 comments

The pith

Combining world models with value estimation yields more accurate robotic task progress assessments than vision-language model backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that value estimation in robotics requires models to track how actions unfold across time, grounding the present state in past observations while projecting future outcomes. Most prior value models rely on vision-language backbones pretrained on static or sparse images and therefore lack this sequential reasoning. The authors instead build the World Value Model on a world model backbone, which is already trained to predict future states, and use it to score how close any trajectory is to task completion. This produces stronger correlation with human progress labels on both expert and suboptimal data and supplies better guidance when extracting policies from mixed-quality datasets. The result appears in improved manipulation performance in simulation and on physical robots.

Core claim

The authors construct the World Value Model by marrying a world model backbone to value estimation so that the model can both condition on historical context and anticipate future outcomes when judging task progress. This architecture is evaluated on standard benchmarks where it records state-of-the-art Value-Order Correlation scores and on the newly introduced Suboptimal-Value-Bench of 800 human-annotated suboptimal trajectories across multiple embodiments. When the resulting values are used to guide policy extraction, manipulation success rates rise across several methods in both simulated and real-world settings.

What carries the argument

World Value Model (WVM): a value estimation network whose backbone is a world model pretrained for sequential state prediction.

If this is right

WVM records state-of-the-art Value-Order Correlation on existing benchmarks.
The same leading performance holds on the new Suboptimal-Value-Bench that contains 800 human-labeled suboptimal trajectories.
Value signals from WVM improve final manipulation performance when used with multiple policy extraction methods.
Gains appear in both simulated environments and real-robot deployments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The temporal advantage could extend to value estimation in other sequential embodied tasks such as navigation or long-horizon assembly.
Scaling the underlying world model size might produce finer-grained value signals for extended task horizons.
Value functions built this way could serve as an interface between predictive world models and reward-driven policy optimization.
Further tests on additional robot embodiments would test how broadly the temporal benefit applies.
keywords:[
world models
value models
robotic manipulation

Load-bearing premise

That the temporal prediction strengths built into world models translate directly into more accurate value estimates for robotic manipulation trajectories.

What would settle it

Training a vision-language model backbone on the same robotic datasets with the same procedure and obtaining equal or higher Value-Order Correlation scores than the world-model version.

Figures

Figures reproduced from arXiv: 2606.24742 by Jianxiong Li, Junzhi Yu, Xianyuan Zhan, Xiao Ma, Yuan Gao, Yu Cui, Zhihao Wang.

read the original abstract

Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding, requiring models to both ground the current belief using historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, lacking the requisite temporal modeling capabilities for value estimation. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that offers accurate task progressions to assess data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. Complementing standard evaluation suites that contains only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WVM pairs a world model with value estimation to score mixed-quality robot trajectories and claims SOTA VOC plus policy gains, but the attribution to temporal modeling lacks isolating tests.

read the letter

The main takeaway is that this paper builds a value model on a world-model backbone rather than a VLM and reports stronger correlation with human judgments of trajectory quality, plus better downstream manipulation policies in both simulation and real hardware. They also release Suboptimal-Value-Bench, a new multi-embodiment set of 800 labeled suboptimal trajectories.

What stands out is the recognition that value estimation for robotics needs explicit future prediction and history grounding, which standard VLMs may not supply if their pretraining is mostly static. Adding a benchmark that includes non-expert data is a practical step, since most prior suites only test on curated expert rollouts.

The soft spots are in the evidence for the core claim. The abstract states SOTA VOC numbers and policy improvements but gives no equations, architecture diagram, training recipe, or ablation that separates the world-model inductive bias from differences in data scale, model size, or fine-tuning procedure. A temporally augmented VLM baseline could easily close the gap, which would leave the "marriage" story unsupported. Without those controls the SOTA result is hard to interpret.

The work is aimed at researchers scaling policy learning from noisy, large-scale robot data. Anyone already using world models or value functions in manipulation would find the benchmark and the empirical numbers worth checking, provided the full paper supplies the missing implementation details and comparisons.

It deserves peer review. The idea targets a real bottleneck and the new benchmark has reuse potential, even if the current write-up needs stronger controls before the conclusions can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the World Value Model (WVM) by integrating world models with value estimation for robotic manipulation tasks. It argues that existing VLM-based value models lack temporal modeling capabilities, while world models excel at this, enabling accurate task progression assessment for data quality. The paper claims SOTA results on Value-Order Correlation (VOC) benchmarks, introduces the Suboptimal-Value-Bench with 800 suboptimal multi-embodiment trajectories and human-labeled annotations, and reports improved manipulation performance when WVM is used for policy learning in both simulation and real-world settings.

Significance. If the empirical claims hold after verification, the work would be significant for advancing generalist value models in robotics by leveraging world models' temporal strengths for learning from mixed-quality data. The introduction of Suboptimal-Value-Bench is a concrete contribution that addresses limitations of expert-only benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the empirical focus on suboptimal data robustness is a positive aspect if properly ablated.

major comments (2)

[Abstract] Abstract: the claim of SOTA VOC results and downstream policy gains is presented without equations, training details, ablation studies, or statistical tests, preventing verification of the data against the claim and undermining assessment of the central empirical result.
[Introduction] Introduction and motivation sections: the load-bearing argument that observed gains arise specifically from world-model temporal modeling (rather than pretraining data volume, architecture scale, or fine-tuning) is not isolated; no comparison to a temporally-augmented VLM baseline (e.g., video-frame input or future-prediction auxiliary loss) is described, so the attribution to marrying world models with value estimation cannot be confirmed.

minor comments (1)

[Benchmark description] The new benchmark is described only at a high level; details on embodiment diversity, annotation protocol, and exact VOC computation should be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point-by-point below and commit to revisions that improve clarity and strengthen the empirical attribution.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of SOTA VOC results and downstream policy gains is presented without equations, training details, ablation studies, or statistical tests, preventing verification of the data against the claim and undermining assessment of the central empirical result.

Authors: We acknowledge that the abstract's brevity precludes inclusion of equations, full training hyperparameters, or statistical tests. All such details appear in the main text: model formulation and training in Section 3, ablation studies in Section 4.3, and statistical significance reporting in Section 4.2. To improve verifiability, we will revise the abstract to reference these sections explicitly and add one sentence summarizing the evaluation protocol and significance testing. revision: yes
Referee: [Introduction] Introduction and motivation sections: the load-bearing argument that observed gains arise specifically from world-model temporal modeling (rather than pretraining data volume, architecture scale, or fine-tuning) is not isolated; no comparison to a temporally-augmented VLM baseline (e.g., video-frame input or future-prediction auxiliary loss) is described, so the attribution to marrying world models with value estimation cannot be confirmed.

Authors: We agree that isolating the contribution of world-model temporal modeling requires a controlled comparison against a temporally-augmented VLM. Our existing baselines compare against standard VLM value models, but we did not include an explicit video-input or auxiliary-prediction VLM variant. We will add this baseline experiment in the revised manuscript (new subsection in Section 4) to directly test whether the observed gains are attributable to the world-model integration rather than scale or data volume. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical SOTA claims rest on external benchmarks and new labeled data

full rationale

The manuscript presents an empirical contribution: a new model (WVM) evaluated on standard benchmarks plus a newly introduced Suboptimal-Value-Bench with human-labeled annotations. No mathematical derivation chain, fitted-parameter-as-prediction step, or self-citation load-bearing premise appears in the abstract or described structure. The motivating claim that world models supply superior temporal modeling is tested by direct performance comparison rather than assumed by definition or prior self-work. The result is therefore self-contained against external data and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5797 in / 1101 out tokens · 32059 ms · 2026-06-25T23:32:02.234196+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 33 linked inside Pith

[1]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025
[2]

Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001

Bram Bakker. Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001

2001
[3]

A distributional perspective on reinforcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. Pmlr, 2017

2017
[4]

Learning long-term dependencies with gradient descent is difficult

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

1994
[5]

Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

arXiv 2025
[6]

Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026

Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J Ratliff, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026

arXiv 2026
[7]

Datamil: Selecting data for robot imitation learning with datamodels.arXiv preprint arXiv:2505.09603, 2025

Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, and Roberto Martín-Martín. Datamil: Selecting data for robot imitation learning with datamodels.arXiv preprint arXiv:2505.09603, 2025

Pith/arXiv arXiv 2025
[8]

Understanding world or predicting future? a comprehensive survey of world models

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025

2025
[9]

Value flows.arXiv preprint arXiv:2510.07650, 2025

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows.arXiv preprint arXiv:2510.07650, 2025

Pith/arXiv arXiv 2025
[10]

Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

arXiv 2024
[11]

Studying the interplay between the actor and critic representations in reinforcement learning

Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Christopher Lucas, David Abel, Prakash Panangaden, and Stefano V Albrecht. Studying the interplay between the actor and critic representations in reinforcement learning. In International Conference on Learning Representations, volume 2025, pages 35590–35616, 2025

2025
[12]

World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

Pith/arXiv arXiv 2018
[13]

Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Pith/arXiv arXiv 1912
[14]

Deep recurrent q-learning for partially observable mdps

Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. InAAAI fall symposia, volume 45, page 141, 2015

2015
[15]

Re-mix: Optimizing data mixtures for large scale imitation learning.arXiv preprint arXiv:2408.14037, 2024

Joey Hejna, Chethan Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. Re-mix: Optimizing data mixtures for large scale imitation learning.arXiv preprint arXiv:2408.14037, 2024

arXiv 2024
[16]

Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[17]

Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Pith/arXiv arXiv 2025
[18]

Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019

Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019

2019
[19]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025
[20]

arXiv preprint arXiv:2504.16054, 2025

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 19

Pith/arXiv arXiv 2025
[21]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.π∗ 0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026
[22]

Planning and acting in partially observable stochastic domains

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998

1998
[23]

Recurrent experience replay in distributed reinforcement learning

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. InInternational conference on learning representations, 2018

2018
[24]

Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[25]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[26]

Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

arXiv 2026
[27]

Decisionnce: Embodied multimodal representations via implicit preference learning

Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning. In Forty-firstInternational Conference on Machine Learning, 2024

2024
[28]

Robo-mutual: Robotic multimodal task specification via unimodal learning

Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, et al. Robo-mutual: Robotic multimodal task specification via unimodal learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4182–4189. IEEE, 2025

2025
[29]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[30]

Gr-rl: Going dexterous and precise for long-horizon robotic manipulation

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801, 2025

arXiv 2025
[31]

Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. arXiv preprint arXiv:2603.02115, 2026

Pith/arXiv arXiv 2026
[32]

Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026

Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, and Xianyuan Zhan. Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026

arXiv 2026
[33]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025

Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=Nu6N69i8SB

2025
[34]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[35]

Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

Pith/arXiv arXiv 2024
[36]

Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

arXiv 2025
[37]

Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 20

Pith/arXiv arXiv 2022
[38]

Sora: A review on background, technology, limitations, and opportunities of large vision models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

Pith/arXiv arXiv 2024
[39]

Viva: A video-generative value model for robot reinforcement learning.arXiv preprint arXiv:2604.08168, 2026

Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, et al. Viva: A video-generative value model for robot reinforcement learning.arXiv preprint arXiv:2604.08168, 2026

Pith/arXiv arXiv 2026
[40]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

arXiv 2026
[41]

Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training. arXivpreprintarXiv:2210.00030, 2022

Pith/arXiv arXiv 2022
[42]

Liv: Language- image representations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language- image representations and rewards for robotic control. InInternational Conference on Machine Learning, pages 23301–23320. PMLR, 2023

2023
[43]

Vision language models are in-context value learners

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, et al. Vision language models are in-context value learners. InThe Thirteenth International Conference on Learning Representations, 2024

2024
[44]

Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026
[45]

R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022
[46]

Value prediction network.Advances in neural information processing systems, 30, 2017

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network.Advances in neural information processing systems, 30, 2017

2017
[47]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[48]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025

Pith/arXiv arXiv 2025
[49]

Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026

Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, et al. Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026

arXiv 2026
[50]

Stabilizing transformers for reinforcement learning

Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487–7498. PMLR, 2020

2020
[51]

Karl Pearson. Vii. note on regression and inheritance in the case of two parents.proceedings of the royal society of London, 58(347-352):240–242, 1895
[52]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[53]

Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

Pith/arXiv arXiv 1910
[54]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007

2007
[55]

A survey of temporal credit assignment in deep reinforcement learning.arXiv preprint arXiv:2312.01072, 2023

Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Olivier Pietquin, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning.arXiv preprint arXiv:2312.01072, 2023

arXiv 2023
[56]

Universal value function approximators

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 21 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v3...

2015
[57]

Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

2020
[58]

Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Pith/arXiv arXiv 2026
[59]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

2016
[60]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

1998
[61]

Robo-dopamine: General process reward modeling for high-precision robotic manipulation

Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

arXiv 2025
[62]

Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

Pith/arXiv arXiv 2026
[63]

Mem: Multi-scale embodied memory for vision language action models

Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

arXiv 2026
[64]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[65]

World action models: The next frontier in embodied ai, 2026

Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, and Yu-Gang Jiang. World action models: The next frontier in embodied ai, 2026. URLhttps://arxiv.org/abs/2605.12090

Pith/arXiv arXiv 2026
[66]

Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025

Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, and Xianyuan Zhan. Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025

arXiv 2025
[67]

Solving deep memory pomdps with recurrent policy gradients

Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In International conference on artificial neural networks, pages 697–706. Springer, 2007

2007
[68]

Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation

Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

Pith/arXiv arXiv 2025
[69]

Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Pith/arXiv arXiv 2026
[70]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

arXiv 2026
[71]

World action models are zero-shot policies, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

Pith/arXiv arXiv 2026
[72]

Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026. URLhttps://arxiv.org/abs/2603.16666. 22

Pith/arXiv arXiv 2026
[73]

A vision-language-action-critic model for robotic real-world reinforcement learning

Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

arXiv 2025
[74]

RewiND: Language-guided rewards teach robot policies without new demonstrations

Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. RewiND: Language-guided rewards teach robot policies without new demonstrations. In9th Annual Conference on Robot Learning, 2025. URLhttps://openreview.net/forum?id=XjjXLxfPou

2025
[75]

Universal actions for enhanced embodied foundation models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025

2025
[76]

X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 23

Pith/arXiv arXiv 2025

[1] [1]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025

[2] [2]

Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001

Bram Bakker. Reinforcement learning with long short-term memory.Advancesin neural information processing systems, 14, 2001

2001

[3] [3]

A distributional perspective on reinforcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. Pmlr, 2017

2017

[4] [4]

Learning long-term dependencies with gradient descent is difficult

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

1994

[5] [5]

Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

arXiv 2025

[6] [6]

Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026

Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J Ratliff, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026

arXiv 2026

[7] [7]

Datamil: Selecting data for robot imitation learning with datamodels.arXiv preprint arXiv:2505.09603, 2025

Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, and Roberto Martín-Martín. Datamil: Selecting data for robot imitation learning with datamodels.arXiv preprint arXiv:2505.09603, 2025

Pith/arXiv arXiv 2025

[8] [8]

Understanding world or predicting future? a comprehensive survey of world models

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025

2025

[9] [9]

Value flows.arXiv preprint arXiv:2510.07650, 2025

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, and Benjamin Eysenbach. Value flows.arXiv preprint arXiv:2510.07650, 2025

Pith/arXiv arXiv 2025

[10] [10]

Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950, 2024

arXiv 2024

[11] [11]

Studying the interplay between the actor and critic representations in reinforcement learning

Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Christopher Lucas, David Abel, Prakash Panangaden, and Stefano V Albrecht. Studying the interplay between the actor and critic representations in reinforcement learning. In International Conference on Learning Representations, volume 2025, pages 35590–35616, 2025

2025

[12] [12]

World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

Pith/arXiv arXiv 2018

[13] [13]

Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Pith/arXiv arXiv 1912

[14] [14]

Deep recurrent q-learning for partially observable mdps

Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. InAAAI fall symposia, volume 45, page 141, 2015

2015

[15] [15]

Re-mix: Optimizing data mixtures for large scale imitation learning.arXiv preprint arXiv:2408.14037, 2024

Joey Hejna, Chethan Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. Re-mix: Optimizing data mixtures for large scale imitation learning.arXiv preprint arXiv:2408.14037, 2024

arXiv 2024

[16] [16]

Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022

[17] [17]

Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Pith/arXiv arXiv 2025

[18] [18]

Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019

Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value.Nature communications, 10(1):5223, 2019

2019

[19] [19]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025

[20] [20]

arXiv preprint arXiv:2504.16054, 2025

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 19

Pith/arXiv arXiv 2025

[21] [21]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.π∗ 0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026

[22] [22]

Planning and acting in partially observable stochastic domains

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998

1998

[23] [23]

Recurrent experience replay in distributed reinforcement learning

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. InInternational conference on learning representations, 2018

2018

[24] [24]

Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024

[25] [25]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[26] [26]

Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

arXiv 2026

[27] [27]

Decisionnce: Embodied multimodal representations via implicit preference learning

Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning. In Forty-firstInternational Conference on Machine Learning, 2024

2024

[28] [28]

Robo-mutual: Robotic multimodal task specification via unimodal learning

Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, et al. Robo-mutual: Robotic multimodal task specification via unimodal learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4182–4189. IEEE, 2025

2025

[29] [29]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026

[30] [30]

Gr-rl: Going dexterous and precise for long-horizon robotic manipulation

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801, 2025

arXiv 2025

[31] [31]

Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. arXiv preprint arXiv:2603.02115, 2026

Pith/arXiv arXiv 2026

[32] [32]

Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026

Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, and Xianyuan Zhan. Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2026

arXiv 2026

[33] [33]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025

Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=Nu6N69i8SB

2025

[34] [34]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[35] [35]

Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

Pith/arXiv arXiv 2024

[36] [36]

Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, and Xianyuan Zhan. Efficient robotic policy learning via latent space backward planning.arXiv preprint arXiv:2505.06861, 2025

arXiv 2025

[37] [37]

Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 20

Pith/arXiv arXiv 2022

[38] [38]

Sora: A review on background, technology, limitations, and opportunities of large vision models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

Pith/arXiv arXiv 2024

[39] [39]

Viva: A video-generative value model for robot reinforcement learning.arXiv preprint arXiv:2604.08168, 2026

Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, et al. Viva: A video-generative value model for robot reinforcement learning.arXiv preprint arXiv:2604.08168, 2026

Pith/arXiv arXiv 2026

[40] [40]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

arXiv 2026

[41] [41]

Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towardsuniversalvisualrewardandrepresentationviavalue-implicitpre-training. arXivpreprintarXiv:2210.00030, 2022

Pith/arXiv arXiv 2022

[42] [42]

Liv: Language- image representations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language- image representations and rewards for robotic control. InInternational Conference on Machine Learning, pages 23301–23320. PMLR, 2023

2023

[43] [43]

Vision language models are in-context value learners

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, et al. Vision language models are in-context value learners. InThe Thirteenth International Conference on Learning Representations, 2024

2024

[44] [44]

Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026

[45] [45]

R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022

[46] [46]

Value prediction network.Advances in neural information processing systems, 30, 2017

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network.Advances in neural information processing systems, 30, 2017

2017

[47] [47]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[48] [48]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025

Pith/arXiv arXiv 2025

[49] [49]

Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026

Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, et al. Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026

arXiv 2026

[50] [50]

Stabilizing transformers for reinforcement learning

Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487–7498. PMLR, 2020

2020

[51] [51]

Karl Pearson. Vii. note on regression and inheritance in the case of two parents.proceedings of the royal society of London, 58(347-352):240–242, 1895

[52] [52]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[53] [53]

Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

Pith/arXiv arXiv 1910

[54] [54]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007

2007

[55] [55]

A survey of temporal credit assignment in deep reinforcement learning.arXiv preprint arXiv:2312.01072, 2023

Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Olivier Pietquin, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning.arXiv preprint arXiv:2312.01072, 2023

arXiv 2023

[56] [56]

Universal value function approximators

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 21 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v3...

2015

[57] [57]

Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

2020

[58] [58]

Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Pith/arXiv arXiv 2026

[59] [59]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

2016

[60] [60]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

1998

[61] [61]

Robo-dopamine: General process reward modeling for high-precision robotic manipulation

Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

arXiv 2025

[62] [62]

Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

Pith/arXiv arXiv 2026

[63] [63]

Mem: Multi-scale embodied memory for vision language action models

Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

arXiv 2026

[64] [64]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[65] [65]

World action models: The next frontier in embodied ai, 2026

Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, and Yu-Gang Jiang. World action models: The next frontier in embodied ai, 2026. URLhttps://arxiv.org/abs/2605.12090

Pith/arXiv arXiv 2026

[66] [66]

Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025

Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, and Xianyuan Zhan. Physiagent: An embodied agent framework in physical world.arXiv preprint arXiv:2509.24524, 2025

arXiv 2025

[67] [67]

Solving deep memory pomdps with recurrent policy gradients

Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In International conference on artificial neural networks, pages 697–706. Springer, 2007

2007

[68] [68]

Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation

Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

Pith/arXiv arXiv 2025

[69] [69]

Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Pith/arXiv arXiv 2026

[70] [70]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

arXiv 2026

[71] [71]

World action models are zero-shot policies, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

Pith/arXiv arXiv 2026

[72] [72]

Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026. URLhttps://arxiv.org/abs/2603.16666. 22

Pith/arXiv arXiv 2026

[73] [73]

A vision-language-action-critic model for robotic real-world reinforcement learning

Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

arXiv 2025

[74] [74]

RewiND: Language-guided rewards teach robot policies without new demonstrations

Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. RewiND: Language-guided rewards teach robot policies without new demonstrations. In9th Annual Conference on Robot Learning, 2025. URLhttps://openreview.net/forum?id=XjjXLxfPou

2025

[75] [75]

Universal actions for enhanced embodied foundation models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025

2025

[76] [76]

X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 23

Pith/arXiv arXiv 2025