dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Pith reviewed 2026-05-08 11:41 UTC · model grok-4.3
The pith
A discrete diffusion world model evaluates robot policies at scale by unifying vision, language and actions into tokens and tracking task progress.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a discrete diffusion world model, which maps all modalities into a unified token space and denoises them with a single transformer equipped with sparse keyframe memory and an explicit progress token, serves as an accurate, scalable evaluation proxy: it jointly forecasts future observations and automatically declares a policy successful when the progress token reaches 1.
What carries the argument
A discrete diffusion world model that unifies vision, language, and robotic actions into tokens, denoises them with a single transformer, and tracks task completion via a progress token.
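The load-bearing mechanism is masked in-filling over one shared token sequence. A minimal sketch (not the paper's code): `predict` stands in for the single transformer denoiser, and the token ids, segment layout, and `MASK` id are all illustrative assumptions; a real sampler would unmask positions progressively over several steps rather than all at once.

```python
# Minimal sketch of one masked-diffusion denoising step over a unified
# token sequence laid out as [vision | language | action | progress].
# All ids and the segment layout are illustrative assumptions.
MASK = -1

def denoise_step(tokens, predict):
    """Replace every masked position with the model's prediction,
    leaving observed (conditioning) tokens untouched."""
    preds = predict(tokens)                        # one joint forward pass
    return [p if t == MASK else t for t, p in zip(tokens, preds)]

def stub_predict(tokens):
    """Toy stand-in for the transformer denoiser: copies the last
    observed token into masked slots and writes 7 (a made-up
    'progress' codebook id) into the final slot."""
    last_obs = next(t for t in reversed(tokens) if t != MASK)
    out = [last_obs] * len(tokens)
    out[-1] = 7
    return out

# Layout: 4 vision tokens | 2 language | 1 action | 1 progress (masked).
seq = [3, 3, 5, 5, 9, 9, 2, MASK]
seq[2:4] = [MASK, MASK]                            # unknown future frames
filled = denoise_step(seq, stub_predict)           # masks are in-filled
```

The point of the layout is that observations, instructions, actions, and progress are denoised by the same network in one pass, so conditioning on any subset is just a choice of which positions to mask.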
If this is right
- Policy evaluation becomes feasible across thousands of environments and tasks without physical robot time.
- Success is determined automatically by monitoring when the progress token reaches 1.
- Spatiotemporal consistency during long rollouts is preserved by the sparse keyframe memory.
- The same architecture outperforms prior evaluators on LIBERO, RoboTwin and multiple real-robot tasks.
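The second bullet's success rule reduces to a rollout loop that stops when the decoded progress value reaches 1. A hedged sketch: the `world_model.step` interface and the `ToyWorld` dynamics are illustrative assumptions, not the paper's API.

```python
# Hypothetical auto-success check: roll a policy inside the world model
# and declare success once decoded progress reaches the threshold.
def evaluate_episode(world_model, policy, obs, max_steps=50, threshold=1.0):
    """Return (success, steps_taken) for one simulated rollout."""
    for t in range(1, max_steps + 1):
        action = policy(obs)
        obs, progress = world_model.step(obs, action)  # joint prediction
        if progress >= threshold:
            return True, t
    return False, max_steps

class ToyWorld:
    """Stand-in dynamics: progress grows by 0.25 per step."""
    def __init__(self):
        self.p = 0.0
    def step(self, obs, action):
        self.p = min(1.0, self.p + 0.25)
        return obs, self.p

ok, steps = evaluate_episode(ToyWorld(), policy=lambda o: 0, obs=None)
# With the toy dynamics, success is declared on step 4.
```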
Where Pith is reading between the lines
- The unified token approach could support policy improvement loops that use the model's simulated feedback for training.
- The progress-token mechanism might transfer to other long-horizon sequential tasks outside robotics.
- Larger versions of the model could handle evaluation of multi-agent or deformable-object scenarios.
Load-bearing premise
The learned discrete diffusion dynamics and progress token accurately reflect real task success and failure without systematic bias from tokenization or training data.
What would settle it
Execute the same set of policies both in dWorldEval simulations and on physical robots for identical tasks, then check whether the model's predicted success rates match the observed real-world success rates.
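That settling experiment reduces to correlating per-policy success rates from simulated rollouts against hardware measurements. A self-contained sketch with made-up numbers:

```python
# Correlate dWorldEval-predicted success rates with real-robot success
# rates across policies. All numbers below are invented for illustration.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

sim_rates = [0.9, 0.7, 0.4, 0.2]    # model-predicted success per policy
real_rates = [0.85, 0.65, 0.5, 0.1] # measured on physical robots
r = pearson(sim_rates, real_rates)  # high r => the proxy ranks policies well
```

A high correlation would support the proxy claim; rank correlation (Spearman) would additionally test whether the model at least orders policies correctly even if absolute rates are biased.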
Original abstract
Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities - including vision, language, and robotic actions - into a unified token space, modeling them via a single transformer-based denoising network. In this paper, we propose dWorldEval, using a discrete diffusion world model as a scalable evaluation proxy for robotics policy. Specifically, it maps all modalities, including vision, language, and robotics action into a unified token space, then denoises them with a single transformer network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and progress token, allowing automatically determine success when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes dWorldEval, a discrete diffusion world model for scalable robotic policy evaluation. It unifies vision, language, and actions into a shared token space processed by a single transformer-based denoising network, augments this with sparse keyframe memory for spatiotemporal consistency, and introduces a progress token that is jointly predicted with future observations; success is auto-labeled when the progress token reaches 1. The central empirical claim is that dWorldEval significantly outperforms WorldEval, Ctrl-World, and WorldGym on LIBERO, RoboTwin, and real-robot tasks.
Significance. If the progress-token-based success metric proves reliable, the method could enable evaluation of robotics policies at scales infeasible with direct simulation or hardware, reducing cost and time for benchmarking. The unified discrete diffusion architecture and progress token constitute a concrete architectural contribution to learned world models for robotics.
major comments (3)
- [Abstract and Experiments] The headline outperformance claim on LIBERO, RoboTwin, and real-robot tasks is load-bearing on the progress token reaching 1 as an automatic success label, yet no quantitative validation (precision, recall, or correlation against external ground-truth success annotations) is supplied for this token on any benchmark; without it, reported success rates versus baselines are not demonstrably comparable to real evaluation.
- [Method] The unified tokenization of vision/language/actions plus sparse keyframe memory is presented as preserving consistency, but no analysis quantifies how tokenization artifacts or training-distribution biases propagate into the progress token or policy-evaluation metrics (e.g., systematic over- or under-estimation of partial progress).
- [Experiments] The abstract asserts outperformance but supplies no numerical metrics, number of trials, ablation results on the progress token, or statistical tests; the full experiments section must include these to support the central claim.
minor comments (2)
- [Abstract] The abstract contains a duplicated sentence describing the proposal of dWorldEval.
- [Method] Notation for the progress token (its range, exact training loss, and inference thresholding) should be defined explicitly with an equation.
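One way the requested notation could look. This is a hypothetical formalization, not taken from the paper: it assumes a scalar progress value in [0, 1], quantized into K bins as the progress token, trained with a cross-entropy term alongside the diffusion loss, and thresholded at inference.

```latex
% Hypothetical formalization of the progress token (illustrative only).
p_t \in [0, 1], \qquad
\tilde{p}_t = \operatorname{round}(K p_t) \in \{0, 1, \dots, K\}

\mathcal{L} = \mathcal{L}_{\text{diff}}
  + \lambda \, \mathbb{E}_t\!\left[
      \operatorname{CE}\!\left( f_\theta(x_t), \tilde{p}_t \right)
    \right]

\text{success} \iff \hat{p}_t \ge 1 - \epsilon
  \quad \text{(inference-time threshold)}
```

Making the bin count K, the loss weight λ, and the slack ε explicit would let readers reproduce both the training objective and the success decision.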
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional quantitative validation and reporting are needed to strengthen the central claims, and we will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract and Experiments] The headline outperformance claim on LIBERO, RoboTwin, and real-robot tasks is load-bearing on the progress token reaching 1 as an automatic success label, yet no quantitative validation (precision, recall, or correlation against external ground-truth success annotations) is supplied for this token on any benchmark; without it, reported success rates versus baselines are not demonstrably comparable to real evaluation.
Authors: We acknowledge that the absence of quantitative validation for the progress token limits the interpretability of the reported success rates. The current manuscript relies on qualitative inspection and end-to-end performance gains. In the revision we will add a dedicated validation subsection reporting precision, recall, and Pearson/Spearman correlation of the progress token against external ground-truth success labels on held-out subsets of LIBERO, RoboTwin, and real-robot trajectories. revision: yes
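The promised validation is mechanically simple. A sketch of scoring the progress-token success signal against external ground-truth labels (toy data; deliberately dependency-free):

```python
# Precision/recall of the thresholded progress token against external
# ground-truth success annotations. Data below is invented for illustration.
def precision_recall(pred, truth):
    """Both arguments are parallel lists of booleans."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Final decoded progress per episode; progress >= 1.0 => predicted success.
final_progress = [1.0, 0.6, 1.0, 1.0, 0.3]
truth = [True, False, True, False, False]   # external annotation
pred = [p >= 1.0 for p in final_progress]
prec, rec = precision_recall(pred, truth)
```

Here the third predicted success is a false positive, so precision is 2/3 while recall is 1.0; reporting both exposes exactly the over-crediting failure mode the referee worries about.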
-
Referee: [Method] The unified tokenization of vision/language/actions plus sparse keyframe memory is presented as preserving consistency, but no analysis quantifies how tokenization artifacts or training-distribution biases propagate into the progress token or policy-evaluation metrics (e.g., systematic over- or under-estimation of partial progress).
Authors: We agree that an explicit analysis of tokenization effects is warranted. We will insert a new paragraph and accompanying ablation table that measures how vocabulary size, keyframe sparsity, and training-data distribution shifts affect progress-token accuracy and downstream policy-evaluation bias (over- or under-estimation of partial progress). revision: yes
-
Referee: [Experiments] The abstract asserts outperformance but supplies no numerical metrics, number of trials, ablation results on the progress token, or statistical tests; the full experiments section must include these to support the central claim.
Authors: We will revise the abstract to include the main numerical success rates, trial counts, and a brief mention of statistical significance. In addition, the experiments section will be expanded with (i) an ablation isolating the progress token, (ii) explicit reporting of the number of evaluation episodes per method and benchmark, and (iii) statistical tests (e.g., paired t-tests or bootstrap confidence intervals) comparing dWorldEval against the baselines. revision: yes
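The bootstrap comparison proposed in (iii) can be sketched as follows. The per-episode outcomes and the 80%/50% success rates are synthetic, and the percentile method shown is one of several valid CI constructions:

```python
# Percentile-bootstrap confidence interval for the success-rate gap
# between two evaluators, from binary per-episode outcomes.
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile CI for mean(a) - mean(b)."""
    rng = random.Random(seed)                 # fixed seed: reproducible
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]       # resample with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

ours = [1] * 40 + [0] * 10      # 80% success over 50 episodes (made up)
baseline = [1] * 25 + [0] * 25  # 50% success (made up)
lo, hi = bootstrap_diff_ci(ours, baseline)
significant = lo > 0            # CI excludes zero => significant gap
```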
Circularity Check
No significant circularity: progress token is an independent architectural addition evaluated on external benchmarks
Full rationale
The paper proposes dWorldEval as a discrete diffusion world model for scalable robotics policy evaluation, mapping modalities to tokens and using a progress token to auto-determine success at inference. No equations, derivations, or claims in the abstract or description reduce the outperformance results on LIBERO, RoboTwin, or real-robot tasks to a fitted parameter, self-definition, or self-citation chain. The progress token is presented as a novel component whose correlation with task completion is assessed via external benchmarks rather than by construction. This aligns with the default expectation of no circularity for methodological papers whose central claims rest on independent experimental validation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Discrete diffusion on a unified token space can faithfully model robotic observation-action dynamics and task progress.
invented entities (1)
- progress token (no independent evidence)
Forward citations
Cited by 1 Pith paper
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Reference graph
Works this paper leans on
- [1] 1X Technologies. 1X World Model. https://www.1x.tech/discover/1x-world-model, 2025. Accessed 16-05-2025.
- [2] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
- [3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [5] https://arxiv.org/abs/2410.24164
- [6] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- [7] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
- [8] Markus Grotz, Mohit Shridhar, Yu-Wei Chao, Tamim Asfour, and Dieter Fox. PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks. In CoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024.
- [9] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.
- [10] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.
- [11] Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.
- [12] D. Ho, J. Monas, J. T. Ren, and C. Yu. 1X world model: Evaluating bits, not atoms, 2025.
- [13] Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023.
- [14] Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, et al. EnerVerse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895, 2025.
- [15] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
- [16] Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025.
- [17] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [18] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
- [19] Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. LaViDa: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025.
- [20] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- [21] Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. WorldEval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.
- [22] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete Diffusion VLA: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025.
- [23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [24] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024.
- [25] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. HybridVLA: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025.
- [26] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- [27] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
- [28] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
- [29] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.
- [30] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021.
- [31] Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, et al. RoboTwin: Dual-arm robot benchmark with generative digital twins. arXiv preprint arXiv:2504.13059, 2025.
- [32] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
- [33] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
- [34] Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025.
- [35] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [36] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024.
- [37] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini Robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025.
- [38] Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, et al. Evaluating Gemini Robotics policies in a Veo world simulator. arXiv preprint arXiv:2512.10675, 2025.
- [39] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
- [40] Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models. arXiv preprint arXiv:2511.11520, 2025.
- [41] Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dVLA: Diffusion vision-language-action model with multimodal chain-of-thought. arXiv preprint arXiv:2509.25681, 2025.
- [42] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025.
- [43] Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, and Xiaoyan Sun. LLaDA-VLA: Vision language diffusion action models. arXiv preprint arXiv:2509.06932, 2025.
- [44] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
- [45] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025.
- [46] Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion: Tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
- [47] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.
- [48] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025.
- [49] Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, and Donglin Wang. VLAS: Vision-language-action model with speech instructions for customized robot manipulation. arXiv preprint arXiv:2502.13508, 2025.
- [50] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
- [51] Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, and Sergey Levine. AutoEval: Autonomous evaluation of generalist robot manipulation policies in the real world. arXiv preprint arXiv:2503.24278, 2025.
Appendix excerpt (experimental setup and prompt template)
The scoring prompt described in the appendix combines:
- A detailed task definition and rigid scoring rules
- Three anchor examples with pre-labeled scores (e.g., 0.2, 0.4, and 0.6) to demonstrate intermediate states
- A batch of query frames (typically 10 frames) to be evaluated independently; this batch processing significantly stabilizes the output and enforces strict adherence to the discrete scoring criteria
For the LIBERO-Object [23] suite, which primarily involves pick-and-place manipulation (e.g., "pick up the bbq sauce and place it in the basket"), the scoring criteria are redefined to reflect the sequ...
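The prompt structure described above (task definition, anchor examples, batched query frames) can be sketched as plain string assembly. The structure is inferred from the excerpt; every field name below is hypothetical, not the paper's template.

```python
# Illustrative construction of a batch-scoring prompt: task definition
# plus rigid rules, anchor examples with pre-labeled scores, and a batch
# of query frames scored independently in one call. Names are hypothetical.
def build_scoring_prompt(task, rules, anchors, query_frames):
    lines = [f"Task: {task}", f"Scoring rules: {rules}"]
    for score, desc in anchors:                    # intermediate-state anchors
        lines.append(f"Anchor (score {score}): {desc}")
    lines.append(f"Score each of the {len(query_frames)} frames independently:")
    lines += [f"[frame {i}]" for i in range(len(query_frames))]
    return "\n".join(lines)

prompt = build_scoring_prompt(
    task="pick up the bbq sauce and place it in the basket",
    rules="discrete scores in {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}",
    anchors=[(0.2, "gripper approaching object"),
             (0.4, "object grasped"),
             (0.6, "object lifted toward basket")],
    query_frames=list(range(10)),                  # placeholders for 10 frames
)
```

Batching the frames into one prompt, rather than scoring each frame in isolation, is what the excerpt credits with stabilizing the scores against per-call variance.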