dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Pith reviewed 2026-05-08 11:41 UTC · model grok-4.3
The pith
A discrete diffusion world model evaluates robot policies at scale by unifying vision, language and actions into tokens and tracking task progress.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a discrete diffusion world model, which maps all modalities into a unified token space and denoises them with a single transformer equipped with sparse keyframe memory and an explicit progress token, serves as an accurate, scalable evaluation proxy: it jointly forecasts future observations and automatically declares a policy successful when the progress token reaches 1.
What carries the argument
A discrete diffusion world model that unifies vision, language, and robotic actions into tokens, denoises them with a single transformer, and tracks task completion via a progress token.
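The load-bearing mechanism is masked in-filling over one shared token sequence. A minimal sketch (not the paper's code): `predict` stands in for the single transformer denoiser, and the token ids, segment layout, and `MASK` id are all illustrative assumptions; a real sampler would unmask positions progressively over several steps rather than all at once.

```python
# Minimal sketch of one masked-diffusion denoising step over a unified
# token sequence laid out as [vision | language | action | progress].
# All ids and the segment layout are illustrative assumptions.
MASK = -1

def denoise_step(tokens, predict):
    """Replace every masked position with the model's prediction,
    leaving observed (conditioning) tokens untouched."""
    preds = predict(tokens)                        # one joint forward pass
    return [p if t == MASK else t for t, p in zip(tokens, preds)]

def stub_predict(tokens):
    """Toy stand-in for the transformer denoiser: copies the last
    observed token into masked slots and writes 7 (a made-up
    'progress' codebook id) into the final slot."""
    last_obs = next(t for t in reversed(tokens) if t != MASK)
    out = [last_obs] * len(tokens)
    out[-1] = 7
    return out

# Layout: 4 vision tokens | 2 language | 1 action | 1 progress (masked).
seq = [3, 3, 5, 5, 9, 9, 2, MASK]
seq[2:4] = [MASK, MASK]                            # unknown future frames
filled = denoise_step(seq, stub_predict)           # masks are in-filled
```

The point of the layout is that observations, instructions, actions, and progress are denoised by the same network in one pass, so conditioning on any subset is just a choice of which positions to mask.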
If this is right
- Policy evaluation becomes feasible across thousands of environments and tasks without physical robot time.
- Success is determined automatically by monitoring when the progress token reaches 1.
- Spatiotemporal consistency during long rollouts is preserved by the sparse keyframe memory.
- The same architecture outperforms prior evaluators on LIBERO, RoboTwin and multiple real-robot tasks.
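The second bullet's success rule reduces to a rollout loop that stops when the decoded progress value reaches 1. A hedged sketch: the `world_model.step` interface and the `ToyWorld` dynamics are illustrative assumptions, not the paper's API.

```python
# Hypothetical auto-success check: roll a policy inside the world model
# and declare success once decoded progress reaches the threshold.
def evaluate_episode(world_model, policy, obs, max_steps=50, threshold=1.0):
    """Return (success, steps_taken) for one simulated rollout."""
    for t in range(1, max_steps + 1):
        action = policy(obs)
        obs, progress = world_model.step(obs, action)  # joint prediction
        if progress >= threshold:
            return True, t
    return False, max_steps

class ToyWorld:
    """Stand-in dynamics: progress grows by 0.25 per step."""
    def __init__(self):
        self.p = 0.0
    def step(self, obs, action):
        self.p = min(1.0, self.p + 0.25)
        return obs, self.p

ok, steps = evaluate_episode(ToyWorld(), policy=lambda o: 0, obs=None)
# With the toy dynamics, success is declared on step 4.
```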
Where Pith is reading between the lines
- The unified token approach could support policy improvement loops that use the model's simulated feedback for training.
- The progress-token mechanism might transfer to other long-horizon sequential tasks outside robotics.
- Larger versions of the model could handle evaluation of multi-agent or deformable-object scenarios.
Load-bearing premise
The learned discrete diffusion dynamics and progress token accurately reflect real task success and failure without systematic bias from tokenization or training data.
What would settle it
Execute the same set of policies both in dWorldEval simulations and on physical robots for identical tasks, then check whether the model's predicted success rates match the observed real-world success rates.
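That settling experiment reduces to correlating per-policy success rates from simulated rollouts against hardware measurements. A self-contained sketch with made-up numbers:

```python
# Correlate dWorldEval-predicted success rates with real-robot success
# rates across policies. All numbers below are invented for illustration.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

sim_rates = [0.9, 0.7, 0.4, 0.2]    # model-predicted success per policy
real_rates = [0.85, 0.65, 0.5, 0.1] # measured on physical robots
r = pearson(sim_rates, real_rates)  # high r => the proxy ranks policies well
```

A high correlation would support the proxy claim; rank correlation (Spearman) would additionally test whether the model at least orders policies correctly even if absolute rates are biased.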
Original abstract
Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities - including vision, language, and robotic actions - into a unified token space, modeling them via a single transformer-based denoising network. In this paper, we propose dWorldEval, using a discrete diffusion world model as a scalable evaluation proxy for robotics policy. Specifically, it maps all modalities, including vision, language, and robotics action into a unified token space, then denoises them with a single transformer network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and progress token, allowing automatically determine success when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes dWorldEval, a discrete diffusion world model for scalable robotic policy evaluation. It unifies vision, language, and actions into a shared token space processed by a single transformer-based denoising network, augments this with sparse keyframe memory for spatiotemporal consistency, and introduces a progress token that is jointly predicted with future observations; success is auto-labeled when the progress token reaches 1. The central empirical claim is that dWorldEval significantly outperforms WorldEval, Ctrl-World, and WorldGym on LIBERO, RoboTwin, and real-robot tasks.
Significance. If the progress-token-based success metric proves reliable, the method could enable evaluation of robotics policies at scales infeasible with direct simulation or hardware, reducing cost and time for benchmarking. The unified discrete diffusion architecture and progress token constitute a concrete architectural contribution to learned world models for robotics.
major comments (3)
- [Abstract and Experiments] The headline outperformance claim on LIBERO, RoboTwin, and real-robot tasks is load-bearing on the progress token reaching 1 as an automatic success label, yet no quantitative validation (precision, recall, or correlation against external ground-truth success annotations) is supplied for this token on any benchmark; without it, reported success rates versus baselines are not demonstrably comparable to real evaluation.
- [Method] The unified tokenization of vision/language/actions plus sparse keyframe memory is presented as preserving consistency, but no analysis quantifies how tokenization artifacts or training-distribution biases propagate into the progress token or policy-evaluation metrics (e.g., systematic over- or under-estimation of partial progress).
- [Experiments] The abstract asserts outperformance but supplies no numerical metrics, number of trials, ablation results on the progress token, or statistical tests; the full experiments section must include these to support the central claim.
minor comments (2)
- [Abstract] The abstract contains a duplicated sentence describing the proposal of dWorldEval.
- [Method] Notation for the progress token (its range, exact training loss, and inference thresholding) should be defined explicitly with an equation.
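One way the requested notation could look. This is a hypothetical formalization, not taken from the paper: it assumes a scalar progress value in [0, 1], quantized into K bins as the progress token, trained with a cross-entropy term alongside the diffusion loss, and thresholded at inference.

```latex
% Hypothetical formalization of the progress token (illustrative only).
p_t \in [0, 1], \qquad
\tilde{p}_t = \operatorname{round}(K p_t) \in \{0, 1, \dots, K\}

\mathcal{L} = \mathcal{L}_{\text{diff}}
  + \lambda \, \mathbb{E}_t\!\left[
      \operatorname{CE}\!\left( f_\theta(x_t), \tilde{p}_t \right)
    \right]

\text{success} \iff \hat{p}_t \ge 1 - \epsilon
  \quad \text{(inference-time threshold)}
```

Making the bin count K, the loss weight λ, and the slack ε explicit would let readers reproduce both the training objective and the success decision.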
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional quantitative validation and reporting are needed to strengthen the central claims, and we will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract and Experiments] The headline outperformance claim on LIBERO, RoboTwin, and real-robot tasks is load-bearing on the progress token reaching 1 as an automatic success label, yet no quantitative validation (precision, recall, or correlation against external ground-truth success annotations) is supplied for this token on any benchmark; without it, reported success rates versus baselines are not demonstrably comparable to real evaluation.
Authors: We acknowledge that the absence of quantitative validation for the progress token limits the interpretability of the reported success rates. The current manuscript relies on qualitative inspection and end-to-end performance gains. In the revision we will add a dedicated validation subsection reporting precision, recall, and Pearson/Spearman correlation of the progress token against external ground-truth success labels on held-out subsets of LIBERO, RoboTwin, and real-robot trajectories. revision: yes
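The promised validation is mechanically simple. A sketch of scoring the progress-token success signal against external ground-truth labels (toy data; deliberately dependency-free):

```python
# Precision/recall of the thresholded progress token against external
# ground-truth success annotations. Data below is invented for illustration.
def precision_recall(pred, truth):
    """Both arguments are parallel lists of booleans."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Final decoded progress per episode; progress >= 1.0 => predicted success.
final_progress = [1.0, 0.6, 1.0, 1.0, 0.3]
truth = [True, False, True, False, False]   # external annotation
pred = [p >= 1.0 for p in final_progress]
prec, rec = precision_recall(pred, truth)
```

Here the third predicted success is a false positive, so precision is 2/3 while recall is 1.0; reporting both exposes exactly the over-crediting failure mode the referee worries about.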
-
Referee: [Method] The unified tokenization of vision/language/actions plus sparse keyframe memory is presented as preserving consistency, but no analysis quantifies how tokenization artifacts or training-distribution biases propagate into the progress token or policy-evaluation metrics (e.g., systematic over- or under-estimation of partial progress).
Authors: We agree that an explicit analysis of tokenization effects is warranted. We will insert a new paragraph and accompanying ablation table that measures how vocabulary size, keyframe sparsity, and training-data distribution shifts affect progress-token accuracy and downstream policy-evaluation bias (over- or under-estimation of partial progress). revision: yes
-
Referee: [Experiments] The abstract asserts outperformance but supplies no numerical metrics, number of trials, ablation results on the progress token, or statistical tests; the full experiments section must include these to support the central claim.
Authors: We will revise the abstract to include the main numerical success rates, trial counts, and a brief mention of statistical significance. In addition, the experiments section will be expanded with (i) an ablation isolating the progress token, (ii) explicit reporting of the number of evaluation episodes per method and benchmark, and (iii) statistical tests (e.g., paired t-tests or bootstrap confidence intervals) comparing dWorldEval against the baselines. revision: yes
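The bootstrap comparison proposed in (iii) can be sketched as follows. The per-episode outcomes and the 80%/50% success rates are synthetic, and the percentile method shown is one of several valid CI constructions:

```python
# Percentile-bootstrap confidence interval for the success-rate gap
# between two evaluators, from binary per-episode outcomes.
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile CI for mean(a) - mean(b)."""
    rng = random.Random(seed)                 # fixed seed: reproducible
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]       # resample with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

ours = [1] * 40 + [0] * 10      # 80% success over 50 episodes (made up)
baseline = [1] * 25 + [0] * 25  # 50% success (made up)
lo, hi = bootstrap_diff_ci(ours, baseline)
significant = lo > 0            # CI excludes zero => significant gap
```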
Circularity Check
No significant circularity: progress token is an independent architectural addition evaluated on external benchmarks
Full rationale
The paper proposes dWorldEval as a discrete diffusion world model for scalable robotics policy evaluation, mapping modalities to tokens and using a progress token to auto-determine success at inference. No equations, derivations, or claims in the abstract or description reduce the outperformance results on LIBERO, RoboTwin, or real-robot tasks to a fitted parameter, self-definition, or self-citation chain. The progress token is presented as a novel component whose correlation with task completion is assessed via external benchmarks rather than by construction. This aligns with the default expectation of no circularity for methodological papers whose central claims rest on independent experimental validation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Discrete diffusion on a unified token space can faithfully model robotic observation-action dynamics and task progress.
invented entities (1)
- progress token (no independent evidence)
Forward citations
Cited by 1 Pith paper
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Reference graph
Works this paper leans on
- [1] 1X Technologies. 1X World Model. https://www.1x.tech/discover/1x-world-model, 2025. Accessed 16-05-2025.
- [2] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
- [3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [5] https://arxiv.org/abs/2410.24164
- [6] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- [7] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
- [8] Markus Grotz, Mohit Shridhar, Yu-Wei Chao, Tamim Asfour, and Dieter Fox. PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks. In CoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024.
- [9] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.
- [10] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.
- [11] Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.
- [12] D. Ho, J. Monas, J. T. Ren, and C. Yu. 1X world model: Evaluating bits, not atoms, 2025.
- [13] Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv preprint arXiv:2311.17842, 2023.
- [14] Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, et al. EnerVerse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895, 2025.
- [15] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
- [16] Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025.
- [17] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [18] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
- [19] Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. LaViDa: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025.
- [20] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- [21] Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. WorldEval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.
- [22] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete Diffusion VLA: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025.
- [23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [24] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024.
- [25] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. HybridVLA: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025.
- [26] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- [27] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
- [28] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
- [29] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.
- [30] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021.
- [31] Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, et al. RoboTwin: Dual-arm robot benchmark with generative digital twins. arXiv preprint arXiv:2504.13059, 2025.
- [32] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
- [33] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
- [34] Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025.
- [35] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [36] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024.
- [37] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini Robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025.
- [38] Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, et al. Evaluating Gemini Robotics policies in a Veo world simulator. arXiv preprint arXiv:2512.10675, 2025.
- [39] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
- [40] Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models. arXiv preprint arXiv:2511.11520, 2025.
- [41] Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dVLA: Diffusion vision-language-action model with multimodal chain-of-thought. arXiv preprint arXiv:2509.25681, 2025.
- [42] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025.
- [43] Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, and Xiaoyan Sun. LLaDA-VLA: Vision language diffusion action models. arXiv preprint arXiv:2509.06932, 2025.
- [44] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
- [45] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025.
- [46] Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion: Tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
- [47] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.
- [48] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025.
- [49] Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, and Donglin Wang. VLAS: Vision-language-action model with speech instructions for customized robot manipulation. arXiv preprint arXiv:2502.13508, 2025.
- [50] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
- [51] Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, and Sergey Levine. AutoEval: Autonomous evaluation of generalist robot manipulation policies in the real world. arXiv preprint arXiv:2503.24278, 2025.
Appendix excerpt (experimental setup and prompt template)
The scoring prompt described in the appendix combines:
- A detailed task definition and rigid scoring rules
- Three anchor examples with pre-labeled scores (e.g., 0.2, 0.4, and 0.6) to demonstrate intermediate states
- A batch of query frames (typically 10 frames) to be evaluated independently; this batch processing significantly stabilizes the output and enforces strict adherence to the discrete scoring criteria
For the LIBERO-Object [23] suite, which primarily involves pick-and-place manipulation (e.g., "pick up the bbq sauce and place it in the basket"), the scoring criteria are redefined to reflect the sequ...
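The prompt structure described above (task definition, anchor examples, batched query frames) can be sketched as plain string assembly. The structure is inferred from the excerpt; every field name below is hypothetical, not the paper's template.

```python
# Illustrative construction of a batch-scoring prompt: task definition
# plus rigid rules, anchor examples with pre-labeled scores, and a batch
# of query frames scored independently in one call. Names are hypothetical.
def build_scoring_prompt(task, rules, anchors, query_frames):
    lines = [f"Task: {task}", f"Scoring rules: {rules}"]
    for score, desc in anchors:                    # intermediate-state anchors
        lines.append(f"Anchor (score {score}): {desc}")
    lines.append(f"Score each of the {len(query_frames)} frames independently:")
    lines += [f"[frame {i}]" for i in range(len(query_frames))]
    return "\n".join(lines)

prompt = build_scoring_prompt(
    task="pick up the bbq sauce and place it in the basket",
    rules="discrete scores in {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}",
    anchors=[(0.2, "gripper approaching object"),
             (0.4, "object grasped"),
             (0.6, "object lifted toward basket")],
    query_frames=list(range(10)),                  # placeholders for 10 frames
)
```

Batching the frames into one prompt, rather than scoring each frame in isolation, is what the excerpt credits with stabilizing the scores against per-call variance.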