pith. machine review for the scientific record.

arxiv: 2604.09860 · v3 · submitted 2026-04-10 · 💻 cs.RO · cs.AI

Recognition: 1 theorem link

· Lean Theorem

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:23 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robotics · simulation benchmark · task generalization · policy evaluation · perturbation analysis · high-fidelity simulation · foundation models · robotic competencies

The pith

RoboLab simulation benchmark reveals significant performance gaps in state-of-the-art robotic policies under controlled perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboLab, a high-fidelity simulation framework that generates diverse scenes and tasks to test robotic policies in a robot- and policy-agnostic way. It creates the RoboLab-120 benchmark with 120 tasks split across visual, procedural, and relational competencies at three difficulty levels. The core goal is to measure how much real-world policy performance can be understood from simulation behavior and to identify which factors most strongly influence that behavior. By applying systematic perturbations, the analysis exposes large gaps in current models that standard benchmarks miss because they saturate quickly and overlap with training data. This matters because it offers a scalable way to assess true generalization without immediate reliance on physical robot tests.

Core claim

RoboLab enables human-authored and LLM-enabled generation of scenes and tasks within a high-fidelity simulation environment. The RoboLab-120 benchmark consists of 120 tasks categorized into visual, procedural, and relational competency axes across three difficulty levels. A systematic analysis quantifies both the performance of real-world policies and the sensitivity of their behavior to controlled perturbations, exposing significant performance gaps in current state-of-the-art models.

What carries the argument

RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational axes plus controlled perturbation analysis to quantify policy sensitivity in high-fidelity simulation.
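The taxonomy described here, three competency axes crossed with three difficulty levels over 120 tasks, can be sketched as a minimal data model. The names and the even split below are illustrative assumptions, not RoboLab's actual schema or distribution:

```python
from dataclasses import dataclass
from enum import Enum

class Axis(Enum):
    VISUAL = "visual"
    PROCEDURAL = "procedural"
    RELATIONAL = "relational"

class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass(frozen=True)
class Task:
    name: str
    axis: Axis
    difficulty: Difficulty

def breakdown(tasks):
    """Count tasks per (axis, difficulty) cell, like a benchmark summary table."""
    counts = {}
    for t in tasks:
        key = (t.axis, t.difficulty)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical assignment; the paper does not state that RoboLab-120 is
# balanced across cells, so this only illustrates the shape of the taxonomy.
tasks = [Task(f"task_{i:03d}", list(Axis)[i % 3], list(Difficulty)[(i // 3) % 3])
         for i in range(120)]

cells = breakdown(tasks)
print(len(cells), sum(cells.values()))  # 9 cells, 120 tasks total
```

Per-cell success rates over such a grid are what make the "granular metrics beyond simple success rates" claim concrete: a policy can be compared cell by cell instead of by one aggregate number.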

If this is right

  • Real-world policy performance can be predicted and understood through simulation analysis.
  • Specific factors that most strongly affect policy behavior can be isolated via controlled perturbations.
  • Granular metrics beyond simple success rates become available for comparing task-generalist models.
  • Scalable task generation supports broader evaluation of generalization capabilities across difficulty levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Policies could be iteratively improved by targeting the specific sensitivities identified in simulation.
  • If the simulation-to-reality link holds, extensive physical testing could be reduced for initial policy screening.
  • The framework's structure could be adapted to benchmark policies in related areas such as multi-robot coordination.

Load-bearing premise

High-fidelity simulation results can be used to understand and predict real-world policy behavior without substantial unmodeled domain gaps between simulation and reality.

What would settle it

Running the same controlled perturbation tests on the evaluated policies in physical robot experiments and observing substantially different sensitivity patterns or performance levels than those measured in the RoboLab simulations.
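One quantitative form of this test is to run the same perturbation sweep in simulation and on hardware and compare the resulting sensitivity curves. A minimal sketch with hypothetical numbers (not measurements from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical success rates for one policy under one perturbation sweep
# (e.g. increasing wrist-camera displacement), in simulation and on a robot.
sim_success = [0.90, 0.80, 0.60, 0.30, 0.10]
real_success = [0.85, 0.70, 0.50, 0.25, 0.10]

r = pearson(sim_success, real_success)
print(round(r, 3))
```

A consistently high correlation across many sweeps and policies would support the load-bearing premise; low or sign-flipped correlations would be exactly the "substantially different sensitivity patterns" that would settle the question the other way.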

Figures

Figures reproduced from arXiv: 2604.09860 by Alex Zook, Ankit Goyal, Fabio Ramos, Hugo Hadfield, Jonathan Tremblay, Rishit Dagli, Stan Birchfield, Xuning Yang.

Figure 1
Figure 1: Overview of RoboLab. RoboLab addresses the simulation-to-real gap by evaluating robotics policies on entirely held-out domains. By featuring a streamlined generation pipeline for new scenes and tasks (top row), RoboLab enables rapid extensibility for testing generalization capabilities. Our accompanying benchmark introduces visual, relational, and procedural testing axes, paired with robust metrics designe…
Figure 2
Figure 2: Three approaches for robotic benchmarks. Left: to date, pure simulation-based benchmarks have exhibited low visual quality, creating a large sim2real transfer gap. Middle: real2sim benchmarks address this issue by using techniques to bring real-world visual texture into simulation. However, these environments are extremely costly, with reported per-scene generation times of ~1 hr [10]. Right: our approach ach…
Figure 3
Figure 3: Task progression of a few tasks, illustrating errors encountered during policy rollout. Top row: although the task is successfully completed, errors were encountered during execution: 1) the robot drops the milk jug too early, missing the bin; 2) the robot grasps an orange (wrong object) and puts it in the bin. Middle row: an extraneous object was reoriented before the actual intended object. Final row: inten…
Figure 4
Figure 4: Example of language instructions in RoboLab-120. Taxonomic decomposition enables fine-grained analysis of policy capabilities by systematically assessing performance.
Figure 5
Figure 5: Comparison of policy performance for bowl-in-bin manipulation. Rows represent distinct policies shown in chronological order (left to right). Successful execution involves grasping the central red bowl and depositing it into the gray bin on the right. Unsuccessful attempts are characterized by aimless arm trajectories and a lack of object interaction. Performance is reported as a normalized graded score S_c(T) = (1/|T|) Σ_{τ∈T} S_c(τ).
Figure 6
Figure 6: Example scene variations, lighting variations, and camera pose variations in RoboLab.
Figure 7
Figure 7: Examples of language ablation experiments. Top: same scene and goal, but the instruction wording ranges from precise to increasingly vague. Middle: same scene, but the instruction specifies different tasks to perform. Bottom: same instruction, but the scene becomes progressively more complex.
Figure 8
Figure 8: Results of the sensitivity analysis using MNPE. Policies were highly sensitive to wrist-camera displacement from the nominal pose, indicating strong dependence on wrist-mounted camera calibration. Success also peaked for objects placed at approximately 0.5 m from the robot, likely due to robot reachability.
Figure 9
Figure 9: (Left) We show one of our Gaussian Splat + Mesh scenes in RoboLab. This scene has a Gaussian splat background with a collision mesh for the splat estimated with 3DGRUT [22, 30], and a mesh foreground. All objects in the scene have spatially varying density, and thus mass is estimated with VoMP [6]. (Right) We show a VLA running a task in this scene.
Figure 10
Figure 10: System prompt for Stage I (Semantic Planning). This prompt instructs the LLM to generate physically plausible scene…
Figure 11
Figure 11: Continued system prompt for Stage I (Semantic Planning). This prompt instructs the LLM to generate physically…
Figure 12
Figure 12: User prompt template for Stage I (medium target count). The highlighted fields are populated at runtime (theme, target…
Figure 13
Figure 13: Feedback block appended to the user prompt when spatial solving, physical placement, grammar checks, or intersection…
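The normalized graded score in the Figure 5 caption, S_c(T) = (1/|T|) Σ_{τ∈T} S_c(τ), is the mean of per-rollout graded scores over a task set T. A minimal sketch with hypothetical partial-credit values:

```python
def graded_score(per_rollout_scores):
    """Normalized graded score S_c(T): mean of per-rollout scores S_c(tau)."""
    if not per_rollout_scores:
        raise ValueError("empty task set")
    return sum(per_rollout_scores) / len(per_rollout_scores)

# Hypothetical partial-credit scores in [0, 1] for five rollouts of one task set.
print(graded_score([1.0, 0.5, 0.0, 0.75, 0.25]))  # 0.5
```

Unlike a binary success rate, partial credit per rollout lets near-misses (e.g. dropping the jug short of the bin in Figure 3) register differently from total failures.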
read the original abstract

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which factors most strongly affect policy behavior. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a high-fidelity simulation environment. We introduce an accompanying RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes (visual, procedural, relational) across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, exposing significant performance gaps in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a scalable framework for evaluating the true generalization capabilities of task-generalist robotic policies. Project website: https://research.nvidia.com/labs/srl/projects/robolab/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoboLab, a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies. It presents the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes at three difficulty levels, generated in a robot- and policy-agnostic manner via human authoring and LLM assistance. The framework supports systematic analysis of real-world policies by measuring their performance and sensitivity to controlled perturbations within simulation, claiming to reveal significant performance gaps in current state-of-the-art models and to enable better understanding of generalization without direct real-robot testing.

Significance. If the simulation faithfully reproduces real-world dynamics, sensors, and contacts for the evaluated policies, RoboLab could provide a scalable and safe tool for diagnosing robustness issues in foundation models that existing saturated benchmarks obscure. The granular metrics across competency axes and perturbation analysis would strengthen empirical evaluation practices in robotics, particularly for generalization testing.

major comments (2)
  1. [Abstract] The central claim that RoboLab simulation enables understanding of real-world policy behavior and exposes SOTA gaps (Abstract) requires that simulated dynamics and perturbation responses match physical robot outcomes. The manuscript supplies no quantitative sim-to-real alignment experiments, such as matched task success rates, sensitivity curves, or error analysis between simulation and physical robots, leaving open the possibility that unmodeled effects (actuator latency, surface compliance, lighting) decouple the reported gaps from actual real-world performance.
  2. [Analysis section] The systematic analysis of real-world policies (Abstract) quantifies performance and perturbation sensitivity but does not specify how the controlled perturbations are calibrated to real-world conditions or validated for fidelity. Without such grounding, it is unclear whether the exposed gaps reflect policy limitations or simulation artifacts, which is load-bearing for the claim that simulation analysis predicts real-world behavior.
minor comments (2)
  1. [Benchmark description] Clarify the exact distribution of the 120 tasks across the three competency axes and difficulty levels, including any tables or figures that break down the categories.
  2. Ensure the project website link is prominently included and that any code or benchmark release details are stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about sim-to-real validation by tempering claims, clarifying the scope of the simulation analysis, and adding explicit discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract] The central claim that RoboLab simulation enables understanding of real-world policy behavior and exposes SOTA gaps (Abstract) requires that simulated dynamics and perturbation responses match physical robot outcomes. The manuscript supplies no quantitative sim-to-real alignment experiments, such as matched task success rates, sensitivity curves, or error analysis between simulation and physical robots, leaving open the possibility that unmodeled effects (actuator latency, surface compliance, lighting) decouple the reported gaps from actual real-world performance.

    Authors: We agree that quantitative sim-to-real alignment experiments would strengthen the central claims. The current work positions RoboLab as a high-fidelity simulation tool for controlled analysis of generalization and sensitivity, rather than a direct proxy for physical outcomes. We have revised the abstract and introduction to replace phrasing about 'understanding the performance of a real-world policy' with 'providing simulation-based insights that can inform real-world policy evaluation'. A new limitations paragraph was added acknowledging the absence of matched physical-robot validation experiments and outlining future directions for such studies. revision: yes

  2. Referee: [Analysis section] The systematic analysis of real-world policies (Abstract) quantifies performance and perturbation sensitivity but does not specify how the controlled perturbations are calibrated to real-world conditions or validated for fidelity. Without such grounding, it is unclear whether the exposed gaps reflect policy limitations or simulation artifacts, which is load-bearing for the claim that simulation analysis predicts real-world behavior.

    Authors: We acknowledge that the perturbations were designed from domain knowledge of common real-world variations (lighting, texture, procedural ordering) without explicit calibration to physical sensor or dynamics data. We have expanded the analysis section to document the exact perturbation parameter ranges and generation procedure. The text now emphasizes that reported sensitivity metrics are relative within the simulated environment and revised the abstract to avoid implying direct prediction of real-world behavior without additional validation. revision: partial
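Documenting a perturbation sweep, as the rebuttal describes, amounts to fixing parameter ranges and enumerating the grid. A minimal sketch; the axes and values here are hypothetical, not RoboLab's actual settings:

```python
import itertools

# Hypothetical perturbation grid in the spirit of the rebuttal's "documented
# parameter ranges": wrist-camera displacement (m) and object distance (m).
PERTURBATIONS = {
    "wrist_cam_offset_m": [0.0, 0.01, 0.02, 0.05],
    "object_distance_m": [0.3, 0.5, 0.7, 0.9],
}

def sweep(grid):
    """Yield one config dict per point in the Cartesian product of the ranges."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(sweep(PERTURBATIONS))
print(len(configs))  # 16 grid points (4 x 4)
```

Publishing such a grid alongside results is what makes sensitivity metrics "relative within the simulated environment" reproducible: anyone can re-run the same points.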

Circularity Check

0 steps flagged

No circularity: benchmark framework is self-contained evaluation tool

full rationale

The paper introduces RoboLab as a new high-fidelity simulation benchmark and analysis framework for task-generalist policies. No equations, fitted parameters, or derivations are present that could reduce predictions to inputs by construction. The central claims rest on the design of the benchmark itself (scene generation, 120-task suite across visual/procedural/relational axes, perturbation sensitivity metrics) rather than any self-referential loop or self-citation chain. The assumption that simulation can inform real-world behavior is stated as a motivating goal but is not justified via internal reduction; it remains an external empirical question. This is a standard non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions about simulation fidelity rather than introducing new fitted parameters or entities.

axioms (1)
  • domain assumption High-fidelity simulation environments can approximate real-world conditions sufficiently for policy behavior analysis
    Invoked to justify using simulation results to understand real-world policy performance and sensitivity.

pith-pipeline@v0.9.0 · 5566 in / 1209 out tokens · 58618 ms · 2026-05-15T06:23:49.454458+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. World Action Models: The Next Frontier in Embodied AI

    cs.RO · 2026-05 · unverdicted · novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    Roboarena: Distributed real-world evaluation of generalist robot policies

    Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025.

  2. [2]

    A robust and sensitive metric for quantifying movement smoothness

    Sivakumar Balasubramanian, Alejandro Melendez-Calderon, and Etienne Burdet. A robust and sensitive metric for quantifying movement smoothness. IEEE Transactions on Biomedical Engineering, 59(8):2126–2136, 2012. doi: 10.1109/TBME.2011.2179545.

  3. [3]

    HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, and Tomas Hodan. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. CVPR, 2025.

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias...

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vis...

  6. [6]

    VoMP: Predicting volumetric mechanical property fields

    Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji Tsang, Anka He Chen, Anita Hu, Gavriel State, David I.W. Levin, and Maria Shugrina. VoMP: Predicting volumetric mechanical property fields. arXiv preprint, 2025.

  7. [7]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.

  8. [8]

    HANDAL: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions

    Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, and Stan Birchfield. HANDAL: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions. In IROS, 2023.

  9. [9]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  10. [10]

    Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025

    Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, and Karl Pertsch. Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025. URL https://arxiv.org/abs/2512.16881.

  11. [11]

    RLBench: The Robot Learning Benchmark & Learning Environment

    Stephen James, Zicong Ma, David R. Arrojo, and Andrew J. Davison. RLBench: The Robot Learning Benchmark & Learning Environment. RAL, 2020.

  12. [12]

    Robotarena ∞: Scalable robot benchmarking via real-to-sim translation, 2025

    Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, and Katerina Fragkiadaki. Robotarena ∞: Scalable robot benchmarking via real-to-sim translation, 2025. URL https://arxiv.org/abs/2510.23571

  13. [13]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee,...

  14. [14]

    Robot learning as an empirical science: Best practices for policy evaluation, 2024

    Hadas Kress-Gazit, Kunimatsu Hashimoto, Naveen Kuppuswamy, Paarth Shah, Phoebe Horgan, Gordon Richardson, Siyuan Feng, and Benjamin Burchfiel. Robot learning as an empirical science: Best practices for policy evaluation, 2024. URL https://arxiv.org/abs/2409.09491.

  15. [15]

    Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, et al. Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024.

  16. [16]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.

  17. [17]

    Multi-view fusion for multi-level robotic scene understanding

    Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio A. Vela, and Stan Birchfield. Multi-view fusion for multi-level robotic scene understanding. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6817–6824, 2021. doi: 10.1109/IROS51168.2021.9635994.

  18. [18]

    Evaluating text-to-visual generation with image-to-text generation, 2024

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation, 2024. URL https://arxiv.org/abs/2404.01291

  19. [19]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

  20. [20]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.

  21. [21]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo...

  22. [22]

    3d gaussian ray tracing: Fast tracing of particle scenes

    Nicolas Moenne-Loccoz, Ashkan Mirzaei, Or Perel, Riccardo de Lutio, Janick Martinez Esturo, Gavriel State, Sanja Fidler, Nicholas Sharp, and Zan Gojcic. 3d gaussian ray tracing: Fast tracing of particle scenes. ACM Transactions on Graphics and SIGGRAPH Asia, 2024.

  23. [23]

    Isaac Sim

    NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim.

  24. [24]

    GR00T N1: An open foundation model for generalist humanoid robots

    NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, Y...

  25. [25]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agar- wal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Be...

  26. [26]

    OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich,...

  27. [27]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747.

  28. [28]

    Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting, 2024

    Mohammad Nomaan Qureshi, Sparsh Garg, Francisco Yandun, David Held, George Kantor, and Abhishesh Silwal. Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting, 2024. URL https://arxiv.org/abs/2409.10161

  29. [29]

    Training deep networks with synthetic data: Bridging the reality gap by domain randomization

    Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 969–977, 2018.

  30. [30]

    3dgut: Enabling distorted cameras and secondary rays in gaussian splatting

    Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei, Nicolas Moenne-Loccoz, and Zan Gojcic. 3dgut: Enabling distorted cameras and secondary rays in gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  31. [31]

    Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes

    Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. 2018

  32. [32]

    Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent, 2025

    Yandan Yang, Baoxiong Jia, Shujie Zhang, and Siyuan Huang. Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent, 2025. URL https://arxiv.org/abs/2509.20414.

  33. [33]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. URL https://arxiv.org/abs/1910.10897.

  34. [34]

    Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body interactions

    Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, and Yunzhu Li. Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body interactions. arXiv preprint arXiv:2511.04665, 2025.

  35. [35]

    Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization

    Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025.

  36. [36]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Kevin Lin, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

  37. [37]

    GRS: Generating robotic simulation tasks from real-world images

    Alex Zook, Fan-Yun Sun, Josef Spjut, Valts Blukis, Stan Birchfield, and Jonathan Tremblay. GRS: Generating robotic simulation tasks from real-world images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 594–603, 2025.
