pith. machine review for the scientific record.

arxiv: 2605.00244 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV

Recognition: unknown

Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords robotic manipulation · synthetic data · zero-shot transfer · extended reality · physics simulation · visual policies · dexterous manipulation · sim-to-real

The pith

Lucid-XR produces synthetic multi-modal data from web-based XR physics simulation and uses it to train robot visual policies that transfer zero-shot to cluttered, dimly lit real-world environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lucid-XR as a system that creates large volumes of realistic synthetic training data for robotic manipulation without any real-world collection. It centers on vuer, a web-accessible physics simulator running on XR headsets, paired with human pose retargeting and a language-steerable video generation step that adds physical consistency. Policies trained only on this data are shown to work directly in new real settings with clutter and poor lighting. A sympathetic reader would care because it removes the main practical barrier of gathering expensive, hard-to-vary real data for complex contact-rich tasks. The examples cover soft objects, particles, and rigid-body interactions.

Core claim

Lucid-XR is a generative data engine whose core is an on-device physics simulation environment called vuer that runs directly in an XR headset over the web, integrated with human-to-robot pose retargeting and a physics-guided video generation pipeline that accepts natural-language specifications; training visual policies exclusively on the resulting synthetic multi-modal data produces zero-shot transfer to previously unseen real-world evaluation scenes that are cluttered and badly lit.

What carries the argument

vuer, the web-based physics simulation environment that runs directly on the XR headset to provide latency-free immersive data collection and interaction at internet scale.

If this is right

  • Visual policies for dexterous manipulation can be trained entirely in simulation and deployed without real-world adaptation.
  • Data collection for soft-material, particle, and rigid-contact tasks becomes scalable through web-based XR access rather than physical hardware.
  • Natural-language steering of the video generation step allows targeted creation of training distributions for specific manipulation challenges.
  • Internet-scale access to the simulation removes the need for specialized lab equipment when gathering diverse multi-modal robot data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could make it practical to retrain policies frequently as robot hardware or task requirements change, since new data can be generated on demand without physical setup.
  • Language steerability opens the possibility that domain experts outside robotics could create custom datasets for narrow industrial or home tasks.
  • If the physics simulation in vuer proves accurate for contact dynamics, similar web-based engines might accelerate data generation for other embodied AI domains such as navigation or assembly.

Load-bearing premise

The synthetic data generated by the XR physics simulation and language-steerable video pipeline is realistic and diverse enough in physics and appearance to bridge the sim-to-real gap for visual policies without any real-world fine-tuning.

What would settle it

Deploy the same policy trained only on Lucid-XR data onto a physical robot and measure success rate on a standardized set of dexterous tasks in a cluttered, dimly lit room; if performance collapses relative to a policy that received even modest real data, the zero-shot claim is falsified.

Figures

Figures reproduced from arXiv: 2605.00244 by Adam Rashid, Alan Yu, Ge Yang, Gio Huh, Kai McClennen, Kevin Yang, Phillip Isola, Qinxi Yu, Xiaolong Wang, Yajvan Ravan, Zhutian Yang.

Figure 1
Figure 1: An Extended Reality Data Engine for Robotic Manipulation. Left: we deliver physics simulation to run directly on XR devices via the web browser, to enable internet-scale crowdsourcing of demonstration data collection. Right: our GenAI-powered synthetic data engine creates steerable, diverse, and realistic multi-view visual data to train real-world robots.
Figure 2
Figure 2: System Schematic of the Lucid-XR Pipeline. The results in this paper require hand-crafted, but basic 3D scenes. The data collection is done collectively by the authors. The resulting simulated datasets are augmented by a generative pipeline powered by language and text-to-image models to cover rare but mission-critical events that, by definition, are scarce in real-world datasets.
Figure 3
Figure 3: Moving physics simulation on-device enables untethered access to immersive simulations. The key benefits are twofold: first, it enables the simulation of deformable objects that involve modifying a large amount of mesh data that is too slow to send over WiFi. Second, it eliminates delays due to network latency, allowing the simulation to run at the device's native frame rate.
Figure 4
Figure 4: Multiple types of physics running natively inside the browser in real time. From left to right: flexible material interacting with a solid; a signed-distance-function (SDF) based collision solver for non-convex shapes; a deck of cards interacting with air/wind; soft skin material interacting with a solid. Rendering and simulation are both native in a web browser.
Figure 5
Figure 5: Controlling Hand Pose via Motion Capture Sites. We specify mocap sites by first aligning the proximal joints, and scale the hand so that the fingers are similar in size to the robot hand. We then weld the SE(3) pose of the fingertip to similar sites on the robot hand and the wrist. We adjust the torque scale to balance tracking of the position and the rotation portion of the pose.
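The weld mechanism this caption describes corresponds closely to MuJoCo's mocap bodies paired with a weld equality constraint. The following is a minimal sketch of that pattern, assuming the standard mujoco Python bindings; the scene, body names, and torquescale value are invented for illustration and are not the paper's vuer implementation.

    # Minimal weld-based retargeting sketch in MuJoCo (illustrative, not the paper's code).
    # A mocap body carries the streamed SE(3) target; a weld equality constraint
    # drags the "fingertip" body toward it, with torquescale trading off position
    # versus rotation tracking, as the caption describes.
    import mujoco
    import numpy as np

    XML = """
    <mujoco>
      <worldbody>
        <body name="target" mocap="true" pos="0 0 0.3">
          <geom type="sphere" size="0.01" contype="0" conaffinity="0" rgba="1 0 0 0.5"/>
        </body>
        <body name="fingertip" pos="0 0 0.3">
          <freejoint/>
          <geom type="sphere" size="0.01" mass="0.01"/>
        </body>
      </worldbody>
      <equality>
        <weld body1="target" body2="fingertip" torquescale="0.1"/>
      </equality>
    </mujoco>
    """

    model = mujoco.MjModel.from_xml_string(XML)
    data = mujoco.MjData(model)

    def track(pos, quat, steps=50):
        """Stream one SE(3) target (e.g. from an XR hand tracker) and simulate."""
        data.mocap_pos[0] = pos
        data.mocap_quat[0] = quat
        for _ in range(steps):
            mujoco.mj_step(model, data)

    track(np.array([0.05, 0.0, 0.35]), np.array([1.0, 0.0, 0.0, 0.0]))
    print("fingertip position:", data.body("fingertip").xpos)

A real retargeter would add one such weld per fingertip and wrist site and tune the constraint stiffness, but the driving mechanism is the same.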
Figure 6
Figure 6: Handling dynamic tasks and deformable objects. The on-device retargeting is accurate enough to balance three blocks on one hand, is fast enough to handle dynamic tasks such as throwing a basketball, and handles deformable objects for cloth folding.
Figure 7
Figure 7: (a) Spot robot from Menagerie. (b–c) RoboCasa scenes.
Figure 8
Figure 8: Our setup for generating realistic-looking images. We follow the LucidSim [11] recipe, which starts with a collection of diverse text prompts collected from ChatGPT, and use the semantic mask labels from the physics simulation to control the image generation process. Prompts for generation are sourced en masse from ChatGPT via a meta-prompt (see appendix).
Figure 9
Figure 9: Lucid-XR can simulate contact-rich manipulation of diverse physics. (a) Granular materials to simulate pouring liquid, (b) deformable materials when tying a knot, (c) tight tolerances between objects in contact in a shape insertion task, (d) placing a mug onto a drying rack, and (e) a ball sorting toy.
Figure 10
Figure 10: Amount of data collected for Lucid-XR vs. real-world teleoperation. We record observations and actions as SE(3) mocap poses at 25 Hz; rotations use the 6D representation of [13]. Demonstrations are embodiment-agnostic, so behavior-cloned policies can deploy on any two-finger gripper. Policies take proprioception as well as either a wrist RGB view or three fixed RGB views of the workspace, and output chunked absolute …
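The 6D rotation representation mentioned here keeps the first two columns of the rotation matrix and recovers the third by Gram–Schmidt, which is what makes it continuous for a network to regress. A small NumPy sketch of that parameterization, written from the published technique rather than from the paper's code:

    # 6D rotation representation: store the first two columns of a rotation
    # matrix; recover the full matrix by Gram-Schmidt orthonormalization.
    # Illustrative reimplementation of the standard technique, not the paper's code.
    import numpy as np

    def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
        """Drop the third column; the first two columns determine the rotation."""
        return R[:, :2].reshape(-1, order="F")  # column 1 then column 2

    def sixd_to_rotmat(d6: np.ndarray) -> np.ndarray:
        """Normalize a1, orthogonalize a2 against it, cross them for the third axis."""
        a1, a2 = d6[:3], d6[3:]
        b1 = a1 / np.linalg.norm(a1)
        b2 = a2 - np.dot(b1, a2) * b1
        b2 = b2 / np.linalg.norm(b2)
        b3 = np.cross(b1, b2)
        return np.stack([b1, b2, b3], axis=1)

    # Round trip on a 90-degree rotation about z.
    R = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    assert np.allclose(sixd_to_rotmat(rotmat_to_6d(R)), R)

A policy head that predicts these six numbers avoids the discontinuities of Euler angles and quaternion sign flips.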
Figure 11
Figure 11: Evaluation results from real-to-sim transfer (left) and per-task performance in simulation.
Figure 13
Figure 13: Loading a Dishwasher. Showing dexterous manipulation of a dishwasher with articulated doors, a plate, and its interaction with the racks inside the dishwasher.
Figure 14
Figure 14: Cleaning the Kitchen. The robot learns to pick up a cup and stack it on top of the bowl, followed by placing both objects into the kitchen sink.
Figure 15
Figure 15: Re-try Behavior in the Mug Tree Environment. The policy came into contact twice: first while picking up the mug, and a second time while trying to hang the mug onto the arm of the mug tree drying rack.
Figure 16
Figure 16: Additional Examples of Generated Images from Lucid-XR. Notice the control over lighting, geometry, and content diversity. (From Section 7.5, "The Vuer Scene Description Language": usage of the MuJoCo engine API in Python tends to follow an imperative pattern, where objects, material texture, and lighting are changed by mutating MuJoCo's physics and modeling buffers.)
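To make the imperative-versus-declarative contrast in that fragment concrete: the first half below mutates MuJoCo's model buffers directly through the real mujoco Python API, while the second half sketches a small declarative wrapper that compiles a scene description to MJCF. The wrapper (the Box/Light entities and to_mjcf) is a hypothetical illustration, not vuer's actual scene description language.

    # Imperative vs. declarative scene editing (the declarative half is hypothetical).
    import mujoco
    from dataclasses import dataclass

    # --- imperative: mutate the model's buffers in place -------------------
    model = mujoco.MjModel.from_xml_string(
        "<mujoco><worldbody>"
        "<light name='sun' pos='0 0 3'/>"
        "<body><geom name='box' type='box' size='.05 .05 .05' rgba='1 0 0 1'/></body>"
        "</worldbody></mujoco>")
    model.geom("box").rgba[:] = (0.0, 1.0, 0.0, 1.0)   # recolor by writing the buffer
    model.light("sun").pos[:] = (0.5, 0.0, 2.0)        # move the light the same way

    # --- declarative: describe the scene, then compile it to MJCF ----------
    @dataclass
    class Box:
        name: str
        size: float
        rgba: tuple

    @dataclass
    class Light:
        name: str
        pos: tuple

    def to_mjcf(entities) -> str:
        """Compile a list of scene entities into an MJCF string."""
        parts = []
        for e in entities:
            if isinstance(e, Box):
                parts.append(f"<body><geom name='{e.name}' type='box' "
                             f"size='{e.size} {e.size} {e.size}' "
                             f"rgba='{' '.join(map(str, e.rgba))}'/></body>")
            elif isinstance(e, Light):
                parts.append(f"<light name='{e.name}' pos='{' '.join(map(str, e.pos))}'/>")
        return "<mujoco><worldbody>" + "".join(parts) + "</worldbody></mujoco>"

    scene = [Light("sun", (0.5, 0.0, 2.0)), Box("box", 0.05, (0.0, 1.0, 0.0, 1.0))]
    model2 = mujoco.MjModel.from_xml_string(to_mjcf(scene))

The declarative form is easier to reuse and randomize per episode (swap objects, textures, and lights), which is presumably the motivation behind the paper's scene description language.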
Figure 17
Figure 17: The real-world fishing toy set (left), versus the simulated learning environment used to …
Figure 18
Figure 18: Clean Kitchen. Contains no clutter. We place 3D object assets programmatically. (a) Aligning the 3D mesh with the MuJoCo scene. (b) The scene after alignment.
Figure 19
Figure 19: Aligning the scan with the physics environment. (From Section 7.8, "Image Generation Workflow": the complete image generation workflow is provided in JSON form in the supplementary material and can be loaded into ComfyUI as-is.)
Figure 20
Figure 20: Image Generation Workflow. Using two object masks plus a normalized inverse depth image, we are able to control the geometry, lighting, and composition of the generated images.
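Figures 8 and 20 together describe the generation recipe: language prompts set appearance, while semantic masks and a normalized inverse-depth render from the simulator pin down geometry and composition. The paper ships this as a ComfyUI workflow (supplementary JSON); the sketch below is a rough analogue using the diffusers library's multi-ControlNet pipeline, with model names, file paths, and weights chosen for illustration only.

    # Hypothetical mask + depth conditioned generation step using diffusers.
    # The paper's actual workflow is a ComfyUI graph; treat this only as an analogy.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

    controlnets = [
        ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
        ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ]
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnets,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Conditioning renders exported from the physics simulator (placeholder paths).
    seg_mask = Image.open("sim_semantic_mask.png").convert("RGB")
    inv_depth = Image.open("sim_inverse_depth.png").convert("RGB")

    # The language prompt steers appearance; the masks and depth pin geometry.
    prompt = "a cluttered, dimly lit kitchen counter, photorealistic"
    image = pipe(
        prompt,
        image=[seg_mask, inv_depth],
        num_inference_steps=30,
        controlnet_conditioning_scale=[1.0, 0.7],
    ).images[0]
    image.save("generated_view.png")

Because the conditioning images come from the simulator, the generated views stay aligned with the recorded actions, which is what lets the re-skinned frames serve as policy training data.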
Original abstract

We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. Data collected is further amplified by a physics-guided video generation pipeline steerable via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and badly lit evaluation environments, after training entirely on Lucid-XR's synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: https://lucidxr.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical system demonstration with no derivations, fitted parameters, or self-referential predictions

full rationale

The paper introduces an engineering system (Lucid-XR with vuer on-device physics, pose retargeting, and language-steerable video generation) and reports an empirical demonstration of zero-shot policy transfer from synthetic data. No equations, parameter-fitting procedures, uniqueness theorems, or derivation chains are present in the provided text. The central claim is a direct empirical assertion about real-world performance after training on generated data; it does not reduce to any input by construction, self-citation, or renaming. Self-citations, if any, are not load-bearing for a mathematical result. This matches the default case of a non-circular engineering paper whose validity rests on external validation rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the on-device physics simulation and video generation produce data distributionally close enough to real-world conditions for zero-shot transfer.

axioms (1)
  • domain assumption: The physics simulation running on the XR headset accurately captures contact dynamics, soft-body behavior, and particle interactions relevant to manipulation.
    Invoked as the basis for generating useful synthetic data without real-world collection.
invented entities (1)
  • vuer · no independent evidence
    purpose: Web-based physics simulation environment running directly on XR headsets for latency-free interaction.
    New component introduced to enable internet-scale immersive data collection.

pith-pipeline@v0.9.0 · 5477 in / 1228 out tokens · 47943 ms · 2026-05-09T19:46:42.629369+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 26 canonical work pages · 3 internal anchors

  [1] J.-H. Ryu. Reality & effect: A cultural history of visual effects. Communication Dissertations, 2007.
  [2] J. Turnock. Before Industrial Light and Magic: the independent Hollywood special effects business, 1968–75: Research Article. New Rev. Film Telev. Stud., 7(2):133–156, June 2009.
  [3] S. Das. The evolution of visual effects in cinema: A journey from practical effects to CGI. Journal of Emerging Technologies and Innovative Research, 10(11):303–309, 2023.
  [4] B. Murodillayev. The impact of visual effects on the cinema experience: A comprehensive analysis. Art Des. Rev., 2024.
  [5] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–…
  [8] MuJoCo Documentation. Fluid forces. https://mujoco.readthedocs.io/en/latest/computation/fluid.html, 2025. Accessed: 2025-August-29.
  [9] R. Ban, K. Matsumoto, and T. Narumi. Hitchhiking hands: Remote interaction by switching multiple hand avatars with gaze. In SIGGRAPH Asia 2023 Emerging Technologies, pages 1–2, 2023.
  [10] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots, 2024. URL https://arxiv.org/abs/2402.16796.
  [11] X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-TeleVision: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024.
  [12] A. Yu, G. Yang, R. Choi, Y. Ravan, J. Leonard, and P. Isola. Learning visual parkour from generated images. In 8th Annual Conference on Robot Learning, 2024.
  [13] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. URL https://arxiv.org/abs/2310.17596.
  [15] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019.
  [16] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
  [17] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers, 2020. URL https://arxiv.org/abs/2005.12872.
  [18] T. Garipov, S. D. Peuter, G. Yang, V. Garg, S. Kaski, and T. Jaakkola. Compositional sculpting of iterative generative processes, 2023. URL https://arxiv.org/abs/2309.16115.
  [19] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2024.
  [20] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis, 2022. URL https://arxiv.org/abs/2205.09991.
  [21] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer, 2017. URL https://arxiv.org/abs/1709.07871.
  [22] S. Chen, C. Wang, K. Nguyen, L. Fei-Fei, and C. K. Liu. ARCap: Collecting high-quality human demonstrations for robot learning with augmented reality feedback. arXiv preprint arXiv:2410.08464, 2024.
  [23] N. Nechyporenko, R. Hoque, C. Webb, M. Sivapurapu, and J. Zhang. ARMADA: Augmented reality for robot manipulation and robot-free data acquisition, 2024. URL https://arxiv.org/abs/2412.10631.
  [24] X. Jiang, P. Mattes, X. Jia, N. Schreiber, G. Neumann, and R. Lioutikov. A comprehensive user study on augmented reality-based data collection interfaces for robot learning. In 2024 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 333–342, 2024.
  [25] Y. Yang, B. Ikeda, G. Bertasius, and D. Szafir. ARCADE: Scalable demonstration collection and generation via augmented reality for imitation learning, 2024. URL https://arxiv.org/abs/2410.15994.
  [26] J. Duan, Y. R. Wang, M. Shridhar, D. Fox, and R. Krishna. AR2-D2: Training a robot without a robot, 2023.
  [27] J. Wang, C.-C. Chang, J. Duan, D. Fox, and R. Krishna. EVE: Enabling anyone to train robots using augmented reality, 2024. URL https://arxiv.org/abs/2404.06089.
  [28] J. van Haastregt, M. C. Welle, Y. Zhang, and D. Kragic. Puppeteer your robot: Augmented reality leader-follower teleoperation, 2024. URL https://arxiv.org/abs/2407.11741.
  [29] A. Iyer, Z. Peng, Y. Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto. Open Teach: A versatile teleoperation system for robotic manipulation, 2024. URL https://arxiv.org/abs/2403.07870.
  [30] A. Naceri, D. Mazzanti, J. Bimbo, Y. T. Tefera, D. Prattichizzo, D. G. Caldwell, L. S. Mattos, and N. Deshpande. The Vicarios virtual reality interface for remote robotic teleoperation: Teleporting for intuitive tele-manipulation. J. Intell. Robotics Syst., 101(4), Apr. 2021. ISSN 0921-0296. doi:10.1007/s10846-021-01311-7. URL https://doi.org/10.1007/…
  [31] X. Jiang, Q. Yuan, E. U. Dincer, H. Zhou, G. Li, X. Li, J. Haag, N. Schreiber, K. Li, G. Neumann, and R. Lioutikov. IRIS: An immersive robot interaction system, 2025. URL https://arxiv.org/abs/2502.03297.
  [32] Y. Park, J. S. Bhatia, L. Ankile, and P. Agrawal. DexHub and DART: Towards internet scale robot data collection, 2024. URL https://arxiv.org/abs/2411.02214.
  [33] P. Katara, Z. Xian, and K. Fragkiadaki. Gen2Sim: Scaling up robot learning in simulation with generative models, 2023. URL https://arxiv.org/abs/2310.18308.
  [34] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation, 2024. URL https://arxiv.org/abs/2311.01455.
  [35] Y. J. Ma, W. Liang, H.-J. Wang, S. Wang, Y. Zhu, L. Fan, O. Bastani, and D. Jayaraman. DrEureka: Language model guided sim-to-real transfer, 2024. URL https://arxiv.org/abs/2406.01967.
  [36] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta. URDFormer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656, 2024.
  [37] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang, J. Singh, C. Tan, D. M, J. Peralta, B. Ichter, K. Hausman, and F. Xia. Scaling robot learning with semantically imagined experience, 2023. URL https://arxiv.org/abs/2302.11550.
  [38] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. CACTI: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.
  [39] J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin, L. Magne, A. Mandlekar, A. Narayan, Y. L. Tan, G. Wang, J. Wang, Q. Wang, Y. Xu, X. Zeng, K. Zheng, R. Zheng, M.-Y. Liu, L. Zettlemoyer, D. Fox, J. Kautz, S. Reed, Y. Zhu, and L. Fan. DreamGen: Unlocking generalization in robot learning through video world…
  [40] Z. Chen, S. Kiami, A. Gupta, and V. Kumar. GenAug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023.
  [41] O. X.-E. Collaboration, A. O'Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. W…
  [42] H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu. RH20T: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
  [43] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. …

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...