pith. machine review for the scientific record.

arxiv: 2605.00244 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV

Recognition: unknown

Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords robotic manipulation · synthetic data · zero-shot transfer · extended reality · physics simulation · visual policies · dexterous manipulation · sim-to-real

The pith

Lucid-XR produces synthetic multi-modal data from web-based XR physics simulation and uses it to train robot visual policies that transfer zero-shot to cluttered, dimly lit real-world environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lucid-XR as a system that creates large volumes of realistic synthetic training data for robotic manipulation without any real-world collection. It centers on vuer, a web-accessible physics simulator running on XR headsets, paired with human pose retargeting and a language-steerable video generation step that adds physical consistency. Policies trained only on this data are shown to work directly in new real settings with clutter and poor lighting. A sympathetic reader would care because it removes the main practical barrier of gathering expensive, hard-to-vary real data for complex contact-rich tasks. The examples cover soft objects, particles, and rigid-body interactions.

Core claim

Lucid-XR is a generative data engine whose core is an on-device physics simulation environment called vuer that runs directly in an XR headset over the web, integrated with human-to-robot pose retargeting and a physics-guided video generation pipeline that accepts natural-language specifications; training visual policies exclusively on the resulting synthetic multi-modal data produces zero-shot transfer to previously unseen real-world evaluation scenes that are cluttered and badly lit.

What carries the argument

vuer, the web-based physics simulation environment that runs directly on the XR headset to provide latency-free immersive data collection and interaction at internet scale.

If this is right

  • Visual policies for dexterous manipulation can be trained entirely in simulation and deployed without real-world adaptation.
  • Data collection for soft-material, particle, and rigid-contact tasks becomes scalable through web-based XR access rather than physical hardware.
  • Natural-language steering of the video generation step allows targeted creation of training distributions for specific manipulation challenges.
  • Internet-scale access to the simulation removes the need for specialized lab equipment when gathering diverse multi-modal robot data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could make it practical to retrain policies frequently as robot hardware or task requirements change, since new data can be generated on demand without physical setup.
  • Language steerability opens the possibility that domain experts outside robotics could create custom datasets for narrow industrial or home tasks.
  • If the physics simulation in vuer proves accurate for contact dynamics, similar web-based engines might accelerate data generation for other embodied AI domains such as navigation or assembly.

Load-bearing premise

The synthetic data generated by the XR physics simulation and language-steerable video pipeline is realistic and diverse enough in physics and appearance to bridge the sim-to-real gap for visual policies without any real-world fine-tuning.

What would settle it

Deploy the same policy trained only on Lucid-XR data onto a physical robot and measure success rate on a standardized set of dexterous tasks in a cluttered, dimly lit room; if performance collapses relative to a policy that received even modest real data, the zero-shot claim is falsified.

Figures

Figures reproduced from arXiv: 2605.00244 by Adam Rashid, Alan Yu, Ge Yang, Gio Huh, Kai McClennen, Kevin Yang, Phillip Isola, Qinxi Yu, Xiaolong Wang, Yajvan Ravan, Zhutian Yang.

Figure 1
Figure 1: An Extended Reality Data Engine for Robotic Manipulation. Left: we deliver physics simulation to run directly on XR devices via the web browser, to enable internet-scale crowdsourcing of demonstration data collection. Right: our GenAI-powered synthetic data engine creates steerable, diverse, and realistic multi-view visual data to train real-world robots.
Figure 2
Figure 2: System Schematic of the Lucid-XR Pipeline. The results in this paper require hand-crafted, but basic 3D scenes. The data collection is done collectively by the authors. The resulting simulated datasets are augmented by a generative pipeline powered by language and text-to-image models to cover rare but mission-critical events that, by definition, are scarce in real-world datasets.
Figure 3
Figure 3: Moving physics simulation on-device enables untethered access to immersive simulations. The key benefits are twofold: first, it enables the simulation of deformable objects that involve modifying a large amount of mesh data that is too slow to send over WiFi. Second, it eliminates delays due to network latency, allowing the simulation to run at the device's native frame rate.
Figure 4
Figure 4: Multiple types of physics running natively inside the browser in real time. From left to right: flexible material interacting with a solid; a signed-distance-function (SDF) based collision solver for non-convex shapes; a deck of cards interacting with air/wind; soft skin material interacting with a solid. Rendering and simulation are both native in a web browser.
Figure 5
Figure 5: Controlling Hand Pose via Motion Capture Sites. We specify mocap sites by first aligning the proximal joints, and scale the hand so that the fingers are similar in size to the robot hand. We then weld the SE(3) pose of the fingertip to similar sites on the robot hand and the wrist. We adjust the torque scale to balance tracking of the position and the rotation portion of the pose.
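The weld mechanism this caption describes corresponds closely to MuJoCo's mocap bodies paired with a weld equality constraint. The following is a minimal sketch of that pattern, assuming the standard mujoco Python bindings; the scene, body names, and torquescale value are invented for illustration and are not the paper's vuer implementation.

    # Minimal weld-based retargeting sketch in MuJoCo (illustrative, not the paper's code).
    # A mocap body carries the streamed SE(3) target; a weld equality constraint
    # drags the "fingertip" body toward it, with torquescale trading off position
    # versus rotation tracking, as the caption describes.
    import mujoco
    import numpy as np

    XML = """
    <mujoco>
      <worldbody>
        <body name="target" mocap="true" pos="0 0 0.3">
          <geom type="sphere" size="0.01" contype="0" conaffinity="0" rgba="1 0 0 0.5"/>
        </body>
        <body name="fingertip" pos="0 0 0.3">
          <freejoint/>
          <geom type="sphere" size="0.01" mass="0.01"/>
        </body>
      </worldbody>
      <equality>
        <weld body1="target" body2="fingertip" torquescale="0.1"/>
      </equality>
    </mujoco>
    """

    model = mujoco.MjModel.from_xml_string(XML)
    data = mujoco.MjData(model)

    def track(pos, quat, steps=50):
        """Stream one SE(3) target (e.g. from an XR hand tracker) and simulate."""
        data.mocap_pos[0] = pos
        data.mocap_quat[0] = quat
        for _ in range(steps):
            mujoco.mj_step(model, data)

    track(np.array([0.05, 0.0, 0.35]), np.array([1.0, 0.0, 0.0, 0.0]))
    print("fingertip position:", data.body("fingertip").xpos)

A real retargeter would add one such weld per fingertip and wrist site and tune the constraint stiffness, but the driving mechanism is the same.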
Figure 6
Figure 6: Handling dynamic tasks and deformable objects. The on-device retargeting is accurate enough to balance three blocks on one hand, is fast enough to handle dynamic tasks such as throwing a basketball, and handles deformable objects for cloth folding.
Figure 7
Figure 7: (a) Spot robot from Menagerie. (b–c) RoboCasa scenes.
Figure 8
Figure 8: Our setup for generating realistic-looking images. We follow the LucidSim [11] recipe, which starts with a collection of diverse text prompts collected from ChatGPT, and use the semantic mask labels from the physics simulation to control the image generation process. Prompts for generation are sourced en masse from ChatGPT via a meta-prompt (see appendix).
Figure 9
Figure 9: Lucid-XR can simulate contact-rich manipulation of diverse physics. (a) Granular materials to simulate pouring liquid, (b) deformable materials when tying a knot, (c) tight tolerances between objects in contact in a shape insertion task, (d) placing a mug onto a drying rack, and (e) a ball sorting toy.
Figure 10
Figure 10: Amount of data collected for Lucid-XR vs. real-world teleoperation. We record observations and actions as SE(3) mocap poses at 25 Hz; rotations use the 6D representation of [13]. Demonstrations are embodiment-agnostic, so behavior-cloned policies can deploy on any two-finger gripper. Policies take proprioception as well as either a wrist RGB view or three fixed RGB views of the workspace, and output chunked absolute …
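The 6D rotation representation mentioned here keeps the first two columns of the rotation matrix and recovers the third by Gram–Schmidt, which is what makes it continuous for a network to regress. A small NumPy sketch of that parameterization, written from the published technique rather than from the paper's code:

    # 6D rotation representation: store the first two columns of a rotation
    # matrix; recover the full matrix by Gram-Schmidt orthonormalization.
    # Illustrative reimplementation of the standard technique, not the paper's code.
    import numpy as np

    def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
        """Drop the third column; the first two columns determine the rotation."""
        return R[:, :2].reshape(-1, order="F")  # column 1 then column 2

    def sixd_to_rotmat(d6: np.ndarray) -> np.ndarray:
        """Normalize a1, orthogonalize a2 against it, cross them for the third axis."""
        a1, a2 = d6[:3], d6[3:]
        b1 = a1 / np.linalg.norm(a1)
        b2 = a2 - np.dot(b1, a2) * b1
        b2 = b2 / np.linalg.norm(b2)
        b3 = np.cross(b1, b2)
        return np.stack([b1, b2, b3], axis=1)

    # Round trip on a 90-degree rotation about z.
    R = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    assert np.allclose(sixd_to_rotmat(rotmat_to_6d(R)), R)

A policy head that predicts these six numbers avoids the discontinuities of Euler angles and quaternion sign flips.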
Figure 11
Figure 11: Evaluation results from real-to-sim transfer (left) and per-task performance in simulation.
Figure 13
Figure 13: Loading a Dishwasher. Showing dexterous manipulation of a dishwasher with articulated doors, a plate, and its interaction with the racks inside the dishwasher.
Figure 14
Figure 14: Cleaning the Kitchen. The robot learns to pick up a cup and stack it on top of the bowl, followed by placing both objects into the kitchen sink.
Figure 15
Figure 15: Re-try Behavior in the Mug Tree Environment. The policy came into contact twice: first while picking up the mug, and a second time while trying to hang the mug onto the arm of the mug tree drying rack.
Figure 16
Figure 16: Additional Examples of Generated Images from Lucid-XR. Notice the control over lighting, geometry, and content diversity. (From Section 7.5, "The Vuer Scene Description Language": usage of the MuJoCo engine API in Python tends to follow an imperative pattern, where objects, material texture, and lighting are changed by mutating MuJoCo's physics and modeling buffers.)
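To make the imperative-versus-declarative contrast in that fragment concrete: the first half below mutates MuJoCo's model buffers directly through the real mujoco Python API, while the second half sketches a small declarative wrapper that compiles a scene description to MJCF. The wrapper (the Box/Light entities and to_mjcf) is a hypothetical illustration, not vuer's actual scene description language.

    # Imperative vs. declarative scene editing (the declarative half is hypothetical).
    import mujoco
    from dataclasses import dataclass

    # --- imperative: mutate the model's buffers in place -------------------
    model = mujoco.MjModel.from_xml_string(
        "<mujoco><worldbody>"
        "<light name='sun' pos='0 0 3'/>"
        "<body><geom name='box' type='box' size='.05 .05 .05' rgba='1 0 0 1'/></body>"
        "</worldbody></mujoco>")
    model.geom("box").rgba[:] = (0.0, 1.0, 0.0, 1.0)   # recolor by writing the buffer
    model.light("sun").pos[:] = (0.5, 0.0, 2.0)        # move the light the same way

    # --- declarative: describe the scene, then compile it to MJCF ----------
    @dataclass
    class Box:
        name: str
        size: float
        rgba: tuple

    @dataclass
    class Light:
        name: str
        pos: tuple

    def to_mjcf(entities) -> str:
        """Compile a list of scene entities into an MJCF string."""
        parts = []
        for e in entities:
            if isinstance(e, Box):
                parts.append(f"<body><geom name='{e.name}' type='box' "
                             f"size='{e.size} {e.size} {e.size}' "
                             f"rgba='{' '.join(map(str, e.rgba))}'/></body>")
            elif isinstance(e, Light):
                parts.append(f"<light name='{e.name}' pos='{' '.join(map(str, e.pos))}'/>")
        return "<mujoco><worldbody>" + "".join(parts) + "</worldbody></mujoco>"

    scene = [Light("sun", (0.5, 0.0, 2.0)), Box("box", 0.05, (0.0, 1.0, 0.0, 1.0))]
    model2 = mujoco.MjModel.from_xml_string(to_mjcf(scene))

The declarative form is easier to reuse and randomize per episode (swap objects, textures, and lights), which is presumably the motivation behind the paper's scene description language.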
Figure 17
Figure 17: The real-world fishing toy set (left), versus the simulated learning environment used to …
Figure 18
Figure 18: Clean Kitchen. Contains no clutter. We place 3D object assets programmatically. (a) Aligning the 3D mesh with the MuJoCo scene. (b) The scene after alignment.
Figure 19
Figure 19: Aligning the scan with the physics environment. (From Section 7.8, "Image Generation Workflow": the complete image generation workflow is provided in JSON form in the supplementary material and can be loaded into ComfyUI as-is.)
Figure 20
Figure 20: Image Generation Workflow. Using two object masks plus a normalized inverse depth image, we are able to control the geometry, lighting, and composition of the generated images.
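Figures 8 and 20 together describe the generation recipe: language prompts set appearance, while semantic masks and a normalized inverse-depth render from the simulator pin down geometry and composition. The paper ships this as a ComfyUI workflow (supplementary JSON); the sketch below is a rough analogue using the diffusers library's multi-ControlNet pipeline, with model names, file paths, and weights chosen for illustration only.

    # Hypothetical mask + depth conditioned generation step using diffusers.
    # The paper's actual workflow is a ComfyUI graph; treat this only as an analogy.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

    controlnets = [
        ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
        ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ]
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnets,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Conditioning renders exported from the physics simulator (placeholder paths).
    seg_mask = Image.open("sim_semantic_mask.png").convert("RGB")
    inv_depth = Image.open("sim_inverse_depth.png").convert("RGB")

    # The language prompt steers appearance; the masks and depth pin geometry.
    prompt = "a cluttered, dimly lit kitchen counter, photorealistic"
    image = pipe(
        prompt,
        image=[seg_mask, inv_depth],
        num_inference_steps=30,
        controlnet_conditioning_scale=[1.0, 0.7],
    ).images[0]
    image.save("generated_view.png")

Because the conditioning images come from the simulator, the generated views stay aligned with the recorded actions, which is what lets the re-skinned frames serve as policy training data.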
Original abstract

We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. Data collected is further amplified by a physics-guided video generation pipeline steerable via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and badly lit evaluation environments, after training entirely on Lucid-XR's synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: https://lucidxr.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical system demonstration with no derivations, fitted parameters, or self-referential predictions

full rationale

The paper introduces an engineering system (Lucid-XR with vuer on-device physics, pose retargeting, and language-steerable video generation) and reports an empirical demonstration of zero-shot policy transfer from synthetic data. No equations, parameter-fitting procedures, uniqueness theorems, or derivation chains are present in the provided text. The central claim is a direct empirical assertion about real-world performance after training on generated data; it does not reduce to any input by construction, self-citation, or renaming. Self-citations, if any, are not load-bearing for a mathematical result. This matches the default case of a non-circular engineering paper whose validity rests on external validation rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the on-device physics simulation and video generation produce data distributionally close enough to real-world conditions for zero-shot transfer.

axioms (1)
  • domain assumption: The physics simulation running on the XR headset accurately captures contact dynamics, soft-body behavior, and particle interactions relevant to manipulation.
    Invoked as the basis for generating useful synthetic data without real-world collection.
invented entities (1)
  • vuer · no independent evidence
    purpose: Web-based physics simulation environment running directly on XR headsets for latency-free interaction.
    New component introduced to enable internet-scale immersive data collection.

pith-pipeline@v0.9.0 · 5477 in / 1228 out tokens · 47943 ms · 2026-05-09T19:46:42.629369+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 26 canonical work pages · 3 internal anchors

  [1] J.-H. Ryu. Reality & effect: A cultural history of visual effects. Communication Dissertations, 2007.
  [2] J. Turnock. Before Industrial Light and Magic: the independent Hollywood special effects business, 1968–75: Research Article. New Rev. Film Telev. Stud., 7(2):133–156, June 2009.
  [3] S. Das. The evolution of visual effects in cinema: A journey from practical effects to CGI. Journal of Emerging Technologies and Innovative Research, 10(11):303–309, 2023.
  [4] B. Murodillayev. The impact of visual effects on the cinema experience: A comprehensive analysis. Art Des. Rev., 2024.
  [5] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–…
  [8] MuJoCo Documentation. Fluid forces. https://mujoco.readthedocs.io/en/latest/computation/fluid.html, 2025. Accessed: 2025-August-29.
  [9] R. Ban, K. Matsumoto, and T. Narumi. Hitchhiking hands: Remote interaction by switching multiple hand avatars with gaze. In SIGGRAPH Asia 2023 Emerging Technologies, pages 1–2, 2023.
  [10] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots, 2024. URL https://arxiv.org/abs/2402.16796.
  [11] X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-TeleVision: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024.
  [12] A. Yu, G. Yang, R. Choi, Y. Ravan, J. Leonard, and P. Isola. Learning visual parkour from generated images. In 8th Annual Conference on Robot Learning, 2024.
  [13] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. URL https://arxiv.org/abs/2310.17596.
  [15] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019.
  [16] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
  [17] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers, 2020. URL https://arxiv.org/abs/2005.12872.
  [18] T. Garipov, S. D. Peuter, G. Yang, V. Garg, S. Kaski, and T. Jaakkola. Compositional sculpting of iterative generative processes, 2023. URL https://arxiv.org/abs/2309.16115.
  [19] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2024.
  [20] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis, 2022. URL https://arxiv.org/abs/2205.09991.
  [21] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer, 2017. URL https://arxiv.org/abs/1709.07871.
  [22] S. Chen, C. Wang, K. Nguyen, L. Fei-Fei, and C. K. Liu. ARCap: Collecting high-quality human demonstrations for robot learning with augmented reality feedback. arXiv preprint arXiv:2410.08464, 2024.
  [23] N. Nechyporenko, R. Hoque, C. Webb, M. Sivapurapu, and J. Zhang. ARMADA: Augmented reality for robot manipulation and robot-free data acquisition, 2024. URL https://arxiv.org/abs/2412.10631.
  [24] X. Jiang, P. Mattes, X. Jia, N. Schreiber, G. Neumann, and R. Lioutikov. A comprehensive user study on augmented reality-based data collection interfaces for robot learning. In 2024 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 333–342, 2024.
  [25] Y. Yang, B. Ikeda, G. Bertasius, and D. Szafir. ARCADE: Scalable demonstration collection and generation via augmented reality for imitation learning, 2024. URL https://arxiv.org/abs/2410.15994.
  [26] J. Duan, Y. R. Wang, M. Shridhar, D. Fox, and R. Krishna. AR2-D2: Training a robot without a robot, 2023.
  [27] J. Wang, C.-C. Chang, J. Duan, D. Fox, and R. Krishna. EVE: Enabling anyone to train robots using augmented reality, 2024. URL https://arxiv.org/abs/2404.06089.
  [28] J. van Haastregt, M. C. Welle, Y. Zhang, and D. Kragic. Puppeteer your robot: Augmented reality leader-follower teleoperation, 2024. URL https://arxiv.org/abs/2407.11741.
  [29] A. Iyer, Z. Peng, Y. Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto. Open Teach: A versatile teleoperation system for robotic manipulation, 2024. URL https://arxiv.org/abs/2403.07870.
  [30] A. Naceri, D. Mazzanti, J. Bimbo, Y. T. Tefera, D. Prattichizzo, D. G. Caldwell, L. S. Mattos, and N. Deshpande. The Vicarios virtual reality interface for remote robotic teleoperation: Teleporting for intuitive tele-manipulation. J. Intell. Robotics Syst., 101(4), Apr. 2021. ISSN 0921-0296. doi:10.1007/s10846-021-01311-7. URL https://doi.org/10.1007/…
  [31] X. Jiang, Q. Yuan, E. U. Dincer, H. Zhou, G. Li, X. Li, J. Haag, N. Schreiber, K. Li, G. Neumann, and R. Lioutikov. IRIS: An immersive robot interaction system, 2025. URL https://arxiv.org/abs/2502.03297.
  [32] Y. Park, J. S. Bhatia, L. Ankile, and P. Agrawal. DexHub and DART: Towards internet scale robot data collection, 2024. URL https://arxiv.org/abs/2411.02214.
  [33] P. Katara, Z. Xian, and K. Fragkiadaki. Gen2Sim: Scaling up robot learning in simulation with generative models, 2023. URL https://arxiv.org/abs/2310.18308.
  [34] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation, 2024. URL https://arxiv.org/abs/2311.01455.
  [35] Y. J. Ma, W. Liang, H.-J. Wang, S. Wang, Y. Zhu, L. Fan, O. Bastani, and D. Jayaraman. DrEureka: Language model guided sim-to-real transfer, 2024. URL https://arxiv.org/abs/2406.01967.
  [36] Z. Chen, A. Walsman, M. Memmel, K. Mo, A. Fang, K. Vemuri, A. Wu, D. Fox, and A. Gupta. URDFormer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656, 2024.
  [37] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang, J. Singh, C. Tan, D. M, J. Peralta, B. Ichter, K. Hausman, and F. Xia. Scaling robot learning with semantically imagined experience, 2023. URL https://arxiv.org/abs/2302.11550.
  [38] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. CACTI: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.
  [39] J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin, L. Magne, A. Mandlekar, A. Narayan, Y. L. Tan, G. Wang, J. Wang, Q. Wang, Y. Xu, X. Zeng, K. Zheng, R. Zheng, M.-Y. Liu, L. Zettlemoyer, D. Fox, J. Kautz, S. Reed, Y. Zhu, and L. Fan. DreamGen: Unlocking generalization in robot learning through video world…
  [40] Z. Chen, S. Kiami, A. Gupta, and V. Kumar. GenAug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023.
  [41] O. X.-E. Collaboration, A. O'Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. W…
  [42] H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu. RH20T: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
  [43] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. …

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...