RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
Pith reviewed 2026-05-15 12:09 UTC · model grok-4.3
The pith
RoboStereo’s dual-tower 4D world model unifies policy optimization and delivers an average relative improvement of more than 97 percent on fine-grained robot manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization consisting of Test-Time Policy Augmentation, Imitative-Evolutionary Policy Learning, and Open-Exploration Policy Learning.
What carries the argument
The symmetric dual-tower 4D world model with bidirectional cross-modal enhancement, which enforces spatiotemporal geometric consistency across modalities to support reliable policy rollouts.
If this is right
- Test-Time Policy Augmentation allows verification of policies before real execution.
- Imitative-Evolutionary Policy Learning lets agents improve from expert demonstrations via visual perceptual rewards.
- Open-Exploration Policy Learning supports autonomous skill discovery and self-correction.
- The approach yields state-of-the-art generation quality with over 97 percent average relative improvement on fine-grained manipulation tasks.
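A relative improvement this large is easiest to read as an average of per-task gains over a baseline. As a minimal sketch (the task names and success rates below are invented placeholders, not numbers from the paper), the computation would be:

```python
# Hypothetical per-task success rates; the paper's actual tasks and
# numbers are not given in this review, so these are placeholders.
baseline = {"peg_insertion": 0.31, "cable_routing": 0.24, "cap_twisting": 0.40}
robostereo = {"peg_insertion": 0.62, "cable_routing": 0.49, "cap_twisting": 0.78}

def avg_relative_improvement(base, ours):
    """Mean over tasks of (ours - base) / base, expressed in percent."""
    gains = [(ours[t] - base[t]) / base[t] for t in base]
    return 100.0 * sum(gains) / len(gains)

print(f"{avg_relative_improvement(baseline, robostereo):.1f}%")
```

Note that averaging relative gains weights low-success-rate tasks heavily, which is one reason the referee report below asks for the underlying baselines and error bars.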
Where Pith is reading between the lines
- If the consistency mechanism scales, the same dual-tower structure could support longer-horizon tasks without additional real data.
- Lower hallucination rates might allow policy training to rely almost entirely on simulated rollouts rather than mixed real-sim data.
- The three-part unified framework could transfer to other simulation-heavy domains such as autonomous driving or virtual agents.
Load-bearing premise
The bidirectional cross-modal enhancement provides enough spatiotemporal geometric consistency, and suppresses physics hallucinations sufficiently, for the claimed policy improvements to hold.
What would settle it
Running the same manipulation tasks with the bidirectional cross-modal enhancement removed and finding comparable or superior policy performance would show the dual-tower design is not required for the gains.
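The decisive experiment is a matched ablation: hold the TTPA/IEPL/OEPL optimization stack fixed and vary only the world-model architecture. A minimal harness sketch, assuming a hypothetical `train_and_eval(variant, task, seed)` function that returns a success rate:

```python
def run_ablation(tasks, seeds, train_and_eval):
    """Average success rate per architecture variant, with the
    TTPA/IEPL/OEPL optimization components held fixed."""
    results = {}
    for variant in ("full_dual_tower", "no_cross_modal_enhancement"):
        rates = [train_and_eval(variant, t, s) for t in tasks for s in seeds]
        results[variant] = sum(rates) / len(rates)
    return results
```

If the `no_cross_modal_enhancement` variant lands within noise of the full model, the dual-tower consistency mechanism is not what drives the gains.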
Original abstract
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoboStereo, a symmetric dual-tower 4D embodied world model that uses bidirectional cross-modal enhancement to enforce spatiotemporal geometric consistency and reduce physics hallucinations. It proposes the first unified framework for world-model-based policy optimization consisting of Test-Time Policy Augmentation (TTPA), Imitative-Evolutionary Policy Learning (IEPL), and Open-Exploration Policy Learning (OEPL), and reports state-of-the-art generation quality together with greater than 97% average relative improvement on fine-grained manipulation tasks.
Significance. If the experimental claims hold after verification, the work could meaningfully advance Embodied AI by supplying a high-fidelity 4D simulator that supports safer policy learning and optimization, thereby reducing reliance on costly real-world rollouts. The unified optimization framework is a potentially useful organizing contribution provided the quantitative gains are shown to be robust and attributable to the world-model fidelity.
Major comments (3)
- [Abstract] The central claim of >97% average relative improvement on fine-grained manipulation tasks is stated without any accompanying baselines, metrics, error bars, or dataset descriptions, rendering the result unverifiable from the manuscript text.
- [Experiments] No ablation isolating the bidirectional cross-modal enhancement module is reported, so it remains unclear whether the claimed policy gains derive from the dual-tower architecture or from the TTPA/IEPL/OEPL optimization components alone.
- [Method, Experiments] The manuscript provides no hallucination-specific quantitative metrics (e.g., 3D geometric consistency error or physics-violation rate) that would directly support the assertion that the architecture alleviates physics hallucinations sufficiently to drive the reported policy improvements.
Minor comments (1)
- [Abstract] A short statement of the concrete tasks, datasets, and evaluation metrics used in the experiments would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below and commit to incorporating revisions that strengthen the verifiability and attribution of our claims. All requested clarifications and additions are feasible within the manuscript structure.
Point-by-point responses
Referee: [Abstract] The central claim of >97% average relative improvement on fine-grained manipulation tasks is stated without any accompanying baselines, metrics, error bars, or dataset descriptions, rendering the result unverifiable from the manuscript text.
Authors: We agree that the abstract should be self-contained for verifiability. In the revised version we will expand the abstract to briefly specify the baselines (standard single-tower world models and direct policy optimization methods), the primary metrics (task success rate with relative improvement), and the datasets (e.g., manipulation benchmarks from RLBench and custom 4D simulation suites), and to note that error bars are reported in the full experimental tables. This addition keeps the abstract concise while making the central claim traceable.
Revision: yes
Referee: [Experiments] No ablation isolating the bidirectional cross-modal enhancement module is reported, so it remains unclear whether the claimed policy gains derive from the dual-tower architecture or from the TTPA/IEPL/OEPL optimization components alone.
Authors: We recognize the value of isolating the architectural contribution. We will add a new ablation subsection that fixes the TTPA/IEPL/OEPL components and compares the full dual-tower RoboStereo against a single-tower baseline and a dual-tower variant without bidirectional cross-modal enhancement. Quantitative results on policy success rates will be reported to demonstrate the incremental benefit of the cross-modal module.
Revision: yes
Referee: [Method, Experiments] The manuscript provides no hallucination-specific quantitative metrics (e.g., 3D geometric consistency error or physics-violation rate) that would directly support the assertion that the architecture alleviates physics hallucinations sufficiently to drive the reported policy improvements.
Authors: While the current manuscript relies on qualitative visualizations and downstream policy gains as indirect evidence, we agree that direct metrics would strengthen the causal link. In the revision we will define and report two new quantitative metrics: 3D geometric consistency error (measured via point-cloud alignment of predicted versus ground-truth 4D trajectories) and physics-violation rate (counting collision and penetration events in simulated rollouts), both evaluated on held-out test sequences. These will be correlated with the observed policy improvements to support the claim.
Revision: yes
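The two metrics promised in this response can be made concrete. The sketch below is only an illustration of plausible definitions, not the paper's: it uses a symmetrized nearest-neighbor distance between predicted and ground-truth point clouds as the geometric consistency error, and a per-frame table-penetration test as the physics-violation rate.

```python
import math

def consistency_error(pred_pts, gt_pts):
    """Chamfer-style proxy: symmetrized mean nearest-neighbor distance
    between a predicted and a ground-truth 3D point cloud."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return 0.5 * (one_way(pred_pts, gt_pts) + one_way(gt_pts, pred_pts))

def physics_violation_rate(frames, table_z=0.0, tol=1e-3):
    """Fraction of rollout frames in which any object point penetrates
    the table plane (z below table_z by more than tol)."""
    bad = sum(1 for pts in frames if any(p[2] < table_z - tol for p in pts))
    return bad / len(frames)
```

A real evaluation would also need correspondence or registration between predicted and ground-truth clouds and a full collision checker; this sketch only fixes what the two numbers mean.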
Circularity Check
No circularity detected; claims rest on experimental results rather than self-referential derivations.
Full rationale
The paper presents RoboStereo as a new dual-tower 4D world model using bidirectional cross-modal enhancement, followed by a unified policy optimization framework (TTPA, IEPL, OEPL). All performance claims, including SOTA generation quality and >97% relative improvement on manipulation tasks, are attributed directly to comprehensive experiments without any visible equations, parameter fits, or first-principles derivations that reduce outputs to inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central argument relies on architectural design choices validated empirically, making the derivation chain self-contained and independent of circular reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Bidirectional cross-modal enhancement ensures spatiotemporal geometric consistency and alleviates physics hallucinations.
Forward citations
Cited by 2 Pith papers
- KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
  KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
- A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
  A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Reference graph
Works this paper leans on
- [1] Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575 (2025)
- [2]
- [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [4] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U.: π0: A vision-language-action flow model for general robot control. arXiv preprint (2024)
- [5] Boyer, C., Bulmus, V., Davis, T.P., Ladmiral, V., Liu, J., Perrier, S.: Bioapplications of RAFT polymerization. Chemical Reviews 109(11), 5402–5436 (2009)
- [6] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
- [7] Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Qin, Z., Zhang, K., Tian, W., Ge, K., Li, H., et al.: WoW: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)
- [8] Dai, Y., Jiang, F., Wang, C., Xu, M., Qi, Y.: FantasyWorld: Geometry-consistent world modeling via unified video and 3D prediction. In: The Fourteenth International Conference on Learning Representations (ICLR) (2026), https://openreview.net/forum?id=3q9vHEqsNx
- [9] Feng, Y., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., Zhu, J.: Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661 (2025)
- [10] Fu, X., Wang, X., Liu, X., Bai, J., Xu, R., Wan, P., Zhang, D., Lin, D.: Learning video generation for robotic manipulation with collaborative trajectory control. arXiv preprint arXiv:2506.01943 (2025)
- [11]
- [12] Jin, X., Xu, Z., Ou, M., Yang, W.: Alignment is all you need: A training-free augmentation strategy for pose-guided video generation. arXiv preprint arXiv:2408.16506 (2024)
- [13] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)
- [14] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- [15] Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)
- [16] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
- [17] LAION-AI: Aesthetic predictor. https://github.com/LAION-AI/aesthetic-predictor (2022), accessed: 2024
- [18] Li, H., Ding, P., Suo, R., Wang, Y., Ge, Z., Zang, D., Yu, K., Sun, M., Zhang, H., Wang, D., et al.: VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators. arXiv preprint arXiv:2510.00406 (2025)
- [19] Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)
- [20] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
- [21] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
- [22] Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025)
- [23] Lu, G., Jia, B., Li, P., Chen, Y., Wang, Z., Tang, Y., Huang, S.: GWM: Towards scalable Gaussian world models for robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)
- [24] Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., Fox, D.: MimicGen: A data generation system for scalable robot learning using human demonstrations. In: 7th Annual Conference on Robot Learning (2023)
- [25] Mi, Z., Wang, Y., Xu, D.: One4D: Unified 4D generation and reconstruction via decoupled LoRA control. arXiv preprint arXiv:2511.18922 (2025)
- [26] Niu, C., et al.: TGRPO: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440 (2025)
- [27] NVIDIA: World simulation with video foundation models for physical AI (2025)
- [28] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195–
- [29] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
- [30] Qian, Z., Chi, X., Li, Y., Wang, S., Han, S., Zhang, S.: WristWorld: Generating wrist-views via 4D world models for robotic manipulation. arXiv preprint arXiv:2510.07313 (2025)
- [31] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
- [32] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)
- [33] Shang, Y., Li, Z., Ma, Y., Su, W., Jin, X., Wang, Z., Jin, L., Zhang, X., Tang, Y., Su, H., Gao, C., Wu, W., Liu, X., Shah, D., Zhang, Z., Chen, Z., Zhu, J., Tian, Y., Chua, T.S., Zhu, W., Li, Y.: WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models (2026)
- [34] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
- [35] Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Zhu, J., Li, K., Xu, M., et al.: GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861 (2025)
- [36] Team, I., Feng, T., Han, Y., He, J., He, Y., Lin, X., Liu, T., Lu, H., Tang, J., Wang, W., et al.: Inferix: A block-diffusion based next-generation inference engine for world simulation. arXiv preprint arXiv:2511.20714 (2025)
- [37] Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale. In: Conference on Robot Learning (CoRL). PMLR (2023)
- [38] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14549–14560 (June 2023)
- [39] Xu, R., Zhang, J., Guo, M., Wen, Y., Yang, H., Lin, M., Huang, J., Li, Z., Zhang, K., Wang, L., et al.: A0: An affordance-aware hierarchical model for general robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13491–13501 (2025)
- [40] Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10371–10381 (2024)
- [41] Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., Liang, L., Wang, M., Wang, Q., Gan, R., Yu, R., Li, S., Liu, S., Chen, S., Chen, V., Xu, Z.: Igniting VLMs toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)
- [42] Zhang, G., Liu, C., Cui, Y., Zhao, X., Ma, K., Wang, L.: VFIMamba: Video frame interpolation with state space models. Advances in Neural Information Processing Systems 37, 107225–107248 (2024)
- [43] Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
- [44] Zhang, K., Xu, R., Ren, P., Lin, J., Wu, H., Lin, L., Liang, X.: RoBridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14590–14601. IEEE (2025)
- [45] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 586–595. IEEE (2018)
- [46] Zhang, R., Sun, Y., Zhang, Z., Li, J., Liu, X., Au, H.F., Guo, H., Yan, P.: MARL-MambaContour: Unleashing multi-agent deep reinforcement learning for active contour optimization in medical image segmentation. In: Proceedings of the 33rd ACM International Conference on Multimedia (MM). pp. 7815–7824. ACM (2025)
- [47] Zhang, R., Zhou, J., Xu, Z., Liu, Z., Huang, J., Zhang, M., Sun, Y., Li, X.: Zero-shot 3D-aware trajectory-guided image-to-video generation via test-time training. arXiv preprint arXiv:2509.06723 (2025)
- [48] Zhang, R., Zhang, M., Zhou, J., Guo, Z., Liu, X., Xu, Z., Zhong, Z., Yan, P., Luo, H., Li, X.: MIND-V: Hierarchical video generation for long-horizon robotic manipulation with RL-based physical alignment. arXiv preprint arXiv:2512.06628 (2025)
- [49] Zhao, L., Lu, X., Hu, B., Ke, W., Wang, L.: GSHOI Denoiser: Denoising Gaussian hand-object interaction for photorealistic rendering. In: 2025 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). pp. 614–623 (2025). https://doi.org/10.1109/ISMAR67309.2025.00071
- [50] Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: TesserAct: Learning 4D embodied world models. arXiv preprint arXiv:2504.20995 (2025)
- [51] Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: IRASim: A fine-grained world model for robot manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9834–9844. IEEE (2025)
- [52] Zhu, F., Yan, Z., Hong, Z., Shou, Q., Ma, X., Guo, S.: WMPO: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515 (2025)