pith. machine review for the scientific record.

arxiv: 2603.12639 · v2 · submitted 2026-03-13 · 💻 cs.CV

Recognition: unknown

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords RoboStereo · 4D world models · embodied AI · policy optimization · robot manipulation · dual-tower architecture · geometric consistency · physics hallucinations

The pith

RoboStereo’s dual-tower 4D world model unifies policy optimization and reports an average relative improvement of over 97 percent on fine-grained robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboStereo as a symmetric dual-tower 4D world model that applies bidirectional cross-modal enhancement to maintain spatiotemporal geometric consistency and reduce physics hallucinations during imagined rollouts. It then builds the first unified framework for world-model-based policy optimization through three components: test-time policy augmentation for verification, imitative-evolutionary learning from expert demonstrations using visual rewards, and open-exploration learning for autonomous skill discovery. This addresses the prohibitive costs and safety risks of real-world robot interaction by shifting optimization to high-fidelity simulated environments. A sympathetic reader would care because reliable world models could make scalable embodied AI practical without constant physical trials.

Core claim

We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization consisting of Test-Time Policy Augmentation, Imitative-Evolutionary Policy Learning, and Open-Exploration Policy Learning.

What carries the argument

The symmetric dual-tower 4D world model with bidirectional cross-modal enhancement, which enforces spatiotemporal geometric consistency across modalities to support reliable policy rollouts.
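
The paper's code is not part of this review, but the mechanism is easy to picture. A minimal PyTorch sketch of what symmetric, bidirectional cross-attention between an RGB tower and a pointmap tower could look like; every name and shape here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of bidirectional cross-modal enhancement between
# an RGB tower and an XYZ-pointmap tower. Not the authors' code: a
# minimal illustration of symmetric cross-attention between two streams.
import torch
import torch.nn as nn

class BidirectionalCrossModalBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # RGB tokens attend to geometry tokens, and vice versa.
        self.rgb_from_xyz = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.xyz_from_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(d_model)
        self.norm_xyz = nn.LayerNorm(d_model)

    def forward(self, rgb_tokens, xyz_tokens):
        # Each tower queries the other modality and adds the result
        # residually, coupling appearance and geometry at every layer.
        rgb_ctx, _ = self.rgb_from_xyz(rgb_tokens, xyz_tokens, xyz_tokens)
        xyz_ctx, _ = self.xyz_from_rgb(xyz_tokens, rgb_tokens, rgb_tokens)
        return self.norm_rgb(rgb_tokens + rgb_ctx), self.norm_xyz(xyz_tokens + xyz_ctx)

# Usage: token sequences shaped (batch, tokens, d_model) from each tower.
rgb = torch.randn(2, 196, 256)
xyz = torch.randn(2, 196, 256)
rgb_out, xyz_out = BidirectionalCrossModalBlock()(rgb, xyz)
```

The symmetry (each tower both queries and is queried) is what would make the enhancement "bidirectional"; a one-way variant would let geometry inform appearance but not the reverse.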

If this is right

  • Test-Time Policy Augmentation allows verification of policies before real execution (a minimal sketch follows this list).
  • Imitative-Evolutionary Policy Learning lets agents improve from expert demonstrations via visual perceptual rewards.
  • Open-Exploration Policy Learning supports autonomous skill discovery and self-correction.
  • The approach yields state-of-the-art generation quality with over 97 percent average relative improvement on fine-grained manipulation tasks.
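
To make the pre-execution verification concrete: a world model this reliable can score candidate action chunks before anything runs on hardware. A hedged sketch, where `sample_action_chunk`, `rollout`, and `score` are hypothetical stand-ins rather than the paper's interfaces.

```python
# Hypothetical sketch of test-time policy augmentation: sample several
# candidate action chunks, roll each out in the world model, and execute
# only the best-scoring one on the real robot.
def select_action_chunk(policy, world_model, verifier, obs, n_candidates=8):
    best_chunk, best_score = None, float("-inf")
    for _ in range(n_candidates):
        chunk = policy.sample_action_chunk(obs)     # candidate actions
        imagined = world_model.rollout(obs, chunk)  # imagined observations
        score = verifier.score(imagined)            # predicted task success
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk  # executed for real only after verification
```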

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the consistency mechanism scales, the same dual-tower structure could support longer-horizon tasks without additional real data.
  • Lower hallucination rates might allow policy training to rely almost entirely on simulated rollouts rather than mixed real-sim data.
  • The three-part unified framework could transfer to other simulation-heavy domains such as autonomous driving or virtual agents.

Load-bearing premise

The bidirectional cross-modal enhancement ensures enough spatiotemporal geometric consistency and reduces physics hallucinations for the claimed policy improvements to hold.

What would settle it

Running the same manipulation tasks with the bidirectional cross-modal enhancement removed and finding comparable or superior policy performance would show the dual-tower design is not required for the gains.
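
Concretely, such an ablation needs nothing beyond holding the rest of the stack fixed. A minimal sketch, assuming hypothetical `build_model` and `evaluate` helpers:

```python
# Hypothetical ablation harness: identical tasks and policy-optimization
# stack, differing only in whether the cross-modal module is enabled.
for use_cross_modal in (True, False):
    model = build_model(cross_modal_enhancement=use_cross_modal)
    success = evaluate(model, tasks=MANIPULATION_TASKS, seeds=range(5))
    print(f"cross_modal={use_cross_modal}: success={success:.3f}")
```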

Figures

Figures reproduced from arXiv: 2603.12639 by Guangyu Chen, Jun Zhou, Mingyang Zhang, Ruicheng Zhang, Xiu Li, Zhizhou Zhong, Zihao Liu, Zunnan Xu.

Figure 1
Figure 1. (a) Qualitative comparison of RoboStereo against SOTA Embodied World Models (EWMs). (b) Quantitative comparison of the unified policy optimization framework against traditional paradigms. view at source ↗
Figure 2
Figure 2. RoboStereo architecture. Symmetric dual DiT towers (a) process RGB and XYZ pointmaps via bidirectional cross-attention for visual-geometric fusion (b), with a Gaussian head for flexible-viewpoint rendering. A dual-path action-conditioned timestep embedding mechanism (c) ensures precise frame-level trajectory control. view at source ↗
Figure 3
Figure 3. Illustration of Test-Time Policy Augmentation (TTPA). RoboStereo functions as a learned world model $p_\phi$ that predicts future visual observations conditioned on the current state and action chunk: given an initial state $\mathbf{s}_0$ and an action sequence $\{\mathbf{a}_0, \ldots, \mathbf{a}_{T-1}\}$, the world model synthesizes an imagined visual trajectory $\hat{\tau} = \{\hat{\mathbf{s}}_1, \ldots, \hat{\mathbf{s}}_T\}$. view at source ↗
Figure 4
Figure 4. Illustration of Imitative-Evolutionary Policy Learning (IEPL). This paradigm learns from expert demonstrations through trajectory matching via reinforcement learning, guided by a novel 4D visual imitation reward based on the perceptual alignment between expert and policy-induced trajectories. view at source ↗
Figure 5
Figure 5. Illustration of Open-Exploration Policy Learning (OEPL). The accompanying text describes IEPL's iterative rollout: the current VLA policy $\pi_\theta$ and expert actions are executed in RoboStereo to generate paired imagined trajectories $\hat{\tau}_{\text{policy}}$ and $\hat{\tau}_{\text{expert}}$, and every $k$-th frame ($k = 8$) from $\hat{\tau}_{\text{policy}}$ is fed back as an updated observation, creating closed-loop imitation. view at source ↗
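
The every-k-th-frame feedback above is what makes the imitation closed-loop. A minimal sketch, assuming the world model and policy expose simple single-step interfaces (`predict_next` and `act` are hypothetical names, not the paper's API):

```python
# Hypothetical sketch of the closed-loop imagined rollout described for
# IEPL: every k-th imagined frame is fed back to the policy as its new
# observation, rather than rolling out fully open-loop.
def closed_loop_rollout(policy, world_model, obs, horizon, k=8):
    frames = []
    for t in range(horizon):
        action = policy.act(obs)
        frame = world_model.predict_next(obs, action)  # imagined frame
        frames.append(frame)
        if (t + 1) % k == 0:
            obs = frame  # feed the imagined frame back as the observation
    return frames
```
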
Figure 6
Figure 6. Training procedures for IEPL and OEPL. IEPL (left) learns through visual imitation by matching policy rollouts with expert trajectories; OEPL (right) learns through open exploration guided by a discriminator-based reward model. As in IEPL, the policy is optimized via GRPO: in each iteration, $K$ trajectories $\{\tau^{(k)}\}_{k=1}^{K}$ are sampled from the current policy and their cumulative rewards are computed. view at source ↗
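
For readers unfamiliar with GRPO (group relative policy optimization, introduced in [34]): it avoids a learned critic by normalizing each sampled trajectory's reward against the group's statistics. A minimal sketch of just that normalization, assuming one scalar cumulative reward per trajectory has already been computed by whichever reward model applies (visual imitation for IEPL, the discriminator for OEPL):

```python
# Hypothetical sketch of the group-relative advantage at the heart of
# GRPO: normalize each trajectory's cumulative reward by the group's
# mean and standard deviation.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: shape (K,), one cumulative reward per sampled trajectory
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([0.2, 0.9, 0.5, 0.4]))
# Trajectories above the group mean receive positive advantages and are
# reinforced; those below the mean are suppressed.
```
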
Figure 7
Figure 7. Visualizations of the 4D Gaussian representations, RGB videos, and depth maps produced by RoboStereo. RoboStereo generates precise future trajectories conditioned on action instructions, exhibiting high visual fidelity and geometric consistency. view at source ↗
Figure 8
Figure 8. Inference speed comparison.

    Method              Speed (FPS) ↑
    WoW [7]             0.05
    MIND-V [48]         0.38
    IRASim [51]         0.53
    RoboMaster [10]     0.63
    Ours (Video Tower)  1.50

Beyond generation quality, practical embodied simulation requires high inference efficiency to enable large-scale trajectory rollouts; the paper compares the inference speed (frames per second, FPS) of its video generation model against several SOTA baselines. view at source ↗
Figure 9
Figure 9. Visualization of fine-grained geometric reasoning enabled by IEPL. view at source ↗
Figure 11
Figure 11. Qualitative results of real-world deployment. view at source ↗
original abstract

Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces RoboStereo, a symmetric dual-tower 4D embodied world model that uses bidirectional cross-modal enhancement to enforce spatiotemporal geometric consistency and reduce physics hallucinations. It proposes the first unified framework for world-model-based policy optimization consisting of Test-Time Policy Augmentation (TTPA), Imitative-Evolutionary Policy Learning (IEPL), and Open-Exploration Policy Learning (OEPL), and reports state-of-the-art generation quality together with greater than 97% average relative improvement on fine-grained manipulation tasks.

Significance. If the experimental claims hold after verification, the work could meaningfully advance Embodied AI by supplying a high-fidelity 4D simulator that supports safer policy learning and optimization, thereby reducing reliance on costly real-world rollouts. The unified optimization framework is a potentially useful organizing contribution provided the quantitative gains are shown to be robust and attributable to the world-model fidelity.

major comments (3)
  1. [Abstract] The central claim of >97% average relative improvement on fine-grained manipulation tasks is stated without accompanying baselines, metrics, error bars, or dataset descriptions, rendering the result unverifiable from the manuscript text.
  2. [Experiments] No ablation isolating the bidirectional cross-modal enhancement module is reported, so it remains unclear whether the claimed policy gains derive from the dual-tower architecture or from the TTPA/IEPL/OEPL optimization components alone.
  3. [Method] The manuscript provides no hallucination-specific quantitative metrics (e.g., 3D geometric consistency error or physics-violation rate) that would directly support the assertion that the architecture alleviates physics hallucinations sufficiently to drive the reported policy improvements.
minor comments (1)
  1. [Abstract] A short statement of the concrete tasks, datasets, and evaluation metrics used in the experiments would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and commit to incorporating revisions that strengthen the verifiability and attribution of our claims. All requested clarifications and additions are feasible within the manuscript structure.

point-by-point responses
  1. Referee: [Abstract] The central claim of >97% average relative improvement on fine-grained manipulation tasks is stated without accompanying baselines, metrics, error bars, or dataset descriptions, rendering the result unverifiable from the manuscript text.

    Authors: We agree that the abstract should be self-contained for verifiability. In the revised version we will expand the abstract to briefly specify the baselines (standard single-tower world models and direct policy optimization methods), the primary metrics (task success rate with relative improvement), the datasets (e.g., manipulation benchmarks from RLBench and custom 4D simulation suites), and note that error bars are reported in the full experimental tables. This addition keeps the abstract concise while making the central claim traceable. revision: yes

  2. Referee: [Experiments] No ablation isolating the bidirectional cross-modal enhancement module is reported, so it remains unclear whether the claimed policy gains derive from the dual-tower architecture or from the TTPA/IEPL/OEPL optimization components alone.

    Authors: We recognize the value of isolating the architectural contribution. We will add a new ablation subsection that fixes the TTPA/IEPL/OEPL components and compares the full dual-tower RoboStereo against a single-tower baseline and a dual-tower variant without bidirectional cross-modal enhancement. Quantitative results on policy success rates will be reported to demonstrate the incremental benefit of the cross-modal module. revision: yes

  3. Referee: [Method] The manuscript provides no hallucination-specific quantitative metrics (e.g., 3D geometric consistency error or physics-violation rate) that would directly support the assertion that the architecture alleviates physics hallucinations sufficiently to drive the reported policy improvements.

    Authors: While the current manuscript relies on qualitative visualizations and downstream policy gains as indirect evidence, we agree that direct metrics would strengthen the causal link. In the revision we will define and report two new quantitative metrics—3D geometric consistency error (measured via point-cloud alignment on predicted vs. ground-truth 4D trajectories) and physics-violation rate (counting collisions and penetration events in simulated rollouts)—evaluated on held-out test sequences. These will be correlated with the observed policy improvements to support the claim. revision: yes
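
Both promised metrics admit short reference implementations. A sketch of plausible definitions, assuming per-frame predicted and ground-truth point clouds and a simulator-provided violation predicate; the function names are illustrative, not the authors':

```python
# Hypothetical definitions of the two metrics promised in the rebuttal.
import numpy as np

def geometric_consistency_error(pred_pts, gt_pts):
    # Symmetric nearest-neighbor (Chamfer-style) distance between a
    # predicted and a ground-truth point cloud, each of shape (N, 3).
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def physics_violation_rate(frames, violates):
    # Fraction of rollout frames flagged by a simulator-provided
    # predicate (e.g., object interpenetration or collision).
    return sum(violates(f) for f in frames) / max(len(frames), 1)
```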

Circularity Check

0 steps flagged

No circularity detected; claims rest on experimental results rather than self-referential derivations

full rationale

The paper presents RoboStereo as a new dual-tower 4D world model using bidirectional cross-modal enhancement, followed by a unified policy optimization framework (TTPA, IEPL, OEPL). All performance claims, including SOTA generation quality and >97% relative improvement on manipulation tasks, are attributed directly to comprehensive experiments without any visible equations, parameter fits, or first-principles derivations that reduce outputs to inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central argument relies on architectural design choices validated empirically, making the derivation chain self-contained and independent of circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claims rest on the unproven assumption that cross-modal enhancement produces consistent 4D geometry.

axioms (1)
  • domain assumption Bidirectional cross-modal enhancement ensures spatiotemporal geometric consistency and alleviates physics hallucinations
    Invoked in the abstract as the mechanism enabling high-fidelity simulation but without supporting derivation or evidence.

pith-pipeline@v0.9.0 · 5497 in / 1175 out tokens · 48675 ms · 2026-05-15T12:09:52.448569+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  2. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 2 Pith papers · 11 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  2. [2]

    Markov Chain and its Applications

    Agbinya, J.I.: Markov Chain and its Applications, pp. 1–16 (2019)

  3. [3]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U.: π0: A vision-language-action flow model for general robot control. arXiv pre...

  5. [5]

    Bioapplications of RAFT Polymerization

    Boyer, C., Bulmus, V., Davis, T.P., Ladmiral, V., Liu, J., Perrier, S.: Bioapplications of RAFT polymerization. Chemical Reviews 109(11), 5402–5436 (2009)

  6. [6]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  7. [7]

    WoW: Towards a World Omniscient World Model through Embodied Interaction

    Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Qin, Z., Zhang, K., Tian, W., Ge, K., Li, H., et al.: WoW: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)

  8. [8]

    FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

    Dai, Y., Jiang, F., Wang, C., Xu, M., Qi, Y.: FantasyWorld: Geometry-consistent world modeling via unified video and 3D prediction. In: The Fourteenth International Conference on Learning Representations (ICLR) (2026), https://openreview.net/forum?id=3q9vHEqsNx

  9. [9]

    Vidarc: Embodied Video Diffusion Model for Closed-Loop Control

    Feng, Y., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., Zhu, J.: Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661 (2025)

  10. [10]

    Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

    Fu, X., Wang, X., Liu, X., Bai, J., Xu, R., Wan, P., Zhang, D., Lin, D.: Learning video generation for robotic manipulation with collaborative trajectory control. arXiv preprint arXiv:2506.01943 (2025)

  11. [11]

    SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

    Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y., Ou, M., Ji, B., Li, X., Yuan, K.: SAM-R1: Leveraging SAM for reward feedback in multimodal segmentation via reinforcement learning (2025), https://arxiv.org/abs/2505.22596

  12. [12]

    Alignment is All You Need: A Training-Free Augmentation Strategy for Pose-Guided Video Generation

    Jin, X., Xu, Z., Ou, M., Yang, W.: Alignment is all you need: A training-free augmentation strategy for pose-guided video generation. arXiv preprint arXiv:2408.16506 (2024)

  13. [13]

    MUSIQ: Multi-Scale Image Quality Transformer

    Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 5148–5157 (2021)

  14. [14]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  15. [15]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

  16. [16]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  17. [17]

    Aesthetic Predictor

    LAION-AI: Aesthetic predictor. https://github.com/LAION-AI/aesthetic-predictor (2022), accessed 2024

  18. [18]

    VLA-RFT: Vision-Language-Action Reinforcement Fine-Tuning with Verified Rewards in World Simulators

    Li, H., Ding, P., Suo, R., Wang, Y., Ge, Z., Zang, D., Yu, K., Sun, M., Zhang, H., Wang, D., et al.: VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators. arXiv preprint arXiv:2510.00406 (2025)

  19. [19]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

  20. [20]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  21. [21]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  22. [22]

    Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

    Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025)

  23. [23]

    GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

    Lu, G., Jia, B., Li, P., Chen, Y., Wang, Z., Tang, Y., Huang, S.: GWM: Towards scalable Gaussian world models for robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

  24. [24]

    MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations

    Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., Fox, D.: MimicGen: A data generation system for scalable robot learning using human demonstrations. In: 7th Annual Conference on Robot Learning (CoRL) (2023)

  25. [25]

    One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control

    Mi, Z., Wang, Y., Xu, D.: One4D: Unified 4D generation and reconstruction via decoupled LoRA control. arXiv preprint arXiv:2511.18922 (2025)

  26. [26]

    TGRPO: Fine-Tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

    Niu, C., et al.: TGRPO: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440 (2025)

  27. [27]

    World Simulation with Video Foundation Models for Physical AI

    NVIDIA: World simulation with video foundation models for physical AI (2025)

  28. [28]

    Scalable Diffusion Models with Transformers

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195–4205 (2023)

  29. [29]

    FiLM: Visual Reasoning with a General Conditioning Layer

    Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)

  30. [30]

    WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

    Qian, Z., Chi, X., Li, Y., Wang, S., Han, S., Zhang, S.: WristWorld: Generating wrist-views via 4D world models for robotic manipulation. arXiv preprint arXiv:2510.07313 (2025)

  31. [31]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). pp. 8748–8763. PMLR (2021)

  32. [32]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)

  33. [33]

    WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

    Shang, Y., Li, Z., Ma, Y., Su, W., Jin, X., Wang, Z., Jin, L., Zhang, X., Tang, Y., Su, H., Gao, C., Wu, W., Liu, X., Shah, D., Zhang, Z., Chen, Z., Zhu, J., Tian, Y., Chua, T.S., Zhu, W., Li, Y.: WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models (2026)

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  35. [35]

    GigaWorld-0: World Models as Data Engine to Empower Embodied AI

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Zhu, J., Li, K., Xu, M., et al.: GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861 (2025)

  36. [36]

    Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

    Team, I., Feng, T., Han, Y., He, J., He, Y., Lin, X., Liu, T., Lu, H., Tang, J., Wang, W., et al.: Inferix: A block-diffusion based next-generation inference engine for world simulation. arXiv preprint arXiv:2511.20714 (2025)

  37. [37]

    BridgeData V2: A Dataset for Robot Learning at Scale

    Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale. In: Conference on Robot Learning (CoRL). PMLR (2023)

  38. [38]

    VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

    Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14549–14560 (June 2023)

  39. [39]

    A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

    Xu, R., Zhang, J., Guo, M., Wen, Y., Yang, H., Lin, M., Huang, J., Li, Z., Zhang, K., Wang, L., et al.: A0: An affordance-aware hierarchical model for general robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13491–13501 (2025)

  40. [40]

    Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10371–10381 (2024)

  41. [41]

    Igniting VLMs Toward the Embodied Space

    Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., Liang, L., Wang, M., Wang, Q., Gan, R., Yu, R., Li, S., Liu, S., Chen, S., Chen, V., Xu, Z.: Igniting VLMs toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)

  42. [42]

    VFIMamba: Video Frame Interpolation with State Space Models

    Zhang, G., Liu, C., Cui, Y., Zhao, X., Ma, K., Wang, L.: VFIMamba: Video frame interpolation with state space models. Advances in Neural Information Processing Systems 37, 107225–107248 (2024)

  43. [43]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)

  44. [44]

    RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

    Zhang, K., Xu, R., Ren, P., Lin, J., Wu, H., Lin, L., Liang, X.: RoBridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14590–14601. IEEE (2025)

  45. [45]

    The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 586–595. IEEE (2018)

  46. [46]

    MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation

    Zhang, R., Sun, Y., Zhang, Z., Li, J., Liu, X., Au, H.F., Guo, H., Yan, P.: MARL-MambaContour: Unleashing multi-agent deep reinforcement learning for active contour optimization in medical image segmentation. In: Proceedings of the 33rd ACM International Conference on Multimedia (MM). pp. 7815–7824. ACM (2025)

  47. [47]

    Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

    Zhang, R., Zhou, J., Xu, Z., Liu, Z., Huang, J., Zhang, M., Sun, Y., Li, X.: Zero-shot 3D-aware trajectory-guided image-to-video generation via test-time training. arXiv preprint arXiv:2509.06723 (2025)

  48. [48]

    MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-Based Physical Alignment

    Zhang, R., Zhang, M., Zhou, J., Guo, Z., Liu, X., Xu, Z., Zhong, Z., Yan, P., Luo, H., Li, X.: MIND-V: Hierarchical video generation for long-horizon robotic manipulation with RL-based physical alignment. arXiv preprint arXiv:2512.06628 (2025)

  49. [49]

    GSHOI Denoiser: Denoising Gaussian Hand-Object Interaction for Photorealistic Rendering

    Zhao, L., Lu, X., Hu, B., Ke, W., Wang, L.: GSHOI denoiser: Denoising Gaussian hand-object interaction for photorealistic rendering. In: 2025 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). pp. 614–623 (2025). https://doi.org/10.1109/ISMAR67309.2025.00071

  50. [50]

    TesserAct: Learning 4D Embodied World Models

    Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: TesserAct: Learning 4D embodied world models. arXiv preprint arXiv:2504.20995 (2025)

  51. [51]

    IRASim: A Fine-Grained World Model for Robot Manipulation

    Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: IRASim: A fine-grained world model for robot manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9834–9844. IEEE (2025)

  52. [52]

    WMPO: World Model-Based Policy Optimization for Vision-Language-Action Models

    Zhu, F., Yan, Z., Hong, Z., Shou, Q., Ma, X., Guo, S.: WMPO: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515 (2025)