
arxiv: 2605.16713 · v1 · pith:FBJ5XBK3new · submitted 2026-05-15 · 💻 cs.CV · cs.AI

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Pith reviewed 2026-05-20 17:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · spatial reasoning · world models · feature distillation · geometric alignment · 3D cues · visual pathway

The pith

GeoWorld-VLM improves spatial reasoning in VLMs by aligning their image features with representations from frozen world models, without changing the language backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often fail at simple spatial relations such as left, right, or behind because their visual pathway can lose important 3D information during feature extraction. The paper introduces GeoWorld-VLM, which addresses this by distilling geometric structure from pre-trained camera-conditioned video world models. It fine-tunes only the image encoder and the projector, aligning their output features with the world model's intermediate representations while keeping the language model frozen. Training combines answer supervision, feature matching, and a preservation term that anchors the features to the original VLM. Tests on two different VLM architectures show improvements of roughly 4 percent on the What'sUp and VSR benchmarks.
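The split between trainable and frozen components can be sketched as a simple partition of parameter names. This is a minimal illustration only: the prefixes `vision_encoder.`, `projector.`, and `language_model.` are hypothetical names, not taken from the paper or from any particular VLM checkpoint.

```python
# Hypothetical parameter-name prefixes; real VLM checkpoints use their own naming.
TRAINABLE_PREFIXES = ("vision_encoder.", "projector.")


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists by prefix."""
    trainable, frozen = [], []
    for name in param_names:
        # str.startswith accepts a tuple of prefixes.
        if name.startswith(TRAINABLE_PREFIXES):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Illustrative names only.
names = [
    "vision_encoder.blocks.0.attn.weight",
    "projector.linear.weight",
    "language_model.layers.0.mlp.weight",
    "language_model.lm_head.weight",
]
trainable, frozen = split_parameters(names)
```

In a real training setup the same partition would decide which parameters receive gradients, so that the language model stays untouched.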

Core claim

GeoWorld-VLM transfers geometric structure from frozen camera-conditioned video world models into VLMs by aligning post-projector image features with the world model's intermediate representations. Given an image, prompt, and sampled camera trajectory, the world model provides a synthetic multi-view spatial signal. Training uses spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM, with only the encoder and projector updated.
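As a rough illustration of how the three training terms might be combined, the sketch below adds a task loss, an alignment loss, and a preservation loss. The use of cosine distance for alignment, mean squared error for preservation, the weights `w_align` and `w_preserve`, and all function names are assumptions made for illustration; this summary does not specify them. The sketch also assumes that student and teacher features have already been brought to the same dimension.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the correct answer under a softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def total_loss(answer_logits, answer_index,
               student_feat, teacher_feat, original_feat,
               w_align=1.0, w_preserve=1.0):
    """Weighted sum of the three terms (weights are assumed hyperparameters)."""
    task = cross_entropy(answer_logits, answer_index)    # spatial answer supervision
    align = cosine_distance(student_feat, teacher_feat)  # teacher-student alignment
    preserve = mse(student_feat, original_feat)          # anchor to the original VLM
    return task + w_align * align + w_preserve * preserve
```

When the student features already match both the teacher and the original VLM, only the task term remains, which makes the role of each term easy to inspect in isolation.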

What carries the argument

Feature alignment between the VLM's post-projector image features and intermediate representations of the frozen world model, which turns a static image into a multi-view spatial signal by following a sampled camera trajectory.

If this is right

  • Improved spatial judgments result from enhanced visual representations rather than changes to language processing.
  • The method improves two distinct VLM architectures, which suggests it generalizes across backbones.
  • Original linguistic capabilities are preserved because the language model stays frozen.
  • Performance gains appear on two spatial reasoning benchmarks, What'sUp and VSR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • World models could be used as geometric teachers for other tasks requiring 3D understanding beyond spatial relations in images.
  • Combining this with more advanced or larger world models might lead to further improvements in VLM spatial performance.
  • This suggests a broader strategy of using specialized models to inject missing structures into general-purpose models without full retraining.

Load-bearing premise

The intermediate representations of the frozen world models contain geometric structure that can be transferred to the VLM, and this transferred structure is what causes the better spatial reasoning.

What would settle it

If the performance gains on spatial benchmarks persist after the feature alignment with world-model representations is removed and the other training components are kept, the central claim would be falsified.
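One way to read such an ablation is to ask how much of the full gain disappears once alignment is removed. The helper below is a hypothetical sketch of that arithmetic; the function name and the accuracy values are illustrative placeholders, not results from the paper.

```python
def alignment_share(baseline_acc, ablated_acc, full_acc):
    """Fraction of the full gain over baseline that is lost without alignment.

    baseline_acc: accuracy of the unmodified VLM
    ablated_acc:  accuracy with answer supervision + preservation only
    full_acc:     accuracy with all three training terms
    """
    full_gain = full_acc - baseline_acc
    if full_gain <= 0:
        raise ValueError("full model shows no gain over baseline")
    return (full_acc - ablated_acc) / full_gain


# Placeholder numbers for illustration only (not from the paper).
share = alignment_share(baseline_acc=70.0, ablated_acc=71.0, full_acc=74.0)
```

A share near 1 would indicate that the gain depends on world-model alignment, while a share near 0 would indicate that spatial fine-tuning alone explains it.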

Figures

Figures reproduced from arXiv: 2605.16713 by Kaichen Zhou, Mengyu Wang, Renjie Gu, Yan Luo.

Figure 1: Overview. Given an input image and a spatial reasoning question, GeoWorld-VLM enhances the spatial understanding of standard vision-language models by injecting world-model priors at the feature-map level. Compared with the original VLM features, GeoWorld-VLM produces more geometry-aware representations, leading to clearer spatial grounding and improved answer accuracy. As shown on the right, our method…

Figure 2: GeoWorld-VLM. During training, GeoWorld-VLM fine-tunes only the vision blocks including vision encoder and multimodal projector. It aligns the latent features produced by the VLM vision encoder with intermediate world-model representations, where the world model takes the input image, text prompt, and randomly sampled camera poses as input. At inference time, GeoWorld-VLM no longer requires the world model…

Figure 3: Qualitative comparison. We compare the predictions of the original Gemma4, task-only fine-tuned Gemma4, and GeoWorld-VLM on representative spatial reasoning examples. GeoWorld-VLM produces more spatially consistent answers than the baselines, illustrating the benefit of injecting camera-conditioned world-model supervision into the VLM visual pathway.

Figure 4: Overall comparison on the What’sUp+VSR suite. GeoWorld-VLM improves spatial reasoning performance across diverse sub-benchmarks, showing consistent gains over the original VLM, task-only fine-tuning, and DINOv3-based static feature distillation.
Original abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GeoWorld-VLM, a VLM-side distillation method that transfers geometric structure from frozen camera-conditioned video world models into the image encoder and multimodal projector of VLMs. Only these visual components are fine-tuned via a combination of spatial answer supervision, teacher-student feature alignment to world-model intermediates, and a preservation anchor to the original VLM, while the language model remains frozen. The authors apply the approach to two distinct VLM architectures and report consistent ~4% gains on the What'sUp and VSR spatial reasoning benchmarks.

Significance. If the gains are shown to arise specifically from transferable 3D structure in the world-model representations rather than incidental effects of spatial fine-tuning, the framework could provide an efficient route to bolstering spatial capabilities in existing VLMs without retraining the language backbone or sacrificing semantic performance. The consistency across two backbones and the use of external frozen world models as geometric teachers are notable strengths of the empirical design.

major comments (2)
  1. [Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.
  2. [Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'approximately 4 percent' should be accompanied by the precise metric (e.g., accuracy delta) and the identity of the baseline model for each benchmark.
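The ambiguity in "approximately 4 percent" is the difference between an absolute gain in percentage points and a relative gain over the baseline. The short sketch below shows both readings; the function name and the numbers are illustrative placeholders, not values from the paper.

```python
def accuracy_deltas(baseline_acc, new_acc):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = new_acc - baseline_acc
    relative_pct = 100.0 * absolute_pp / baseline_acc
    return absolute_pp, relative_pct


# Placeholder accuracies for illustration only (not from the paper).
absolute_pp, relative_pct = accuracy_deltas(baseline_acc=50.0, new_acc=52.0)
# Here the same change is 2.0 percentage points but a 4.0 percent relative gain.
```

Stating which of the two is meant, and against which baseline, removes the ambiguity the referee points to.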

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the suggested ablation and additional methodological details will strengthen the paper and will incorporate them in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.

    Authors: We agree that this ablation is important to isolate the contribution of geometric transfer. In the revised manuscript we will add a control experiment that retains spatial answer supervision and the preservation anchor but removes the feature-alignment term to world-model intermediates. Performance of this variant will be reported alongside the full GeoWorld-VLM results on both benchmarks and both architectures to demonstrate that the observed gains require the world-model alignment. revision: yes

  2. Referee: [Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.

    Authors: We will expand the Method and Evaluation sections to address these points. Spatial answers are taken directly from the ground-truth annotations of the What'sUp and VSR datasets; we will describe the exact prompt templates and answer formats used. Baselines consist of the two unmodified original VLM architectures plus relevant prior spatial-reasoning methods. We will also report statistical significance by including standard deviations across multiple random seeds and, where appropriate, paired statistical tests to support the reliability of the reported improvements. revision: yes
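A paired test over random seeds could look like the following exact sign-flip permutation test. This is a sketch under stated assumptions: the choice of test, the function name, and the per-seed accuracies are illustrative and are not taken from the paper or the rebuttal.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_p_value(baseline_scores, method_scores):
    """Exact two-sided sign-flip permutation test on per-seed paired differences."""
    diffs = [m - b for b, m in zip(baseline_scores, method_scores)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder per-seed accuracies for illustration only (not from the paper).
baseline = [70.0, 71.0, 69.0, 70.0, 70.0]
method = [74.0, 75.0, 73.0, 74.0, 74.0]
p_value = paired_sign_flip_p_value(baseline, method)
```

With n seeds the smallest achievable two-sided p-value in this test is 2 / 2^n, so five seeds cannot go below 0.0625 even when every seed improves; at least six seeds are needed to reach p < 0.05.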

Circularity Check

0 steps flagged

No derivation chain or self-referential reductions present

Full rationale

The paper presents an empirical VLM fine-tuning framework that aligns image features with intermediate representations from externally frozen camera-conditioned world models. No equations, derivations, fitted parameters, or predictions are defined or claimed anywhere in the abstract or method description. Improvements on What'sUp and VSR are reported as observed benchmark outcomes after training, not as quantities derived from the method's own inputs by construction. The approach relies on external pre-trained models and standard supervision signals rather than any self-citation chain, ansatz smuggling, or renaming of known results, rendering the reported gains independent of internal circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that world-model intermediate features encode usable 3D geometry for static images; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Camera-conditioned video world models produce intermediate representations that encode transferable 3D spatial structure from single images plus a sampled trajectory.
    This premise is required for the teacher-student alignment step to supply useful geometric information.

pith-pipeline@v0.9.0 · 5790 in / 1257 out tokens · 82122 ms · 2026-05-20T17:51:39.800934+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 22 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    The Claude 3 model family: Opus, Sonnet, Haiku

    Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic, 2024

  3. [3]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024

  6. [6]

    Genie: Generative Interactive Environments

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  7. [7]

    Emerging Properties in Self-Supervised Vision Transformers

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021

  8. [8]

    S. Chen, T. Zhu, R. Zhou, J. Zhang, S. Gao, J. C. Niebles, M. Geva, J. He, J. Wu, and M. Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas, 2025

  9. [9]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations (ICLR), 2023

  10. [10]

    J. H. Cho, B. Ivanovic, Y. Cao, E. Schmerling, Y. Wang, X. Weng, B. Li, Y. You, P. Krähenbühl, Y. Wang, et al. Language-image models with 3D understanding. arXiv preprint arXiv:2405.03685, 2024

  11. [11]

    W. Chow, J. Mao, B. Li, D. Seita, V. Guizilini, and Y. Wang. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding. arXiv preprint arXiv:2501.16411, 2025

  12. [12]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [13]

    M. Du, B. Wu, Z. Li, X. Huang, and Z. Wei. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 2: Short Papers, 2024

  14. [14]

    X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision (ECCV), 2024

  15. [15]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  16. [16]

    Mastering Diverse Domains through World Models

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  17. [17]

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan. 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023

  18. [19]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  19. [20]

    What’s “Up” with Vision-Language Models? Investigating Their Struggle with Spatial Reasoning

    A. Kamath, J. Hessel, and K.-W. Chang. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785, 2023

  20. [21]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V. Birodkar, J. Yan, M.-C. Chiu, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  21. [22]

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  22. [23]

    P. Y. Lee, J. Je, C. Park, M. A. Uy, L. Guibas, and M. Sung. Perspective-aware reasoning in vision-language models via mental imagery simulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9241–9251, 2025

  23. [24]

    J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), 2023

  24. [25]

    B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

  25. [26]

    T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context, 2015

  26. [27]

    F. Liu, G. Emerson, and N. Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  27. [28]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  28. [29]

    Y. Liu, F. Zhan, K. Zhou, Y. Du, P. P. Liang, and H. Pfister. Abstract 3D perception for spatial intelligence in vision-language models. arXiv preprint arXiv:2511.10946, 2025

  29. [30]

    G. Luo, G. Yang, Z. Gong, G. Chen, H. Duan, E. Cui, R. Tong, Z. Hou, T. Zhang, Z. Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025

  30. [31]

    Introducing GPT-5

    OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2026-05-04

  31. [32]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  32. [33]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  33. [34]

    Learning Transferable Visual Models from Natural Language Supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  34. [35]

    A. Ray, J. Duan, E. Brown, R. Tan, D. Bashkirova, R. Hendrix, K. Ehsani, A. Kembhavi, B. A. Plummer, R. Krishna, K.-H. Zeng, and K. Saenko. SAT: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

  35. [36]

    DINOv3

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  36. [37]

    Y. Tang, X. Han, X. Li, Q. Yu, Y. Hao, L. Hu, and M. Chen. MiniGPT-3D: Efficiently aligning 3D point clouds with large language models using 2D priors. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024

  37. [38]

    G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  38. [39]

    Q. Team. Qwen3.5-Omni technical report. arXiv preprint arXiv:2604.15804, 2026

  39. [40]

    R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, Y. Chen, J. Liu, Y. Cheng, Y. Yao, J. Zhu, Y. Meng, K. Zheng, Q. Bai, J. Chen, Z. Shen, Y. Yu, X. Zhu, Y. Shen, and H. Ouyang. Advancing open-source world models, 2026

  40. [41]

    R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

  41. [42]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  42. [43]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  43. [44]

    K. Wang, P. Zhang, Z. Wang, Y. Gao, L. Li, Q. Wang, H. Chen, C. Wan, Y. Lu, Z. Yang, et al. VAGEN: Reinforcing world model reasoning for multi-turn VLM agents. arXiv preprint arXiv:2510.16907, 2025

  44. [45]

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  45. [46]

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, Z. Wang, Z. Chen, H. Zhang, G. Yang, H. Wang, Q. Wei, J. Yin, W. Li, E. Cui, G. Chen, Z. Ding, C. Tian, Z. Wu, J. Xie, Z. Li, B. Yang, Y. Duan, X. Wang, Z. Hou, H. Hao, T. Zhang, S. Li, X. Zhao, H. Duan, N. Deng, B. Fu, Y. He, Y. Wang, C. He, B. Shi, J. He, Y. Xiong, H. Lv...

  46. [47]

    X. Wang, Z. Zhu, G. Huang, B. Wang, X. Chen, and J. Lu. WorldDreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985, 2024

  47. [48]

    C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025

  48. [49]

    3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding

    H. Xiong, Y. Zhuge, J. Zhu, L. Zhang, and H. Lu. 3UR-LLM: An end-to-end multimodal large language model for 3D scene understanding. IEEE Transactions on Multimedia, 2025

  49. [50]

    J. Yang, R. Tan, Q. Wu, R. Zheng, B. Peng, Y. Liang, Y. Gu, M. Cai, S. Ye, J. Jang, et al. Magma: A foundation model for multimodal AI agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

  50. [51]

    Y. Yang, J. Liu, Z. Zhang, S. Zhou, R. Tan, J. Yang, Y. Du, and C. Gan. MindJourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508, 2025

  51. [52]

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  52. [53]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  53. [54]

    G. Zhao, X. Wang, Z. Zhu, X. Chen, G. Huang, X. Bao, and X. Wang. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10412–10420, 2025

  54. [55]

    J. Zhou, H. Gao, V. Voleti, A. Vasishta, C.-H. Yao, M. Boss, P. Torr, C. Rupprecht, and V. Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025

  55. [56]

    K. Zhou, Y. Wang, G. Chen, X. Chang, G. Beaudouin, F. Zhan, P. P. Liang, and M. Wang. PAGE-4D: Disentangled pose and geometry estimation for 4D perception. arXiv preprint arXiv:2510.17568, 2025

  56. [57]

    B. Zou, M. Cai, J. Zhang, and Y. J. Lee. VGBench: Evaluating large language models on vector graphics understanding and generation, 2024

A More Details about LingBot-World-Fast

GeoWorld-VLM uses LingBot-World-Fast as the default world-model teacher. LingBot-World-Fast is the efficient variant of LingBot-World, an open-source video-based world simulat...