GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Kaichen Zhou; Mengyu Wang; Renjie Gu; Yan Luo

arxiv: 2605.16713 · v1 · pith:FBJ5XBK3new · submitted 2026-05-15 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-20 17:51 UTC · model grok-4.3


keywords vision-language models · spatial reasoning · world models · feature distillation · geometric alignment · 3D cues · visual pathway


The pith

GeoWorld-VLM improves spatial reasoning in VLMs by aligning features with frozen world models without changing the language backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often fail at simple spatial relations like left, right, or behind because their visual features lose important 3D information during extraction. The paper introduces GeoWorld-VLM as a way to fix this by distilling geometric structure from pre-trained camera-conditioned video world models. It does so by fine-tuning only the image encoder and the projector to align features with the world model's intermediate representations, while keeping the language model frozen. A combination of answer supervision, feature matching, and a preservation term ensures the original capabilities stay intact. Tests on two different VLM architectures show roughly 4 percent better results on the What'sUp and VSR benchmarks.

Core claim

GeoWorld-VLM transfers geometric structure from frozen camera-conditioned video world models into VLMs by aligning post-projector image features with the world model's intermediate representations. Given an image, prompt, and sampled camera trajectory, the world model provides a synthetic multi-view spatial signal. Training uses spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM, with only the encoder and projector updated.
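
The three training terms named above can be read as one weighted objective. The sketch below is a pure-Python illustration under stated assumptions, not the authors' implementation: this summary does not give the loss forms or weights, so the cross-entropy answer term, the cosine-distance alignment term, the squared-error preservation term, and the weights `w_align` and `w_preserve` are all hypothetical choices.

```python
import math

def answer_loss(logits, target_index):
    """Spatial answer supervision: cross-entropy over candidate answers."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]

def alignment_loss(student, teacher):
    """Teacher-student alignment (assumed form): 1 - cosine similarity."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)

def preservation_loss(student, original):
    """Preservation anchor (assumed form): mean squared distance to the
    features of the original, unmodified VLM."""
    return sum((s - o) ** 2 for s, o in zip(student, original)) / len(student)

def total_loss(logits, target_index, student, teacher, original,
               w_align=1.0, w_preserve=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (answer_loss(logits, target_index)
            + w_align * alignment_loss(student, teacher)
            + w_preserve * preservation_loss(student, original))

# Toy vectors stand in for post-projector features; teacher features are
# assumed to be already mapped to the same dimension as the student's.
loss = total_loss([2.0, 0.5], 0, [1.0, 0.0], [0.8, 0.6], [1.0, 0.1])
print(f"total loss: {loss:.4f}")
```

In an actual training run only the encoder and projector parameters would receive gradients from this objective, while the language model stays frozen, as the paragraph above describes.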

What carries the argument

Feature alignment between post-projector VLM features and intermediate representations of the frozen world model, which converts a static image into a multi-view spatial signal via a sampled camera trajectory.

If this is right

Improved spatial judgments result from enhanced visual representations rather than changes to language processing.
The method works across different VLM architectures, showing generality.
Original linguistic capabilities are preserved because the language model stays frozen.
Performance gains appear on multiple spatial reasoning datasets like What'sUp and VSR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

World models could be used as geometric teachers for other tasks requiring 3D understanding beyond spatial relations in images.
Combining this with more advanced or larger world models might lead to further improvements in VLM spatial performance.
This suggests a broader strategy of using specialized models to inject missing structures into general-purpose models without full retraining.

Load-bearing premise

The intermediate representations of the frozen world models contain geometric structure, and transferring that structure to the VLM is what directly causes its better spatial reasoning.

What would settle it

If removing the feature alignment with world-model representations, while keeping the other training components, left the gains on the spatial benchmarks intact, the central claim would be falsified.
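
One way to summarize such an ablation is the share of the total gain that disappears when the alignment term is removed. The helper below is a minimal sketch; the function name `alignment_share` and all accuracy values are hypothetical and are not taken from the paper.

```python
def alignment_share(acc_base, acc_no_align, acc_full):
    """Fraction of the total gain that disappears when alignment is removed.

    acc_base:     accuracy of the original VLM
    acc_no_align: accuracy with answer supervision + preservation only
    acc_full:     accuracy with all three training terms
    """
    total_gain = acc_full - acc_base
    if total_gain <= 0:
        raise ValueError("full model shows no gain over the baseline")
    return (acc_full - acc_no_align) / total_gain

# Hypothetical accuracies (percent), NOT taken from the paper.
share = alignment_share(acc_base=60.0, acc_no_align=63.0, acc_full=64.0)
print(f"share of gain attributable to alignment: {share:.2f}")
```

A share near 0 would point toward generic adaptation from the spatial loss, while a share near 1 would be consistent with the geometric-transfer reading. Either conclusion would still need multiple seeds to be reliable.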

Figures

Figures reproduced from arXiv: 2605.16713 by Kaichen Zhou, Mengyu Wang, Renjie Gu, Yan Luo.

**Figure 1.** Overview. Given an input image and a spatial reasoning question, GeoWorld-VLM enhances the spatial understanding of standard vision-language models by injecting world-model priors at the feature-map level. Compared with the original VLM features, GeoWorld-VLM produces more geometry-aware representations, leading to clearer spatial grounding and improved answer accuracy. As shown on the right, our method…

**Figure 2.** GeoWorld-VLM. During training, GeoWorld-VLM fine-tunes only the vision blocks including vision encoder and multimodal projector. It aligns the latent features produced by the VLM vision encoder with intermediate world-model representations, where the world model takes the input image, text prompt, and randomly sampled camera poses as input. At inference time, GeoWorld-VLM no longer requires the world model…

**Figure 3.** Qualitative comparison. We compare the predictions of the original Gemma4, task-only fine-tuned Gemma4, and GeoWorld-VLM on representative spatial reasoning examples. GeoWorld-VLM produces more spatially consistent answers than the baselines, illustrating the benefit of injecting camera-conditioned world-model supervision into the VLM visual pathway.

**Figure 4.** Overall comparison on the What’sUp+VSR suite. GeoWorld-VLM improves spatial reasoning performance across diverse sub-benchmarks, showing consistent gains over the original VLM, task-only fine-tuning, and DINOv3-based static feature distillation.

Original abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets modest gains on spatial benchmarks by aligning VLM visual features to frozen world-model intermediates, but the gains could come from extra task training rather than geometry transfer.


The core idea is to improve spatial reasoning in VLMs by distilling multi-view geometric signals from camera-conditioned world models into just the image encoder and projector. They keep the language model frozen and add a preservation term so linguistic performance does not drop. The method produces roughly 4% better results on What'sUp and VSR across two different VLM backbones, which suggests the approach is not tied to one architecture.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GeoWorld-VLM, a VLM-side distillation method that transfers geometric structure from frozen camera-conditioned video world models into the image encoder and multimodal projector of VLMs. Only these visual components are fine-tuned via a combination of spatial answer supervision, teacher-student feature alignment to world-model intermediates, and a preservation anchor to the original VLM, while the language model remains frozen. The authors apply the approach to two distinct VLM architectures and report consistent ~4% gains on the What'sUp and VSR spatial reasoning benchmarks.

Significance. If the gains are shown to arise specifically from transferable 3D structure in the world-model representations rather than incidental effects of spatial fine-tuning, the framework could provide an efficient route to bolstering spatial capabilities in existing VLMs without retraining the language backbone or sacrificing semantic performance. The consistency across two backbones and the use of external frozen world models as geometric teachers are notable strengths of the empirical design.

major comments (2)

[Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.
[Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.

minor comments (1)

[Abstract] Abstract: the phrase 'approximately 4 percent' should be accompanied by the precise metric (e.g., accuracy delta) and the identity of the baseline model for each benchmark.
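
The ambiguity behind this comment is easy to see with numbers: "4 percent" can mean an absolute change in accuracy (percentage points) or a change relative to the baseline. The following sketch uses hypothetical accuracies, not values from the paper, to show how far the two readings can diverge.

```python
def absolute_gain(baseline_acc, new_acc):
    """Accuracy delta in percentage points."""
    return new_acc - baseline_acc

def relative_gain(baseline_acc, new_acc):
    """Accuracy delta as a percentage of the baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc

# Hypothetical baseline of 50% accuracy, NOT taken from the paper.
baseline = 50.0
for new in (52.0, 54.0):
    print(f"{baseline} -> {new}: "
          f"{absolute_gain(baseline, new):.1f} points absolute, "
          f"{relative_gain(baseline, new):.1f}% relative")
```

With a 50% baseline, a 4% relative gain is only 2 points, while a 4-point absolute gain is 8% relative, which is why the metric and the baseline both need to be stated.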

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the suggested ablation and additional methodological details will strengthen the paper and will incorporate them in the revised manuscript.


Referee: [Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.

Authors: We agree that this ablation is important to isolate the contribution of geometric transfer. In the revised manuscript we will add a control experiment that retains spatial answer supervision and the preservation anchor but removes the feature-alignment term to world-model intermediates. Performance of this variant will be reported alongside the full GeoWorld-VLM results on both benchmarks and both architectures to demonstrate that the observed gains require the world-model alignment. revision: yes

Referee: [Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.

Authors: We will expand the Method and Evaluation sections to address these points. Spatial answers are taken directly from the ground-truth annotations of the What'sUp and VSR datasets; we will describe the exact prompt templates and answer formats used. Baselines consist of the two unmodified original VLM architectures plus relevant prior spatial-reasoning methods. We will also report statistical significance by including standard deviations across multiple random seeds and, where appropriate, paired statistical tests to support the reliability of the reported improvements. revision: yes
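
The seed-level reporting promised in this response could look like the sketch below, which computes the mean and sample standard deviation of per-seed differences and a paired t statistic using only the Python standard library. The function name and the per-seed accuracies are hypothetical; the rebuttal does not say which test the authors will use.

```python
import math
import statistics

def paired_t_statistic(baseline_scores, method_scores):
    """Return (mean difference, sample std of differences, t statistic)
    for scores paired by random seed."""
    if len(baseline_scores) != len(method_scores):
        raise ValueError("scores must be paired seed by seed")
    if len(baseline_scores) < 2:
        raise ValueError("need at least two seeds")
    diffs = [m - b for b, m in zip(baseline_scores, method_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat

# Hypothetical per-seed accuracies (percent), NOT taken from the paper.
baseline = [60.0, 61.0, 62.0]
method = [64.0, 64.0, 67.0]
mean_diff, sd_diff, t_stat = paired_t_statistic(baseline, method)
print(f"mean gain {mean_diff:.2f} ± {sd_diff:.2f}, t = {t_stat:.3f}")
```

The t statistic would then be compared against a critical value with `len(baseline) - 1` degrees of freedom. With only a handful of seeds the test has little power, so the standard deviation is worth reporting alongside it.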

Circularity Check

0 steps flagged

No derivation chain or self-referential reductions present


The paper presents an empirical VLM fine-tuning framework that aligns image features with intermediate representations from externally frozen camera-conditioned world models. No equations, derivations, fitted parameters, or predictions are defined or claimed anywhere in the abstract or method description. Improvements on What'sUp and VSR are reported as observed benchmark outcomes after training, not as quantities derived from the method's own inputs by construction. The approach relies on external pre-trained models and standard supervision signals rather than any self-citation chain, ansatz smuggling, or renaming of known results, rendering the reported gains independent of internal circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that world-model intermediate features encode usable 3D geometry for static images; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Camera-conditioned video world models produce intermediate representations that encode transferable 3D spatial structure from single images plus a sampled trajectory.
This premise is required for the teacher-student alignment step to supply useful geometric information.

pith-pipeline@v0.9.0 · 5790 in / 1257 out tokens · 82122 ms · 2026-05-20T17:51:39.800934+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 22 internal anchors

[1] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.
[2] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic, 2024.
[3] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[4] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[5] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024.
[6] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
[7] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
[8] S. Chen, T. Zhu, R. Zhou, J. Zhang, S. Gao, J. C. Niebles, M. Geva, J. He, J. Wu, and M. Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas, 2025.
[9] X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations (ICLR), 2023.
[10] J. H. Cho, B. Ivanovic, Y. Cao, E. Schmerling, Y. Wang, X. Weng, B. Li, Y. You, P. Krähenbühl, Y. Wang, et al. Language-image models with 3d understanding. arXiv preprint arXiv:2405.03685, 2024.
[11] W. Chow, J. Mao, B. Li, D. Seita, V. Guizilini, and Y. Wang. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding. arXiv preprint arXiv:2501.16411, 2025.
[12] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[13] M. Du, B. Wu, Z. Li, X. Huang, and Z. Wei. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 2: Short Papers, 2024.
[14] X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision (ECCV), 2024.
[15] Gemini Team, Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[16] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
[17] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.
[19] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[20] A. Kamath, J. Hessel, and K.-W. Chang. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785, 2023.
[21] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V. Birodkar, J. Yan, M.-C. Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
[22] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[23] P. Y. Lee, J. Je, C. Park, M. A. Uy, L. Guibas, and M. Sung. Perspective-aware reasoning in vision-language models via mental imagery simulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9241–9251, 2025.
[24] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), 2023.
[25] B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.
[26] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft coco: Common objects in context, 2015.
[27] F. Liu, G. Emerson, and N. Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.
[28] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[29] Y. Liu, F. Zhan, K. Zhou, Y. Du, P. P. Liang, and H. Pfister. Abstract 3d perception for spatial intelligence in vision-language models. arXiv preprint arXiv:2511.10946, 2025.
[30] G. Luo, G. Yang, Z. Gong, G. Chen, H. Duan, E. Cui, R. Tong, Z. Hou, T. Zhang, Z. Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025.
[31] OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2026-05-04.
[32] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[33] Qwen Team. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[35] A. Ray, J. Duan, E. Brown, R. Tan, D. Bashkirova, R. Hendrix, K. Ehsani, A. Kembhavi, B. A. Plummer, R. Krishna, K.-H. Zeng, and K. Saenko. SAT: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.
[36] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
[37] Y. Tang, X. Han, X. Li, Q. Yu, Y. Hao, L. Hu, and M. Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024.
[38] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[39] Q. Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026.
[40] R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, Y. Chen, J. Liu, Y. Cheng, Y. Yao, J. Zhu, Y. Meng, K. Zheng, Q. Bai, J. Chen, Z. Shen, Y. Yu, X. Zhu, Y. Shen, and H. Ouyang. Advancing open-source world models, 2026.
[41] R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026.
[42] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[43] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[44] K. Wang, P. Zhang, Z. Wang, Y. Gao, L. Li, Q. Wang, H. Chen, C. Wan, Y. Lu, Z. Yang, et al. Vagen: Reinforcing world model reasoning for multi-turn vlm agents. arXiv preprint arXiv:2510.16907, 2025.
[45] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[46] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, Z. Wang, Z. Chen, H. Zhang, G. Yang, H. Wang, Q. Wei, J. Yin, W. Li, E. Cui, G. Chen, Z. Ding, C. Tian, Z. Wu, J. Xie, Z. Li, B. Yang, Y. Duan, X. Wang, Z. Hou, H. Hao, T. Zhang, S. Li, X. Zhao, H. Duan, N. Deng, B. Fu, Y. He, Y. Wang, C. He, B. Shi, J. He, Y. Xiong, H. Lv...
[47] X. Wang, Z. Zhu, G. Huang, B. Wang, X. Chen, and J. Lu. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985, 2024.
[48] C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.
[49] H. Xiong, Y. Zhuge, J. Zhu, L. Zhang, and H. Lu. 3ur-llm: An end-to-end multimodal large language model for 3d scene understanding. IEEE Transactions on Multimedia, 2025.
[50] J. Yang, R. Tan, Q. Wu, R. Zheng, B. Peng, Y. Liang, Y. Gu, M. Cai, S. Ye, J. Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025.
[51] Y. Yang, J. Liu, Z. Zhang, S. Zhou, R. Tan, J. Yang, Y. Du, and C. Gan. Mindjourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508, 2025.
[52] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[53] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.
[54] G. Zhao, X. Wang, Z. Zhu, X. Chen, G. Huang, X. Bao, and X. Wang. Drivedreamer-2: Llm-enhanced world models for diverse driving video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10412–10420, 2025.
[55] J. Zhou, H. Gao, V. Voleti, A. Vasishta, C.-H. Yao, M. Boss, P. Torr, C. Rupprecht, and V. Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025.
[56] K. Zhou, Y. Wang, G. Chen, X. Chang, G. Beaudouin, F. Zhan, P. P. Liang, and M. Wang. Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv preprint arXiv:2510.17568, 2025.
[57] B. Zou, M. Cai, J. Zhang, and Y. J. Lee. Vgbench: Evaluating large language models on vector graphics understanding and generation, 2024.

A More Details about LingBot-World-Fast

GeoWorld-VLM uses LingBot-World-Fast as the default world-model teacher. LingBot-World-Fast is the efficient variant of LingBot-World, an open-source video-based world simulat...

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y . Chen, Y . Cui, Y . Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

The Claude 3 model family: Opus, Sonnet, Haiku

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic, 2024

work page 2024

[3] [3]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A ver- satile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

PaliGemma: A versatile 3B VLM for transfer

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Bruce, M

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

work page 2024

[7] [7]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the International Conference on Computer Vision (ICCV), 2021

work page 2021

[8] [8]

S. Chen, T. Zhu, R. Zhou, J. Zhang, S. Gao, J. C. Niebles, M. Geva, J. He, J. Wu, and M. Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas, 2025

work page 2025

[9] [9]

X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations (ICLR), 2023

[10] [10]

J. H. Cho, B. Ivanovic, Y. Cao, E. Schmerling, Y. Wang, X. Weng, B. Li, Y. You, P. Krähenbühl, Y. Wang, et al. Language-image models with 3D understanding. arXiv preprint arXiv:2405.03685, 2024

[11] [11]

W. Chow, J. Mao, B. Li, D. Seita, V. Guizilini, and Y. Wang. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding. arXiv preprint arXiv:2501.16411, 2025

[12] [12]

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[13] [13]

M. Du, B. Wu, Z. Li, X. Huang, and Z. Wei. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 2: Short Papers, 2024

[14] [14]

X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision (ECCV), 2024

[15] [15]

Gemini Team, Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

[16] [16]

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

[17] [17]

Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan. 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023

[18] [19]

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

[19] [20]

A. Kamath, J. Hessel, and K.-W. Chang. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785, 2023

[20] [21]

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V. Birodkar, J. Yan, M.-C. Chiu, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

[21] [22]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

[22] [23]

P. Y. Lee, J. Je, C. Park, M. A. Uy, L. Guibas, and M. Sung. Perspective-aware reasoning in vision-language models via mental imagery simulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9241–9251, 2025

[23] [24]

J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), 2023

[24] [25]

B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

[25] [26]

T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context, 2015

[26] [27]

F. Liu, G. Emerson, and N. Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[27] [28]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[28] [29]

Y. Liu, F. Zhan, K. Zhou, Y. Du, P. P. Liang, and H. Pfister. Abstract 3D perception for spatial intelligence in vision-language models. arXiv preprint arXiv:2511.10946, 2025

[29] [30]

G. Luo, G. Yang, Z. Gong, G. Chen, H. Duan, E. Cui, R. Tong, Z. Hou, T. Zhang, Z. Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025

[30] [31]

OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2026-05-04

[31] [32]

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[32] [33]

Qwen Team. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

[33] [34]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[34] [35]

A. Ray, J. Duan, E. Brown, R. Tan, D. Bashkirova, R. Hendrix, K. Ehsani, A. Kembhavi, B. A. Plummer, R. Krishna, K.-H. Zeng, and K. Saenko. SAT: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

[35] [36]

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

[36] [37]

Y. Tang, X. Han, X. Li, Q. Yu, Y. Hao, L. Hu, and M. Chen. MiniGPT-3D: Efficiently aligning 3D point clouds with large language models using 2D priors. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6617–6626, 2024

[37] [38]

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

[38] [39]

Q. Team. Qwen3.5-Omni technical report. arXiv preprint arXiv:2604.15804, 2026

[39] [40]

R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, Y. Chen, J. Liu, Y. Cheng, Y. Yao, J. Zhu, Y. Meng, K. Zheng, Q. Bai, J. Chen, Z. Shen, Y. Yu, X. Zhu, Y. Shen, and H. Ouyang. Advancing open-source world models, 2026

[40] [41]

R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

[41] [42]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[42] [43]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

[43] [44]

K. Wang, P. Zhang, Z. Wang, Y. Gao, L. Li, Q. Wang, H. Chen, C. Wan, Y. Lu, Z. Yang, et al. VAGEN: Reinforcing world model reasoning for multi-turn VLM agents. arXiv preprint arXiv:2510.16907, 2025

[44] [45]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

[45] [46]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, Z. Wang, Z. Chen, H. Zhang, G. Yang, H. Wang, Q. Wei, J. Yin, W. Li, E. Cui, G. Chen, Z. Ding, C. Tian, Z. Wu, J. Xie, Z. Li, B. Yang, Y. Duan, X. Wang, Z. Hou, H. Hao, T. Zhang, S. Li, X. Zhao, H. Duan, N. Deng, B. Fu, Y. He, Y. Wang, C. He, B. Shi, J. He, Y. Xiong, H. Lv...

[46] [47]

X. Wang, Z. Zhu, G. Huang, B. Wang, X. Chen, and J. Lu. WorldDreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985, 2024

[47] [48]

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025

[48] [49]

H. Xiong, Y. Zhuge, J. Zhu, L. Zhang, and H. Lu. 3UR-LLM: An end-to-end multimodal large language model for 3D scene understanding. IEEE Transactions on Multimedia, 2025

[49] [50]

J. Yang, R. Tan, Q. Wu, R. Zheng, B. Peng, Y. Liang, Y. Gu, M. Cai, S. Ye, J. Jang, et al. Magma: A foundation model for multimodal AI agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

[50] [51]

Y. Yang, J. Liu, Z. Zhang, S. Zhou, R. Tan, J. Yang, Y. Du, and C. Gan. MindJourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508, 2025

[51] [52]

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[52] [53]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

[53] [54]

G. Zhao, X. Wang, Z. Zhu, X. Chen, G. Huang, X. Bao, and X. Wang. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10412–10420, 2025

[54] [55]

J. Zhou, H. Gao, V. Voleti, A. Vasishta, C.-H. Yao, M. Boss, P. Torr, C. Rupprecht, and V. Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025

[55] [56]

K. Zhou, Y. Wang, G. Chen, X. Chang, G. Beaudouin, F. Zhan, P. P. Liang, and M. Wang. PAGE-4D: Disentangled pose and geometry estimation for 4D perception. arXiv preprint arXiv:2510.17568, 2025

[56] [57]

B. Zou, M. Cai, J. Zhang, and Y. J. Lee. VGBench: Evaluating large language models on vector graphics understanding and generation, 2024

A More Details about LingBot-World-Fast

GeoWorld-VLM uses LingBot-World-Fast as the default world-model teacher. LingBot-World-Fast is the efficient variant of LingBot-World, an open-source video-based world simulat...