arxiv: 2606.00963 · v1 · pith:CICVJ6DVnew · submitted 2026-05-31 · 💻 cs.CV · cs.CL

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Pith reviewed 2026-06-28 17:44 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords 3D reconstruction · vision-language models · spatial reasoning · domain-specific language · explicit memory · multi-view images · video spatial understanding · tool use validation

The pith

VLMs achieve more reliable spatial reasoning by running validated DSL programs over explicit 3D reconstructions instead of free-form tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models struggle with precise spatial tasks because the relevant cues in multi-view images and videos are sparse and hard to organize implicitly. It argues that turning reconstruction outputs into explicit 3D memory, augmenting that memory with object instances, and having models generate programs in a constrained DSL addresses the problem, because each program is checked for correct syntax and semantics before it runs. The paper reports that this structured access outperforms unconstrained tool use on viewpoint reasoning, directional comparison, and distance estimation. A sympathetic reader would care because the method improves reliability without retraining the underlying models and works on both image sets and monocular video.

Core claim

Reasmory builds explicit 3D memory from reconstruction models, augments it with semantically grounded object instances, and supplies a lightweight DSL so that VLMs generate programs to query objects and cameras, apply viewpoint transforms, and render observations; these programs are parsed and validated before execution, yielding 6-18 percent gains over strong baselines on multi-view and video spatial benchmarks.

What carries the argument

A lightweight Domain-Specific Language whose operations query objects and cameras, transform viewpoints, and render observations over reconstructed point clouds and instances, with parsing and validation before execution.
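
To make "parsing and validation before execution" concrete, here is a minimal sketch of what such a check could look like. The operation names (`find_object`, `get_camera`, `move_camera_to`, `render`), the one-call-per-line syntax, and the type table are all hypothetical, chosen only to mirror the query, transform, and render categories named above. The paper's actual DSL is not shown in this text.

```python
import re
from dataclasses import dataclass

# Hypothetical operation table: name -> (argument types, result type).
# Illustrative only; these are not the paper's actual DSL operations.
OPS = {
    "find_object": (("str",), "object"),
    "get_camera": (("int",), "camera"),
    "move_camera_to": (("camera", "object"), "camera"),
    "render": (("camera",), "image"),
}

# Assumed surface syntax: one call per line, written as `name = op(arg, ...)`.
LINE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\((.*)\)\s*$")


@dataclass
class Call:
    target: str
    op: str
    args: list


def parse(program: str) -> list:
    """Turn each non-empty line into a Call, or raise SyntaxError."""
    calls = []
    for lineno, line in enumerate(program.strip().splitlines(), start=1):
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise SyntaxError(f"line {lineno}: expected 'name = op(args)'")
        target, op, raw_args = match.groups()
        # Naive comma split: string literals containing commas are not supported.
        args = [a.strip() for a in raw_args.split(",") if a.strip()]
        calls.append(Call(target, op, args))
    return calls


def literal_type(token: str):
    """Return 'str' or 'int' for a literal token, else None."""
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return "str"
    if token.lstrip("-").isdigit():
        return "int"
    return None


def validate(calls: list) -> list:
    """Return error messages; an empty list means the program may be executed."""
    errors, env = [], {}
    for i, call in enumerate(calls, start=1):
        if call.op not in OPS:
            errors.append(f"step {i}: unknown operation '{call.op}'")
            continue
        arg_types, result_type = OPS[call.op]
        if len(call.args) != len(arg_types):
            errors.append(f"step {i}: '{call.op}' expects {len(arg_types)} argument(s)")
        else:
            for token, expected in zip(call.args, arg_types):
                actual = literal_type(token) or env.get(token)
                if actual != expected:
                    errors.append(f"step {i}: '{token}' is {actual}, expected {expected}")
        env[call.target] = result_type
    return errors


good = '''
chair = find_object("chair")
cam = get_camera(0)
view = move_camera_to(cam, chair)
img = render(view)
'''
bad = '''
cam = get_camera(0)
img = render(chair)
'''
print(validate(parse(good)))  # → []
print(validate(parse(bad)))   # one error: 'chair' was never defined as a camera
```

The point of the sketch is the ordering: a program that references an undefined object or passes the wrong kind of value is rejected with a message before anything touches the 3D memory, which is the failure mode free-form tool calls cannot rule out.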

If this is right

  • Explicit 3D memory accessed through validated programs outperforms free-form tool invocation on the same underlying reconstruction and VLM components.
  • Gains appear consistently across multi-view image sets and monocular video sequences on spatial reasoning benchmarks.
  • The approach works with existing reconstruction foundation models and current VLMs without additional training.
  • Constrained, validated execution reduces errors from incorrect tool calls, skipped transformations, or misused intermediate results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DSL-plus-validation pattern could be applied to other forms of explicit memory such as scene graphs or occupancy grids when spatial cues are sparse.
  • If reconstruction quality improves on real-world video, the performance gap between constrained and free-form access may widen further.
  • The method suggests a general template for making tool-using agents more reliable by replacing open-ended API calls with a checked intermediate language.

Load-bearing premise

Reconstruction models can turn sparse views into accurate explicit 3D memory and VLMs can emit DSL programs that are both syntactically valid and semantically sufficient for the needed reasoning.

What would settle it

The claim would be undermined by a controlled experiment on the same benchmarks in which VLMs emit invalid or incomplete DSL programs at high rates, or in which reconstruction quality is degraded, and the method then shows no accuracy gain, or a drop, relative to free-form baselines.

Figures

Figures reproduced from arXiv: 2606.00963 by Chieh Hubert Lin, Jixuan He, Ming-Hsuan Yang, Xueting Li.

Figure 1
Figure 1: Overview of Reasmory. Spatial evidence in multi-view images and videos is often sparse and redundant, making it important to organize evidence explicitly for VLM spatial reasoning. Reasmory addresses this by constructing explicit 3D spatial memory and constraining VLM interaction with this memory through validated DSL programs.
… tasks require perceiving relevant objects, retaining observations across time o…
Figure 2
Figure 2: Camera-transition results on MindCube. Explicit tool-use planning yields more …
Figure 3
Figure 3: Overview of Reasmory. The system constructs reconstruction-based spatial memory, augments it with grounded 3D object instances, generates and validates a DSL program, and executes the program to support spatial reasoning.
… forward pass. Using the predicted depth and camera parameters, each pixel can be back-projected into 3D. Specifically, for a pixel p = (u, v) in image V_i with depth value D_i(u, v), the co…
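
The excerpt above breaks off before giving the back-projection formula, so the following is only a sketch under the standard pinhole-camera assumption, not necessarily the paper's exact formulation. The intrinsics `fx`, `fy`, `cx`, `cy` and the camera-to-world pose `R`, `t` are assumed names. Depth is taken to be measured along the camera z-axis.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with a depth value to a world-space 3D point.

    Assumes a pinhole camera with focal lengths (fx, fy), principal point
    (cx, cy), and a camera-to-world pose given by a 3x3 rotation R (nested
    lists) and a length-3 translation t.
    """
    # Point in the camera frame: scale the normalized ray by the depth.
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    p_cam = (x_cam, y_cam, depth)
    # Rigid transform into the world frame: X_world = R * X_cam + t.
    return tuple(
        sum(R[row][k] * p_cam[k] for k in range(3)) + t[row]
        for row in range(3)
    )


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A pixel at the principal point lies on the optical axis,
# so with an identity pose it maps to (0, 0, depth).
print(backproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0, identity, [0.0, 0.0, 0.0]))
# → (0.0, 0.0, 2.0)
```

Applying this to every pixel of every view, each with its own pose, is what aggregates the images into a single point cloud in a shared world frame.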
Figure 4
Figure 4: An end-to-end reasoning example. The trajectory illustrates how verification and repair refine a DSL plan before spatial-memory execution; see Sec. 4.5 for details.
… of executing an incorrect plan, the compiler detects the mismatch against the decomposition and rejects the program with explicit feedback. After receiving the error message, the planner generates a revised program that correctly implements th…
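
The verify-and-repair behaviour described in the Figure 4 excerpt can be sketched as a small loop. Everything here is an assumption made for illustration: the function name `plan_with_repair`, the round limit, and the stub planner and checker stand in for the VLM planner and the compiler, whose real interfaces are not given in this text.

```python
from typing import Callable, List, Optional, Tuple


def plan_with_repair(
    propose: Callable[[str, List[str]], str],
    check: Callable[[str], List[str]],
    question: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[List[str]]]:
    """Request a program, re-prompting with checker feedback after each rejection.

    Returns the first accepted program (or None) and the per-round error log.
    """
    feedback: List[str] = []
    history: List[List[str]] = []
    for _ in range(max_rounds):
        program = propose(question, feedback)
        feedback = check(program)
        history.append(feedback)
        if not feedback:
            return program, history  # accepted: safe to execute
    return None, history  # never accepted within the round limit


# Stub planner (hypothetical): omits the viewpoint transform until it sees feedback.
def toy_planner(question: str, feedback: List[str]) -> str:
    return "find; transform; render" if feedback else "find; render"


# Stub checker (hypothetical): only requires that a 'transform' step is present.
def toy_checker(program: str) -> List[str]:
    steps = [s.strip() for s in program.split(";")]
    return [] if "transform" in steps else ["plan is missing a transform step"]


program, history = plan_with_repair(toy_planner, toy_checker, "What is on my left after I turn around?")
print(program)       # → find; transform; render
print(len(history))  # → 2
```

The rejected first draft is never executed. Only the error message flows back to the planner, which matches the described role of explicit feedback.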
Original abstract

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Reasmory, a framework that aggregates sparse multi-view and video observations into explicit 3D spatial memory via reconstruction VFMs (point clouds augmented with grounded object instances) and introduces a lightweight DSL to constrain VLMs to validated program execution for spatial queries, viewpoint transforms, and rendering. It claims this structured access yields consistent 6-18% gains over strong baselines including GPT-5-mini and Gemini-3-flash on multi-view image and video spatial reasoning benchmarks, attributing the improvement to constrained operations rather than free-form tool use.

Significance. If the experimental results prove robust and the reconstruction quality is demonstrably sufficient, the work would be significant for VLM spatial reasoning by providing evidence that explicit 3D memory is most effective when accessed via a validated DSL rather than unconstrained tools. The design choice to parse and validate programs before execution directly targets a known brittleness in tool-augmented VLMs.

major comments (2)
  1. [Abstract] The central claim of 6-18% gains over GPT-5-mini and Gemini-3-flash is presented without any dataset descriptions, baseline implementations, statistical tests, error bars, or ablation studies, preventing assessment of whether gains are attributable to usable 3D memory or to the DSL constraint itself.
  2. [Abstract] The assumption that reconstruction VFMs reliably produce sufficiently accurate explicit memory (point clouds + grounded instances) from sparse/redundant observations is load-bearing for the claimed gains, yet no reconstruction quality metrics, failure modes, or ablation isolating memory fidelity are supplied.
minor comments (1)
  1. The invented terms 'Reasmory' and 'lightweight DSL' appear without an explicit definition or comparison to prior DSLs for spatial or geometric reasoning.
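
One way the error bars requested in major comment 1 could be produced is a paired bootstrap over benchmark questions. The sketch below is an illustration, not the authors' evaluation code: the function name, the resample count, and the toy correctness flags are all hypothetical, and the toy data does not represent any reported result.

```python
import random


def paired_bootstrap_gain(baseline, method, n_resamples=2000, seed=0):
    """Return (observed gain, lower, upper) for accuracy(method) - accuracy(baseline).

    `baseline` and `method` are equal-length lists of 0/1 correctness flags
    for the same questions. The interval is a 95% percentile bootstrap.
    """
    assert len(baseline) == len(method) and baseline
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [m - b for b, m in zip(baseline, method)]
    observed = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Toy flags (hypothetical): 200 questions, baseline 40% correct, method 60% correct.
toy_baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 20
toy_method = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0] * 20
gain, low, high = paired_bootstrap_gain(toy_baseline, toy_method)
print(f"gain={gain:.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

Pairing matters because both systems answer the same questions. Resampling the per-question differences, rather than each system independently, keeps that correlation and usually gives a tighter interval.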

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the abstract could better support assessment of the claims. The full manuscript contains the requested experimental details in Sections 4 and 5, but we agree the abstract presentation can be improved. We address each point below.

  1. Referee: [Abstract] The central claim of 6-18% gains over GPT-5-mini and Gemini-3-flash is presented without any dataset descriptions, baseline implementations, statistical tests, error bars, or ablation studies, preventing assessment of whether gains are attributable to usable 3D memory or to the DSL constraint itself.

    Authors: The abstract is intentionally concise. The manuscript provides full details on the multi-view image and video spatial reasoning benchmarks, exact baseline implementations (including prompting strategies for GPT-5-mini and Gemini-3-flash), results with error bars and statistical tests, and ablations that separate the DSL constraint from free-form tool use and from the 3D memory itself. We will revise the abstract to briefly reference the evaluation benchmarks and note that ablations attribute gains to the validated DSL. revision: partial

  2. Referee: [Abstract] The assumption that reconstruction VFMs reliably produce sufficiently accurate explicit memory (point clouds + grounded instances) from sparse/redundant observations is load-bearing for the claimed gains, yet no reconstruction quality metrics, failure modes, or ablation isolating memory fidelity are supplied.

    Authors: We acknowledge that the current manuscript does not supply quantitative reconstruction quality metrics, explicit failure mode analysis, or an ablation isolating memory fidelity from the DSL. These elements are load-bearing and their absence limits evaluation of the framework. We will add reconstruction error metrics on the evaluation datasets, representative failure cases, and a dedicated ablation on memory fidelity in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

The paper introduces Reasmory as a constructive framework (explicit 3D memory + grounded instances + DSL-constrained program execution) and reports empirical gains of 6-18% on spatial reasoning benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim that constrained DSL access outperforms free-form tool use is tested directly via ablation-style comparisons against baselines including GPT-5-mini and Gemini-3-flash; these outcomes are not forced by definition or prior self-referential results. The derivation chain is self-contained as an engineering proposal whose value rests on external benchmark performance rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review performed on abstract only; full paper would be needed to enumerate free parameters, axioms, or invented entities with precision.

axioms (2)
  • [domain assumption] Reconstruction models produce sufficiently accurate explicit spatial memory from multi-view or video input.
    Stated as the foundation for the memory component.
  • [domain assumption] VLMs can produce valid programs in the introduced DSL.
    Required for the structured execution approach to function.
invented entities (2)
  • Reasmory framework (no independent evidence)
    purpose: Structured program execution over 3D memory
    New system proposed in the paper.
  • lightweight DSL for spatial queries (no independent evidence)
    purpose: Constrain and validate VLM interactions with 3D memory
    Introduced to replace free-form tool calls.

pith-pipeline@v0.9.1-grok · 5791 in / 1427 out tokens · 29316 ms · 2026-06-28T17:44:33.999421+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 2 canonical work pages

  1. [1]

    Qwen3-vl technical report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, An- drew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  3. [3]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean conference on computer vision, pages 333–350. Springer, 2022

  4. [4]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

  5. [5]

    TTT3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3r: 3d reconstruction as test-time training. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=aMs6FtNaY5

  6. [6]

    Think with 3d: Geo- metric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geo- metric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

  7. [7]

    Flow3r: Fac- tored flow prediction for scalable visual geometry learning

    Zhongxiao Cong, Qitao Zhao, Minsik Jeon, and Shubham Tulsiani. Flow3r: Fac- tored flow prediction for scalable visual geometry learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URLhttps://openaccess.thecvf.com/content/CVPR2026/html/ Cong_Flow3r_Factored_Flow_Prediction_for_Scalable_Visual_ Geometry_Learni...

  8. [8]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  9. [9]

    ISBN 9781450383912

    Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B. Tenenbaum. Dream- coder: bootstrapping inductive program synthesis with wake-sleep library learning. InProceedings of the 42nd ACM SIGPLAN International Conference on Program- ming Language Design and Implementation, PLD...

  10. [10]

    Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

  11. [11]

    Video-r1: Reinforcing video reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2026. URLhttps://openreview.net/forum? id=a2JTVVvcEl

  12. [12]

    Pearson Education, 2010

    Martin Fowler.Domain-Specific Languages, Portable Documents. Pearson Education, 2010

  13. [13]

    Pal: Program-aided language models.arXiv preprint arXiv:2211.10435, 2022

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models.arXiv preprint arXiv:2211.10435, 2022

  14. [14]

    Pursuing minimal sufficiency in spatial reasoning

    Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, and Ming-Hsuan Yang. Pursuing minimal sufficiency in spatial reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=bZAKJwyn1n

  15. [15]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

  16. [16]

    Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

    Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, and Yong Li. Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

  17. [17]

    G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

    Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

  18. [18]

    Mllms need 3d-aware represen- tation supervision for scene understanding.CoRR, abs/2506.01946, June 2025

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware represen- tation supervision for scene understanding.CoRR, abs/2506.01946, June 2025. URL https://doi.org/10.48550/arXiv.2506.01946

  19. [19]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4): 139–1, 2023

  20. [20]

    Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 18HE, LI, LIN AND Y ANG: REASMORY

  21. [21]

    Spatialladder: Pro- gressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Pro- gressive training for spatial reasoning in vision-language models. InThe Four- teenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=KtrFXlvgrK

  22. [22]

    Vmem: Consistent interac- tive video scene generation with surfel-indexed view memory

    Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interac- tive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025

  23. [23]

    Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. InThe Four- teenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=yirunib8l8

  24. [24]

    Msnav: Zero-shot vision-and-language navigation with dynamic mem- ory and llm spatial reasoning

    Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, and Huiling Duan. Msnav: Zero-shot vision-and-language navigation with dynamic mem- ory and llm spatial reasoning. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 20112–20116. IEEE, 2026

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  26. [26]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  27. [27]

    Dynamem: Online dynamic spatio-semantic memory for open world mobile manipulation

    Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muham- mad Mahi Shafiullah, and Lerrel Pinto. Dynamem: Online dynamic spatio-semantic memory for open world mobile manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13346–13355. IEEE, 2025

  28. [28]

    pyspatial: Generating 3d visual programs for zero-shot spatial reasoning.arXiv preprint arXiv:2603.00905, 2026

    Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, and Yaqi Xie. pyspatial: Generating 3d visual programs for zero-shot spatial reasoning.arXiv preprint arXiv:2603.00905, 2026

  29. [29]

    Nerf in the wild: Neural radiance fields for uncon- strained photo collections

    Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for uncon- strained photo collections. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021

  30. [30]

    When and how to develop domain-specific languages.ACM computing surveys (CSUR), 37(4):316–344, 2005

    Marjan Mernik, Jan Heering, and Anthony M Sloane. When and how to develop domain-specific languages.ACM computing surveys (CSUR), 37(4):316–344, 2005

  31. [31]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ra- mamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  32. [32]

    Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graph- ics (TOG), 41(4):1–15, 2022

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graph- ics (TOG), 41(4):1–15, 2022. HE, LI, LIN AND Y ANG: REASMORY19

  33. [33]

    Oxford university press, 1978

    John O’keefe and Lynn Nadel.The hippocampus as a cognitive map. Oxford university press, 1978

  34. [34]

    Single unit activity in the rat hippocampus during a spatial memory task.Experimental brain research, 68(1):1–27, 1987

    John O’Keefe and Andrew Speakman. Single unit activity in the rat hippocampus during a spatial memory task.Experimental brain research, 68(1):1–27, 1987

  35. [35]

    Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

  36. [36]

    Long-context state-space video world models

    Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 8733–8744, 2025

  37. [37]

    Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

    Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PmLR, 2021

  39. [39]

    Vqasynth, 2024

    remyxai. Vqasynth, 2024. URLhttps://github.com/remyxai/VQASynth/ tree/main. GitHub repository

  40. [40]

    Statespacediffuser: Bringing long context to diffusion world models

    Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, and Luc Van Gool. Statespacediffuser: Bringing long context to diffusion world models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=g52NwTQj0Q

  41. [41]

    Toolformer: Lan- guage models can teach themselves to use tools.Advances in neural information pro- cessing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Lan- guage models can teach themselves to use tools.Advances in neural information pro- cessing systems, 36:68539–68551, 2023

  42. [42]

    The development of spatial representations of large-scale environments.Advances in child development and behavior, 10:9–55, 1975

    Alexander W Siegel and Sheldon H White. The development of spatial representations of large-scale environments.Advances in child development and behavior, 10:9–55, 1975

  43. [43]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

  44. [44]

    Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

    Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

  45. [45]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025. 20HE, LI, LIN AND Y ANG: REASMORY

  46. [46]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  47. [47]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  48. [48]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  49. [49]

    $\pi^3$: Permutation- equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation- equivariant visual geometry learning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=DTQIjngDta

  50. [50]

    Spatial-MLLM: Boost- ing MLLM capabilities in visual-based spatial intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boost- ing MLLM capabilities in visual-based spatial intelligence. InThe Thirty-ninth An- nual Conference on Neural Information Processing Systems, 2026. URLhttps: //openreview.net/forum?id=RnXS7aK4rK

  51. [51]

    Point3r: Streaming 3d recon- struction with explicit spatial pointer memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d recon- struction with explicit spatial pointer memory. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview. net/forum?id=yk1iqV9Etr

  52. [52]

    Worldmem: Long-term consistent world simulation with memory

    Zeqi Xiao, Yushi LAN, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=c6CAVKlKmU

  53. [53]

    Maskclustering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation

    Mi Yan, Jiazhao Zhang, Yan Zhu, and He Wang. Maskclustering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  54. [54]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  55. [55]

    Mindjourney: Test-time scaling with world mod- els for spatial reasoning

    Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. Mindjourney: Test-time scaling with world mod- els for spatial reasoning. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2026. URLhttps://openreview.net/forum?id= L2W4wQsNkY

  56. [56]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. HE, LI, LIN AND Y ANG: REASMORY21

  57. [57]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025

  58. [58]

    Instainpaint: Instant 3d-scene inpainting with masked large reconstruction model

    Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, and Ming-Hsuan Yang. Instainpaint: Instant 3d-scene inpainting with masked large reconstruction model. In Adv. Neural Inform. Process. Syst., 2025

  59. [59]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

  60. [60]

    Boosting mllm spatial reasoning with geometrically referenced 3d scene representations

    Jiangye Yuan, Gowri Kumar, and Baoyuan Wang. Boosting mllm spatial reasoning with geometrically referenced 3d scene representations. arXiv preprint arXiv:2603.08592, 2026

  61. [61]

    3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding

    Tatiana Zemskova and Dmitry Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8885–8895, 2025

  62. [62]

    Loger: Long-context geometric reconstruction with hybrid memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. arXiv preprint arXiv:2603.03269, 2026

  63. [63]

    Constructing coherent spatial memory in llm agents through graph rectification

    Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, and Liqiu Meng. Constructing coherent spatial memory in llm agents through graph rectification. arXiv preprint arXiv:2510.04195, 2025

  64. [64]

    Test-time training done right

    Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-time training done right. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Tb9qAxT3xv

  65. [65]

    Think3d: Thinking with space for spatial reasoning

    Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029, 2026

  66. [66]

    Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models

    Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025

  67. [67]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8600–8612, 2025

  68. [68]

    Learning 3d persistent embodied world models

    Siyuan Zhou, Yilun Du, Yuncong Yang, Lei Han, Peihao Chen, Dit-Yan Yeung, and Chuang Gan. Learning 3d persistent embodied world models. arXiv preprint arXiv:2505.05495, 2025.

  69. [69]

    Streaming visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming visual geometry transformer. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=5APgTKsnx8

  70. [70]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023