Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Chieh Hubert Lin; Jixuan He; Ming-Hsuan Yang; Xueting Li

arxiv: 2606.00963 · v1 · pith:CICVJ6DVnew · submitted 2026-05-31 · 💻 cs.CV · cs.CL

Pith reviewed 2026-06-28 17:44 UTC · model grok-4.3

keywords 3D reconstruction · vision-language models · spatial reasoning · domain-specific language · explicit memory · multi-view images · video spatial understanding · tool use validation

The pith

VLMs achieve more reliable spatial reasoning by running validated DSL programs over explicit 3D reconstructions instead of free-form tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models struggle with precise spatial tasks because cues in multi-view images and videos are sparse and hard to organize implicitly. It shows that turning reconstruction outputs into explicit 3D memory, augmented with object instances, and letting models generate programs in a constrained DSL addresses the problem, because each program's syntax and semantics are checked before execution. This structured access outperforms unconstrained tool use on viewpoint reasoning, directional comparison, and distance estimation. A sympathetic reader would care because the method improves reliability without retraining the underlying models and works on both image sets and monocular video.

Core claim

Reasmory builds explicit 3D memory from reconstruction models, augments it with semantically grounded object instances, and supplies a lightweight DSL so that VLMs generate programs to query objects and cameras, apply viewpoint transforms, and render observations; these programs are parsed and validated before execution, yielding 6-18 percent gains over strong baselines on multi-view and video spatial benchmarks.

What carries the argument

A lightweight Domain-Specific Language whose operations query objects and cameras, transform viewpoints, and render observations over reconstructed point clouds and instances, with parsing and validation before execution.
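To make the parse-then-validate idea concrete, here is a minimal Python sketch. The paper's actual grammar and operation set are not given in this text, so the operation names (`query_object`, `query_camera`, `to_view`, `render`), their type signatures, and the `var = op(args)` line syntax are all assumptions made for illustration. The sketch only shows how a checked intermediate language can reject a malformed program before anything executes.

```python
import re

# Hypothetical op signatures: name -> (argument types, result type).
# Illustrative only; these are not the paper's actual DSL operations.
SIGNATURES = {
    "query_object": (("str",), "object"),
    "query_camera": (("int",), "camera"),
    "to_view": (("object", "camera"), "point"),
    "render": (("camera",), "image"),
}

LINE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)$")


def parse(program):
    """Turn 'var = op(args)' lines into (var, op, [args]) tuples."""
    steps = []
    for n, raw in enumerate(program.strip().splitlines(), start=1):
        m = LINE.match(raw.strip())
        if m is None:
            raise SyntaxError(f"line {n}: expected 'var = op(args)'")
        var, op, arg_text = m.groups()
        # Naive comma split: string literals containing commas are not supported.
        args = [a.strip() for a in arg_text.split(",") if a.strip()]
        steps.append((var, op, args))
    return steps


def arg_type(arg, env):
    """Type of a literal or of a previously bound variable (None if unbound)."""
    if re.fullmatch(r'"[^"]*"', arg):
        return "str"
    if re.fullmatch(r"\d+", arg):
        return "int"
    return env.get(arg)


def validate(steps):
    """Return a list of error strings; an empty list means the program may run."""
    errors, env = [], {}
    for n, (var, op, args) in enumerate(steps, start=1):
        if op not in SIGNATURES:
            errors.append(f"step {n}: unknown op '{op}'")
            continue
        expected, result = SIGNATURES[op]
        actual = tuple(arg_type(a, env) for a in args)
        if actual != expected:
            errors.append(f"step {n}: {op} expects {expected}, got {actual}")
        env[var] = result  # bind anyway so one mistake does not cascade
    return errors


GOOD = '''
chair = query_object("chair")
cam = query_camera(0)
p = to_view(chair, cam)
'''

# Uses `cam` without ever querying a camera.
BAD = '''
chair = query_object("chair")
p = to_view(chair, cam)
'''
```

Calling `validate(parse(GOOD))` returns an empty list, while `validate(parse(BAD))` reports a single type error at step 2, since `cam` was never bound. A free-form tool call would only surface that mistake at run time, if at all.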

If this is right

Explicit 3D memory accessed through validated programs outperforms free-form tool invocation on the same underlying reconstruction and VLM components.
Gains appear consistently across multi-view image sets and monocular video sequences on spatial reasoning benchmarks.
The approach works with existing reconstruction foundation models and current VLMs without additional training.
Constrained, validated execution reduces errors from incorrect tool calls, skipped transformations, or misused intermediate results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same DSL-plus-validation pattern could be applied to other forms of explicit memory such as scene graphs or occupancy grids when spatial cues are sparse.
If reconstruction quality improves on real-world video, the performance gap between constrained and free-form access may widen further.
The method suggests a general template for making tool-using agents more reliable by replacing open-ended API calls with a checked intermediate language.

Load-bearing premise

Reconstruction models can turn sparse views into accurate explicit 3D memory and VLMs can emit DSL programs that are both syntactically valid and semantically sufficient for the needed reasoning.

What would settle it

A controlled experiment on the same benchmarks in which VLMs emit invalid or incomplete DSL programs at high rates, or in which reconstruction quality is degraded, and the method then shows no accuracy gain, or a drop, relative to free-form baselines.

Figures

Figures reproduced from arXiv: 2606.00963 by Chieh Hubert Lin, Jixuan He, Ming-Hsuan Yang, Xueting Li.

**Figure 1.** Overview of Reasmory. Spatial evidence in multi-view images and videos is often sparse and redundant, making it important to organize evidence explicitly for VLM spatial reasoning. Reasmory addresses this by constructing explicit 3D spatial memory and constraining VLM interaction with this memory through validated DSL programs.

… tasks require perceiving relevant objects, retaining observations across time o…

**Figure 2.** Camera-transition results on MindCube. Explicit tool-use planning yields more …

**Figure 3.** Overview of Reasmory. The system constructs reconstruction-based spatial memory, augments it with grounded 3D object instances, generates and validates a DSL program, and executes the program to support spatial reasoning.

… forward pass. Using the predicted depth and camera parameters, each pixel can be backprojected into 3D. Specifically, for a pixel p = (u, v) in image V_i with depth value D_i(u, v), the co…
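The back-projection step mentioned in the text accompanying Figure 3 is cut off before the formula appears, so the following is a sketch under stated assumptions rather than the paper's equation. It assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a depth value measured along the optical axis, and a camera-to-world pose `(R, t)`. All parameter names are chosen here for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with z-depth `depth` to a 3D world point.

    Assumes a pinhole model and a camera-to-world pose: X_world = R @ X_cam + t.
    R is a 3x3 nested sequence, t a length-3 sequence.
    """
    # Camera-frame coordinates: scale the normalized ray by the depth.
    cam = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )
    # Rotate and translate into the shared world frame.
    return tuple(
        sum(R[i][j] * cam[j] for j in range(3)) + t[i] for i in range(3)
    )


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# The principal-point pixel lies on the optical axis.
center = backproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0,
                     IDENTITY, (0.0, 0.0, 0.0))

# A pixel one focal length to the right, seen from a camera shifted by +1 in x.
shifted = backproject(820, 240, 2.0, 500.0, 500.0, 320.0, 240.0,
                      IDENTITY, (1.0, 0.0, 0.0))
```

Here `center` evaluates to `(0.0, 0.0, 2.0)` and `shifted` to `(3.0, 0.0, 2.0)`. Applying the same function to every pixel of every view, each with its own predicted pose, would give a point cloud in one shared frame.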

**Figure 4.** An end-to-end reasoning example. The trajectory illustrates how verification and repair refine a DSL plan before spatial-memory execution; see Sec. 4.5 for details.

… of executing an incorrect plan, the compiler detects the mismatch against the decomposition and rejects the program with explicit feedback. After receiving the error message, the planner generates a revised program that correctly implements th…
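The verify-and-repair trajectory described around Figure 4 can be sketched as a small feedback loop. This is an illustration, not the paper's compiler. The planner is a canned stub standing in for a VLM, a program is reduced to a list of operation names, and the "decomposition" is modeled as an ordered list of required operations. All of these are assumptions introduced here.

```python
def check_against_decomposition(program_ops, required_ops):
    """Return an error message if a required op is missing or out of order, else None."""
    pos = 0
    for op in required_ops:
        try:
            pos = program_ops.index(op, pos) + 1
        except ValueError:
            return f"missing or out-of-order step: {op}"
    return None


def plan_with_repair(planner, required_ops, max_attempts=3):
    """Ask the planner for a program, feeding back the checker's error until it passes."""
    feedback = None
    history = []
    for _ in range(max_attempts):
        program = planner(feedback)
        feedback = check_against_decomposition(program, required_ops)
        history.append((program, feedback))
        if feedback is None:
            return program, history
    raise RuntimeError(f"no valid program after {max_attempts} attempts: {feedback}")


def stub_planner(feedback):
    """Hypothetical stand-in for a VLM: the first draft skips the viewpoint transform."""
    if feedback is None:
        return ["query_object", "render"]
    return ["query_object", "query_camera", "to_view", "render"]


REQUIRED = ["query_object", "to_view", "render"]
program, history = plan_with_repair(stub_planner, REQUIRED)
```

The first draft is rejected with `missing or out-of-order step: to_view`, and the second passes, so `history` holds two entries. Nothing touches the spatial memory until the check succeeds, which is the property the caption attributes to the compiler.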

Abstract

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.
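As a rough illustration of why a skipped viewpoint transformation matters for directional comparison, the sketch below answers "is A left of B?" from two cameras and measures a distance directly in the world frame. It assumes camera-to-world poses `(R, t)`, a camera frame with +x to the right and +z forward, and objects reduced to centroid points. None of these conventions is stated in the abstract.

```python
import math


def world_to_camera(p, R, t):
    """Express world point p in the camera frame: X_cam = R^T (p - t)."""
    d = [p[i] - t[i] for i in range(3)]
    return tuple(sum(R[j][i] * d[j] for j in range(3)) for i in range(3))


def is_left_of(a, b, R, t):
    """True if a appears left of b from the given camera (assumed +x = right)."""
    return world_to_camera(a, R, t)[0] < world_to_camera(b, R, t)[0]


def distance(a, b):
    """Euclidean distance between two centroids; independent of the viewpoint."""
    return math.dist(a, b)


chair = (-1.0, 0.0, 3.0)  # hypothetical object centroids in the world frame
table = (1.0, 0.0, 3.0)

# Camera 0 sits at the origin looking along +z.
R0, t0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.0, 0.0, 0.0)
# Camera 1 faces camera 0 from the far side (rotated 180 degrees about y).
R1, t1 = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]], (0.0, 0.0, 6.0)

front = is_left_of(chair, table, R0, t0)
back = is_left_of(chair, table, R1, t1)
gap = distance(chair, table)
```

`front` is `True` and `back` is `False`: the left/right answer flips with the viewpoint, while `gap` stays `2.0`. A program that compares raw world coordinates without first transforming into the queried camera frame would return the same answer for both cameras.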

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Reasmory adds a validated DSL on top of 3D reconstruction memory to improve VLM spatial reasoning, but the abstract leaves it unclear whether the gains come from the memory or just the constraint.

The main point is that forcing the VLM to output programs in a small, validated DSL when querying reconstructed 3D memory produces steadier results on spatial tasks than free-form tool calls.

The paper starts from the practical problem that spatial cues in multi-view images or videos are scattered and hard for VLMs to organize. It uses reconstruction models to build point clouds plus grounded object instances as explicit memory, then defines a lightweight DSL covering object queries, camera transforms, and rendering. The VLM generates code that gets parsed and checked before execution. This setup directly targets the brittleness of letting the model call tools however it wants.

The approach is straightforward and addresses a real pain point in current VLM tool use. The reported 6-18% gains over GPT-5-mini and Gemini-3-flash on the relevant benchmarks suggest the constrained access helps in practice.

The soft spots sit in the evaluation details. The abstract gives no reconstruction accuracy numbers, no ablation that separates the effect of the 3D memory from the DSL constraint itself, and no error analysis or failure cases. Without those, it remains possible that the improvement comes mostly from limiting what the VLM can do rather than from usable explicit memory. The stress-test concern about reconstruction quality on sparse views therefore still applies until the full paper shows the supporting metrics.

This is aimed at people working on reliable spatial tool use for robotics or AR. A reader who wants concrete ideas for structuring VLM interactions with 3D data would get something usable from the framework.

I would send it to peer review. The idea is concrete enough that referees can check whether the experiments actually support the claim about the memory.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Reasmory, a framework that aggregates sparse multi-view and video observations into explicit 3D spatial memory via reconstruction VFMs (point clouds augmented with grounded object instances) and introduces a lightweight DSL to constrain VLMs to validated program execution for spatial queries, viewpoint transforms, and rendering. It claims this structured access yields consistent 6-18% gains over strong baselines including GPT-5-mini and Gemini-3-flash on multi-view image and video spatial reasoning benchmarks, attributing the improvement to constrained operations rather than free-form tool use.

Significance. If the experimental results prove robust and the reconstruction quality is demonstrably sufficient, the work would be significant for VLM spatial reasoning by providing evidence that explicit 3D memory is most effective when accessed via a validated DSL rather than unconstrained tools. The design choice to parse and validate programs before execution directly targets a known brittleness in tool-augmented VLMs.

major comments (2)

[Abstract] The central claim of 6-18% gains over GPT-5-mini and Gemini-3-flash is presented without any dataset descriptions, baseline implementations, statistical tests, error bars, or ablation studies, preventing assessment of whether gains are attributable to usable 3D memory or to the DSL constraint itself.
[Abstract] The assumption that reconstruction VFMs reliably produce accurate enough explicit memory (point clouds + grounded instances) from sparse/redundant observations is load-bearing for the claimed gains, yet no reconstruction quality metrics, failure modes, or ablation isolating memory fidelity are supplied.

minor comments (1)

The invented terms 'Reasmory' and 'lightweight DSL' appear without an explicit definition or comparison to prior DSLs for spatial or geometric reasoning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the abstract could better support assessment of the claims. The full manuscript contains the requested experimental details in Sections 4 and 5, but we agree the abstract presentation can be improved. We address each point below.

Referee: [Abstract] The central claim of 6-18% gains over GPT-5-mini and Gemini-3-flash is presented without any dataset descriptions, baseline implementations, statistical tests, error bars, or ablation studies, preventing assessment of whether gains are attributable to usable 3D memory or to the DSL constraint itself.

Authors: The abstract is intentionally concise. The manuscript provides full details on the multi-view image and video spatial reasoning benchmarks, exact baseline implementations (including prompting strategies for GPT-5-mini and Gemini-3-flash), results with error bars and statistical tests, and ablations that separate the DSL constraint from free-form tool use and from the 3D memory itself. We will revise the abstract to briefly reference the evaluation benchmarks and note that ablations attribute gains to the validated DSL.

revision: partial

Referee: [Abstract] The assumption that reconstruction VFMs reliably produce accurate enough explicit memory (point clouds + grounded instances) from sparse/redundant observations is load-bearing for the claimed gains, yet no reconstruction quality metrics, failure modes, or ablation isolating memory fidelity are supplied.

Authors: We acknowledge that the current manuscript does not supply quantitative reconstruction quality metrics, explicit failure mode analysis, or an ablation isolating memory fidelity from the DSL. These elements are load-bearing and their absence limits evaluation of the framework. We will add reconstruction error metrics on the evaluation datasets, representative failure cases, and a dedicated ablation on memory fidelity in the revised version.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

The paper introduces Reasmory as a constructive framework (explicit 3D memory + grounded instances + DSL-constrained program execution) and reports empirical gains of 6-18% on spatial reasoning benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim that constrained DSL access outperforms free-form tool use is tested directly via ablation-style comparisons against baselines including GPT-5-mini and Gemini-3-flash; these outcomes are not forced by definition or prior self-referential results. The derivation chain is self-contained as an engineering proposal whose value rests on external benchmark performance rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review performed on abstract only; full paper would be needed to enumerate free parameters, axioms, or invented entities with precision.

axioms (2)

domain assumption: Reconstruction models produce sufficiently accurate explicit spatial memory from multi-view or video input. Stated as the foundation for the memory component.
domain assumption: VLMs can produce valid programs in the introduced DSL. Required for the structured execution approach to function.

invented entities (2)

Reasmory framework (no independent evidence). Purpose: structured program execution over 3D memory. New system proposed in the paper.
lightweight DSL for spatial queries (no independent evidence). Purpose: constrain and validate VLM interactions with 3D memory. Introduced to replace free-form tool calls.

pith-pipeline@v0.9.1-grok · 5791 in / 1427 out tokens · 29316 ms · 2026-06-28T17:44:33.999421+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 2 canonical work pages

[1]

Qwen3-vl technical report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[2]

Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, An- drew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[3]

Tensorf: Tensorial radiance fields

Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean conference on computer vision, pages 333–350. Springer, 2022

2022
[4]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

arXiv 2024
[5]

TTT3r: 3d reconstruction as test-time training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3r: 3d reconstruction as test-time training. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=aMs6FtNaY5

2026
[6]

Think with 3d: Geo- metric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geo- metric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

arXiv 2025
[7]

Flow3r: Fac- tored flow prediction for scalable visual geometry learning

Zhongxiao Cong, Qitao Zhao, Minsik Jeon, and Shubham Tulsiani. Flow3r: Fac- tored flow prediction for scalable visual geometry learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URLhttps://openaccess.thecvf.com/content/CVPR2026/html/ Cong_Flow3r_Factored_Flow_Prediction_for_Scalable_Visual_ Geometry_Learni...

2026
[8]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

2017
[9]

ISBN 9781450383912

Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B. Tenenbaum. Dream- coder: bootstrapping inductive program synthesis with wake-sleep library learning. InProceedings of the 42nd ACM SIGPLAN International Conference on Program- ming Language Design and Implementation, PLD...

work page doi:10.1145/3453483.3454080 2021
[10]

Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

Pith/arXiv arXiv 2025
[11]

Video-r1: Reinforcing video reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2026. URLhttps://openreview.net/forum? id=a2JTVVvcEl

2026
[12]

Pearson Education, 2010

Martin Fowler.Domain-Specific Languages, Portable Documents. Pearson Education, 2010

2010
[13]

Pal: Program-aided language models.arXiv preprint arXiv:2211.10435, 2022

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models.arXiv preprint arXiv:2211.10435, 2022

Pith/arXiv arXiv 2022
[14]

Pursuing minimal sufficiency in spatial reasoning

Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, and Ming-Hsuan Yang. Pursuing minimal sufficiency in spatial reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=bZAKJwyn1n

2026
[15]

Visual programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

2023
[16]

Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, and Yong Li. Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system.arXiv preprint arXiv:2506.19433, 2025

arXiv 2025
[17]

G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

arXiv 2025
[18]

Mllms need 3d-aware represen- tation supervision for scene understanding.CoRR, abs/2506.01946, June 2025

Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware represen- tation supervision for scene understanding.CoRR, abs/2506.01946, June 2025. URL https://doi.org/10.48550/arXiv.2506.01946

work page doi:10.48550/arxiv.2506.01946 2025
[19]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4): 139–1, 2023

2023
[20]

Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 18HE, LI, LIN AND Y ANG: REASMORY

Pith/arXiv arXiv 2024
[21]

Spatialladder: Pro- gressive training for spatial reasoning in vision-language models

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Pro- gressive training for spatial reasoning in vision-language models. InThe Four- teenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=KtrFXlvgrK

2026
[22]

Vmem: Consistent interac- tive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interac- tive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025

2025
[23]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. InThe Four- teenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=yirunib8l8

2026
[24]

Msnav: Zero-shot vision-and-language navigation with dynamic mem- ory and llm spatial reasoning

Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, and Huiling Duan. Msnav: Zero-shot vision-and-language navigation with dynamic mem- ory and llm spatial reasoning. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 20112–20116. IEEE, 2026

2026
[25]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

2023
[26]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024
[27]

Dynamem: Online dynamic spatio-semantic memory for open world mobile manipulation

Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muham- mad Mahi Shafiullah, and Lerrel Pinto. Dynamem: Online dynamic spatio-semantic memory for open world mobile manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13346–13355. IEEE, 2025

2025
[28]

pyspatial: Generating 3d visual programs for zero-shot spatial reasoning.arXiv preprint arXiv:2603.00905, 2026

Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, and Yaqi Xie. pyspatial: Generating 3d visual programs for zero-shot spatial reasoning.arXiv preprint arXiv:2603.00905, 2026

arXiv 2026
[29]

Nerf in the wild: Neural radiance fields for uncon- strained photo collections

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for uncon- strained photo collections. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021

2021
[30]

When and how to develop domain-specific languages.ACM computing surveys (CSUR), 37(4):316–344, 2005

Marjan Mernik, Jan Heering, and Anthony M Sloane. When and how to develop domain-specific languages.ACM computing surveys (CSUR), 37(4):316–344, 2005

2005
[31]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ra- mamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

2021
[32]

Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graph- ics (TOG), 41(4):1–15, 2022

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graph- ics (TOG), 41(4):1–15, 2022. HE, LI, LIN AND Y ANG: REASMORY19

2022
[33]

Oxford university press, 1978

John O’keefe and Lynn Nadel.The hippocampus as a cognitive map. Oxford university press, 1978

1978
[34]

Single unit activity in the rat hippocampus during a spatial memory task.Experimental brain research, 68(1):1–27, 1987

John O’Keefe and Andrew Speakman. Single unit activity in the rat hippocampus during a spatial memory task.Experimental brain research, 68(1):1–27, 1987

1987
[35]

Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

Pith/arXiv arXiv 2025
[36]

Long-context state-space video world models

Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 8733–8744, 2025

2025
[37]

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

arXiv 2025
[38]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PmLR, 2021

2021
[39]

Vqasynth, 2024

remyxai. Vqasynth, 2024. URLhttps://github.com/remyxai/VQASynth/ tree/main. GitHub repository

2024
[40]

Statespacediffuser: Bringing long context to diffusion world models

Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, and Luc Van Gool. Statespacediffuser: Bringing long context to diffusion world models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=g52NwTQj0Q

2026
[41]

Toolformer: Lan- guage models can teach themselves to use tools.Advances in neural information pro- cessing systems, 36:68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Lan- guage models can teach themselves to use tools.Advances in neural information pro- cessing systems, 36:68539–68551, 2023

2023
[42]

The development of spatial representations of large-scale environments.Advances in child development and behavior, 10:9–55, 1975

Alexander W Siegel and Sheldon H White. The development of spatial representations of large-scale environments.Advances in child development and behavior, 10:9–55, 1975

1975
[43]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

2023
[44]

Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

1948
[45]

3d reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025. 20HE, LI, LIN AND Y ANG: REASMORY

2025

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[2] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.

[3] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022.

[4] Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. arXiv preprint arXiv:2401.12168, 2024.

[5] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3r: 3d reconstruction as test-time training. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aMs6FtNaY5.

[6] Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632, 2025.

[7] Zhongxiao Cong, Qitao Zhao, Minsik Jeon, and Shubham Tulsiani. Flow3r: Factored flow prediction for scalable visual geometry learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URL https://openaccess.thecvf.com/content/CVPR2026/html/Cong_Flow3r_Factored_Flow_Prediction_for_Scalable_Visual_Geometry_Learni...

[8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[9] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B. Tenenbaum. Dreamcoder: bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLD... ISBN 9781450383912. doi:10.1145/3453483.3454080. 2021.

[10] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025.

[11] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=a2JTVVvcEl.

[12] Martin Fowler. Domain-Specific Languages, Portable Documents. Pearson Education, 2010.

[13] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.

[14] Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, and Ming-Hsuan Yang. Pursuing minimal sufficiency in spatial reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bZAKJwyn1n.

[15] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.

[16] Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, and Yong Li. Mem4nav: Boosting vision-and-language navigation in urban environments with a hierarchical spatial-cognition long-short memory system. arXiv preprint arXiv:2506.19433, 2025.

[17] Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G$^2$vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv preprint arXiv:2511.21688, 2025.

[18] Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. CoRR, abs/2506.01946, June 2025. URL https://doi.org/10.48550/arXiv.2506.01946.

[19] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[20] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[21] Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KtrFXlvgrK.

[22] Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025.

[23] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=yirunib8l8.

[24] Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, and Huiling Duan. Msnav: Zero-shot vision-and-language navigation with dynamic memory and llm spatial reasoning. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 20112–20116. IEEE, 2026.

[25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

[26] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.

[27] Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Dynamem: Online dynamic spatio-semantic memory for open world mobile manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13346–13355. IEEE, 2025.

[28] Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, and Yaqi Xie. pyspatial: Generating 3d visual programs for zero-shot spatial reasoning. arXiv preprint arXiv:2603.00905, 2026.

[29] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.

[30] Marjan Mernik, Jan Heering, and Anthony M Sloane. When and how to develop domain-specific languages. ACM computing surveys (CSUR), 37(4):316–344, 2005.

[31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[32] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.

[33] John O’Keefe and Lynn Nadel. The hippocampus as a cognitive map. Oxford university press, 1978.

[34] John O’Keefe and Andrew Speakman. Single unit activity in the rat hippocampus during a spatial memory task. Experimental brain research, 68(1):1–27, 1987.

[35] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.

[36] Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8733–8744, 2025.

[37] Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221, 2025.

[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[39] remyxai. Vqasynth, 2024. URL https://github.com/remyxai/VQASynth/tree/main. GitHub repository.

[40] Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, and Luc Van Gool. Statespacediffuser: Bringing long context to diffusion world models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=g52NwTQj0Q.

[41] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023.

[42] Alexander W Siegel and Sheldon H White. The development of spatial representations of large-scale environments. Advances in child development and behavior, 10:9–55, 1975.

[43] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.

[44] Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948.

[45] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025.

[46] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[47] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

[48] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024.

[49] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DTQIjngDta.

[50] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=RnXS7aK4rK.

[51] Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=yk1iqV9Etr.

[52] Zeqi Xiao, Yushi LAN, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=c6CAVKlKmU.

[53] Mi Yan, Jiazhao Zhang, Yan Zhu, and He Wang. Maskclustering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[54] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

[55] Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. Mindjourney: Test-time scaling with world models for spatial reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=L2W4wQsNkY.

[56] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[57] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025.

[58] Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, and Ming-Hsuan Yang. Instainpaint: Instant 3d-scene inpainting with masked large reconstruction model. In Adv. Neural Inform. Process. Syst., 2025.

[59] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

[60] Jiangye Yuan, Gowri Kumar, and Baoyuan Wang. Boosting mllm spatial reasoning with geometrically referenced 3d scene representations. arXiv preprint arXiv:2603.08592, 2026.

[61] Tatiana Zemskova and Dmitry Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8885–8895, 2025.

[62] Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. arXiv preprint arXiv:2603.03269, 2026.

[63] Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, and Liqiu Meng. Constructing coherent spatial memory in llm agents through graph rectification. arXiv preprint arXiv:2510.04195, 2025.

[64] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-time training done right. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Tb9qAxT3xv.

[65] Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029, 2026.

[66] Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025.

[67] Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8600–8612, 2025.

[68] Siyuan Zhou, Yilun Du, Yuncong Yang, Lei Han, Peihao Chen, Dit-Yan Yeung, and Chuang Gan. Learning 3d persistent embodied world models. arXiv preprint arXiv:2505.05495, 2025.

[69] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming visual geometry transformer. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=5APgTKsnx8.

[70] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.