pith. machine review for the scientific record.

arxiv: 2603.20530 · v2 · submitted 2026-03-20 · 💻 cs.RO · cs.CV

Recognition: no theorem link

Memory Over Maps: 3D Object Localization Without Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords object localization · map-free navigation · visual memory · RGB-D keyframes · vision-language models · sparse depth fusion · robot navigation · 3D object localization

The pith

Object localization for robots succeeds by storing only posed RGB-D images and fusing sparse views on demand, without any global 3D map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that explicit 3D scene reconstruction is unnecessary for localizing objects in embodied tasks. Instead, a lightweight memory of posed RGB-D keyframes suffices: at query time the system retrieves candidate views, re-ranks them with a vision-language model, and builds a sparse 3D estimate of the target through depth back-projection and multi-view fusion. This design cuts scene indexing time by more than two orders of magnitude and slashes storage costs while still delivering competitive results on object-goal navigation benchmarks. The work shows that direct semantic reasoning over image memory can replace dense reconstruction pipelines for robot navigation without requiring task-specific training.
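
A minimal sketch of that query-time loop, in Python (all names here are illustrative, not the authors' code): retrieval and VLM re-ranking are passed in as callables because the paper's components are external models, and the median fusion rule is an editorial assumption, since the text specifies only "multi-view fusion".

    from dataclasses import dataclass
    from typing import Callable, List, Tuple
    import numpy as np

    @dataclass
    class Keyframe:
        T_cw: np.ndarray                             # 4x4 camera-to-world pose
        masked_depth: List[Tuple[int, int, float]]   # (u, v, depth_m) pixels inside the target mask

    def backproject(u: int, v: int, depth_m: float,
                    K: np.ndarray, T_cw: np.ndarray) -> np.ndarray:
        """Lift one masked depth pixel into world coordinates (pinhole model)."""
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera ray through the pixel
        p_cam = depth_m * ray                           # 3D point in the camera frame
        return (T_cw @ np.append(p_cam, 1.0))[:3]       # 3D point in the world frame

    def localize(query: str, memory: List[Keyframe],
                 retrieve: Callable, rerank: Callable, K: np.ndarray) -> np.ndarray:
        views = rerank(query, retrieve(query, memory))  # top-K keyframes, VLM-filtered
        points = np.stack([backproject(u, v, d, K, kf.T_cw)
                           for kf in views
                           for (u, v, d) in kf.masked_depth])
        return np.median(points, axis=0)                # fused 3D goal estimate (assumed median)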

Core claim

Object localization reduces to retrieving and re-ranking a small set of posed RGB-D keyframes from a visual memory, followed by on-demand sparse multi-view depth fusion, and this process produces usable 3D target locations for navigation without ever constructing a global point cloud, voxel grid, or scene graph.

What carries the argument

A visual memory of posed RGB-D keyframes together with vision-language model re-ranking and sparse multi-view depth back-projection that produces an on-demand 3D estimate of the queried target.
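
For concreteness, the geometry behind that estimate is standard pinhole back-projection; the abstract does not give the paper's exact fusion rule, so the simple average below is an editorial placeholder. A pixel $(u_i, v_i)$ with depth $d_i$ in view $i$, with intrinsics $K$ and camera-to-world rotation $R_i$ and translation $t_i$, lifts to a world point, and the per-instance estimate aggregates the retrieved views:

    % Standard pinhole back-projection per view; the aggregator (a mean) is
    % assumed, since the paper states only "multi-view fusion".
    \[
      \mathbf{X}_i = R_i \, d_i \, K^{-1} \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} + t_i,
      \qquad
      \hat{\mathbf{X}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{X}_i .
    \]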

If this is right

  • Scene indexing completes over two orders of magnitude faster than reconstruction pipelines.
  • Storage requirements drop substantially because only keyframes are kept instead of dense 3D data.
  • Object-goal navigation performance remains strong across multiple benchmarks with no task-specific training.
  • Direct reasoning over 2D image memory can substitute for dense 3D reconstruction in object-centric robot tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Robots could operate in much larger or changing environments where full reconstruction quickly becomes impractical.
  • The same memory could support other embodied queries such as finding relations between multiple objects without extra mapping.
  • Incremental addition of new keyframes might allow the system to adapt online without rebuilding any global structure.

Load-bearing premise

That vision-language model re-ranking of candidate views plus sparse multi-view depth fusion will reliably produce accurate 3D target locations without a global scene representation or task-specific training.

What would settle it

A direct comparison on the same navigation benchmarks against a standard reconstruction-based pipeline: significantly higher localization error or lower success rate for the map-free method would refute the claim; parity or better would support it.

Figures

Figures reproduced from arXiv: 2603.20530 by Allison Lau, Boyang Sun, Jianwen Cao, Marc Pollefeys, Rui Zhou, Xander Yap.

Figure 1. Image-based target localization without dense 3D reconstruction.
Figure 2. Method overview. Given a query and posed RGB-D keyframes: (1) Retrieval: SigLIP2 embeddings indexed with FAISS retrieve top-K candidates (a sketch of this retrieval pattern follows the figure list). (2) VLM re-rank: a VLM filters false positives (red) and promotes true matches (green). (3) 3D localization: SAM 3 segments the target; masked depth is back-projected, predictions are grouped into object instances, and per-instance multi-view fusion produces a 3D goal estimate.
Figure 3. Top-1 retrieved images for fine-grained and context-dependent queries across small indoor (HM3D, MP3D), large indoor (LaMAR-CAB [59]), and outdoor (LaMAR-LIN [59]) scenes, demonstrating retrieval across scenes of varying scale.
Figure 4. Qualitative comparison with HOV-SG [17]. Our method retrieves the correct view with segmentation mask and produces a fused 3D point cloud, capturing fine-grained details that HOV-SG overlooks.
Figure 5. Real-world robot navigation. Spot navigates to queried objects in an iPad-scanned indoor scene. (a) Third-person view of the robot at the goal. (b) Top-1 retrieved image with segmentation mask (green). (c) Spot's onboard camera view upon arrival. Panel queries, shown without and with VLM re-rank: iPad on the dining table; iPhone on the table; lit lamp.
Figure 6. VLM re-ranking ablation. SigLIP2 retrieves semantically similar but incorrect views (red); VLM re-ranking corrects all three queries (green).
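
The retrieval step in Figure 2 pairs SigLIP2 embeddings with a FAISS index [24]. A minimal sketch of that index-and-search pattern, assuming L2-normalized embeddings and a flat inner-product index (the paper does not state which FAISS index type is used):

    import numpy as np
    import faiss  # similarity-search library, reference [24]

    def build_index(keyframe_embeddings: np.ndarray) -> faiss.IndexFlatIP:
        """Index L2-normalized keyframe embeddings; inner product = cosine similarity."""
        index = faiss.IndexFlatIP(keyframe_embeddings.shape[1])
        index.add(keyframe_embeddings.astype(np.float32))
        return index

    def retrieve_top_k(index: faiss.IndexFlatIP, query_emb: np.ndarray, k: int = 10):
        """Return (similarity scores, keyframe ids) for the top-K candidate views."""
        scores, ids = index.search(query_emb.astype(np.float32).reshape(1, -1), k)
        return scores[0], ids[0]
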
read the original abstract

Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory--without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: https://ruizhou-cn.github.io/memory-over-maps/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a map-free pipeline for 3D object localization that stores only posed RGB-D keyframes as lightweight visual memory. At query time it retrieves candidate views, re-ranks them with a pre-trained vision-language model, and produces a sparse 3D target coordinate via depth back-projection and multi-view fusion. The method is claimed to cut scene-indexing time by over two orders of magnitude and to reduce storage cost, while achieving strong performance on object-goal navigation benchmarks without task-specific training or any global 3D reconstruction.

Significance. If the localization accuracy holds under realistic conditions, the approach would substantially lower the preprocessing and memory burden of embodied navigation pipelines, allowing robots to operate directly from image-based memory rather than maintaining dense maps or scene graphs. The absence of task-specific training and the use of off-the-shelf VLMs are notable strengths that could improve scalability.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by any quantitative metrics, success rates, localization error distributions, or direct comparisons to reconstruction-based baselines. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metric-accurate 3D coordinates.
  2. [Method / Experiments] Method and Experiments: no ablation or failure-case analysis is provided for the two load-bearing assumptions—(1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. The skeptic note correctly identifies that low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.
  3. [Experiments] Experiments: the manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.
minor comments (2)
  1. [Abstract] The project page link is given but the manuscript does not indicate whether code, keyframes, or evaluation scripts will be released, which would be valuable for reproducibility.
  2. [Method] Notation for the multi-view fusion step (e.g., how depths are weighted or outliers rejected) should be formalized with an equation rather than left as prose; one illustrative form is sketched below.
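
For illustration only, one form such an equation could take (an editorial sketch, not the authors' notation): a confidence-weighted mean over back-projected points that survive a median-based outlier gate,

    % Hypothetical fusion rule sketching what the referee requests; the weights
    % w_i and threshold tau are illustrative, not taken from the paper.
    \[
      \hat{\mathbf{X}} = \frac{\sum_{i \in \mathcal{I}} w_i \, \mathbf{X}_i}
                              {\sum_{i \in \mathcal{I}} w_i},
      \qquad
      \mathcal{I} = \bigl\{\, i : \lVert \mathbf{X}_i - \operatorname{med}_j \mathbf{X}_j \rVert \le \tau \,\bigr\},
    \]

where $\mathbf{X}_i$ is the back-projected point from view $i$, $w_i$ a per-view confidence (e.g. a retrieval or segmentation score), and $\tau$ an outlier-rejection radius.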

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the quantitative evidence and analysis in the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by any quantitative metrics, success rates, localization error distributions, or direct comparisons to reconstruction-based baselines. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metric-accurate 3D coordinates.

    Authors: We agree that the abstract and experiments would benefit from more explicit quantitative support. The current manuscript reports navigation success rates on object-goal navigation benchmarks with comparisons to reconstruction-based methods, but we acknowledge that aggregate success rates alone do not fully isolate the accuracy of the 3D localization step. In the revised version, we will add specific success rates, localization error distributions (e.g., mean and median Euclidean errors), and direct numerical comparisons to baselines to better substantiate the claims about metric accuracy. revision: yes

  2. Referee: [Method / Experiments] Method and Experiments: no ablation or failure-case analysis is provided for the two load-bearing assumptions—(1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. The skeptic note correctly identifies that low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.

    Authors: We concur that dedicated ablations and failure-case analysis would strengthen the paper. The manuscript currently emphasizes end-to-end navigation performance, but we will add an ablation study examining the contribution of VLM re-ranking (including metrics on selected view quality such as parallax and visibility) and the multi-view fusion step. We will also include a discussion of failure modes under conditions like low overlap, occlusions, and specular surfaces, with qualitative examples where possible. revision: yes

  3. Referee: [Experiments] Experiments: the manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.

    Authors: We accept this point. To allow isolation of localization performance, the revised experiments section will report per-query 3D localization errors, including mean Euclidean distance to ground-truth target positions across queries, along with error distributions. This will complement the existing navigation success rates and clarify the sources of any failures. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes an engineering pipeline that stores posed RGB-D keyframes as visual memory, retrieves candidates, re-ranks them using an external pre-trained VLM, and computes target locations via standard depth back-projection followed by multi-view averaging. No equations are presented that define outputs in terms of themselves, no parameters are fitted to a data subset and then relabeled as predictions, and no load-bearing claims reduce to self-citations or author-imported uniqueness theorems. The central steps rely on independently established geometric operations and off-the-shelf models whose correctness is established outside the paper, leaving the approach checkable against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed; the approach assumes that standard VLM capabilities and geometric fusion work as described.

pith-pipeline@v0.9.0 · 5561 in / 1094 out tokens · 31533 ms · 2026-05-15T07:43:02.260162+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  1. [1] Habitat: A platform for embodied AI research
      M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9339–9347.

  2. [2] Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation
      S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23171–23181.

  3. [3] Do as I can, not as I say: Grounding language in robotic affordances
      M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., arXiv preprint arXiv:2204.01691, 2022.

  4. [4] OpenMask3D: Open-vocabulary 3D instance segmentation
      A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, arXiv preprint arXiv:2306.13631, 2023.

  5. [5] Locate 3D: Real-world object localization via self-supervised learning in 3D
      S. Arnaud, P. McVay, A. Martin, A. Majumdar, K. M. Jatavallabhula, P. Thomas, R. Partsey, D. Dugas, A. Gejji, A. Sax, et al., arXiv preprint arXiv:2504.14151, 2025.

  6. [6] Object goal navigation using goal-oriented semantic exploration
      D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, Advances in Neural Information Processing Systems, vol. 33, pp. 4247–4258, 2020.

  7. [7] VLFM: Vision-language frontier maps for zero-shot semantic navigation
      N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 42–48.

  8. [8] Voxgraph: Globally consistent, volumetric mapping using signed distance function submaps
      V. Reijgwart, A. Millane, H. Oleynikova, R. Siegwart, C. Cadena, and J. Nieto, IEEE Robotics and Automation Letters, vol. 5, no. 1, pp. 227–234, 2019.

  9. [9] Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning
      H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1366–1373.

  10. [10] OctoMap: An efficient probabilistic 3D mapping framework based on octrees
      A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, Autonomous Robots, vol. 34, no. 3, pp. 189–206, 2013.

  11. [11] Kimera: An open-source library for real-time metric-semantic localization and mapping
      A. Rosinol, M. Abate, Y. Chang, and L. Carlone, in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1689–1696.

  12. [12] Kimera-Multi: A system for distributed multi-robot metric-semantic simultaneous localization and mapping
      Y. Chang, Y. Tian, J. P. How, and L. Carlone, in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 11210–11218.

  13. [13] 3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans
      A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, arXiv preprint arXiv:2002.06289, 2020.

  14. [14] Hydra: A real-time spatial perception system for 3D scene graph construction and optimization
      N. Hughes, Y. Chang, and L. Carlone, arXiv preprint arXiv:2201.13360, 2022.

  15. [15] Clio: Real-time task-driven open-set 3D scene graphs
      D. Maggio, Y. Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone, IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8921–8928, 2024.

  16. [16] Visual language maps for robot navigation
      C. Huang, O. Mees, A. Zeng, and W. Burgard, in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 10608–10615.

  17. [17] Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation
      A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard, in First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.

  18. [18] ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning
      Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, et al., in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 5021–5028.

  19. [19] NetVLAD: CNN architecture for weakly supervised place recognition
      R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.

  20. [20] SuperGlue: Learning feature matching with graph neural networks
      P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4938–4947.

  21. [21] Depth Anything 3: Recovering the visual space from any views
      H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, arXiv preprint arXiv:2511.10647, 2025.

  22. [22] Visual instruction tuning
      H. Liu, C. Li, Q. Wu, and Y. J. Lee, Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023.

  23. [23] SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features
      M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, et al., arXiv preprint arXiv:2502.14786, 2025.

  24. [24] Billion-scale similarity search with GPUs
      J. Johnson, M. Douze, and H. Jégou, IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.

  25. [25] Language-driven semantic segmentation
      B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, arXiv preprint arXiv:2201.03546, 2022.

  26. [26] Open-vocabulary functional 3D scene graphs for real-world indoor spaces
      C. Zhang, A. Delitzas, F. Wang, R. Zhang, X. Ji, M. Pollefeys, and F. Engelmann, in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19401–19413.

  27. [27] KeySG: Hierarchical keyframe-based 3D scene graphs
      A. Werby, D. Rotondi, F. Scaparro, and K. O. Arras, arXiv preprint arXiv:2510.01049, 2025.

  28. [28] DynaMem: Online dynamic spatio-semantic memory for open world mobile manipulation
      P. Liu, Z. Guo, M. Warke, S. Chintala, C. Paxton, N. M. M. Shafiullah, and L. Pinto, in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 13346–13355.

  29. [29] LagMemo: Language 3D Gaussian splatting memory for multi-modal open-vocabulary multi-goal visual navigation
      H. Zhou, X. Wang, H. Li, F. Sun, S. Guo, G. Qi, J. Xu, and H. Zhao, arXiv preprint arXiv:2510.24118, 2025.

  30. [30] Learning transferable visual models from natural language supervision
      A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.

  31. [31] Scaling up visual and vision-language representation learning with noisy text supervision
      C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, in International Conference on Machine Learning, PMLR, 2021, pp. 4904–4916.

  32. [32] FLAVA: A foundational language and vision alignment model
      A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15638–15650.

  33. [33] Qwen2.5-VL technical report
      S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, arXiv preprint arXiv:2502.13923, 2025.

  34. [34] DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames
      E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, arXiv preprint arXiv:1911.00357, 2019.

  35. [35] Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale
      R. Ramrakhya, E. Undersander, D. Batra, and A. Das, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5173–5183.

  36. [36] Offline visual representation learning for embodied navigation
      K. Yadav, R. Ramrakhya, A. Majumdar, V.-P. Berges, S. Kuhar, D. Batra, A. Baevski, and O. Maksymets, in Workshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023.

  37. [37] PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav
      R. Ramrakhya, D. Batra, E. Wijmans, and A. Das, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17896–17906.

  38. [38] ZSON: Zero-shot object-goal navigation using multimodal goal embeddings
      A. Majumdar, G. Aggarwal, B. Devnani, J. Hoffman, and D. Batra, Advances in Neural Information Processing Systems, vol. 35, pp. 32340–32352, 2022.

  39. [39] Prioritized semantic learning for zero-shot instance navigation
      X. Sun, L. Liu, H. Zhi, R. Qiu, and J. Liang, in European Conference on Computer Vision, Springer, 2024, pp. 161–178.

  40. [40] GOAT: Go to any thing
      M. Chang, T. Gervet, M. Khanna, S. Yenamandra, D. Shah, S. Y. Min, K. Shah, C. Paxton, S. Gupta, D. Batra, et al., arXiv preprint arXiv:2311.06430, 2023.

  41. [41] Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks
      J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang, arXiv preprint arXiv:2412.06224, 2024.

  42. [42] ESC: Exploration with soft commonsense constraints for zero-shot object navigation
      K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, in International Conference on Machine Learning, PMLR, 2023, pp. 42829–42842.

  43. [43] OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models
      Y. Kuang, H. Lin, and M. Jiang, in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 338–351.

  44. [44] UniGoal: Towards universal zero-shot goal-oriented navigation
      H. Yin, X. Xu, L. Zhao, Z. Wang, J. Zhou, and J. Lu, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 19057–19066.

  45. [45] Tango: Training-free embodied AI agents for open-world tasks
      F. Ziliotto, T. Campari, L. Serafini, and L. Ballan, in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24603–24613.

  46. [46] InstructNav: Zero-shot system for generic instruction navigation in unexplored environment
      Y. Long, W. Cai, H. Wang, G. Zhan, and H. Dong, arXiv preprint arXiv:2406.04882, 2024.

  47. [47] TriHelper: Zero-shot object navigation with dynamic assistance
      L. Zhang, Q. Zhang, H. Wang, E. Xiao, Z. Jiang, H. Chen, and R. Xu, in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 10035–10042.

  48. [48] SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation
      H. Yin, X. Xu, Z. Wu, J. Zhou, and J. Lu, Advances in Neural Information Processing Systems, vol. 37, pp. 5285–5307, 2024.

  49. [49] SAM 3: Segment anything with concepts
      N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, et al., arXiv preprint arXiv:2511.16719, 2025.

  50. [50] GOAT-Bench: A benchmark for multi-modal lifelong navigation
      M. Khanna, R. Ramrakhya, G. Chhablani, S. Yenamandra, T. Gervet, M. Chang, Z. Kira, D. S. Chaplot, D. Batra, and R. Mottaghi, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16373–16383.

  51. [51] Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI
      S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., arXiv preprint arXiv:2109.08238, 2021.

  52. [52] Matterport3D: Learning from RGB-D data in indoor environments
      A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, arXiv preprint arXiv:1709.06158, 2017.

  53. [53] HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation
      N. Yokoyama, R. Ramrakhya, A. Das, D. Batra, and S. Ha, in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 5543–5550.

  54. [54] L3MVN: Leveraging large language models for visual target navigation
      B. Yu, H. Kasaei, and M. Cao, in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 3554–3560.

  55. [55] OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
      K. Yadav, A. Majumdar, R. Ramrakhya, N. Yokoyama, A. Baevski, Z. Kira, O. Maksymets, and D. Batra, arXiv preprint arXiv:2303.07798, 2023.

  56. [56] Move to understand a 3D scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation
      Z. Zhu, X. Wang, Y. Li, Z. Zhang, X. Ma, Y. Chen, B. Jia, W. Liang, Q. Yu, Z. Deng, et al., in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8120–8132.

  57. [57] On evaluation of embodied navigation agents
      P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., arXiv preprint arXiv:1807.06757, 2018.

  58. [58] SUN RGB-D: A RGB-D scene understanding benchmark suite
      S. Song, S. P. Lichtenberg, and J. Xiao, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.

  59. [59] LaMAR: Benchmarking localization and mapping for augmented reality
      P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, V. Larsson, O. Miksik, and M. Pollefeys, in European Conference on Computer Vision, Springer, 2022, pp. 686–704.