pith. machine review for the scientific record.

arxiv: 2603.20530 · v2 · submitted 2026-03-20 · 💻 cs.RO · cs.CV

Recognition: no theorem link

Memory Over Maps: 3D Object Localization Without Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords object localization · map-free navigation · visual memory · RGB-D keyframes · vision-language models · sparse depth fusion · robot navigation · 3D object localization

The pith

Object localization for robots succeeds by storing only posed RGB-D images and fusing sparse views on demand, without any global 3D map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that explicit 3D scene reconstruction is unnecessary for localizing objects in embodied tasks. Instead, a lightweight memory of posed RGB-D keyframes suffices: at query time the system retrieves candidate views, re-ranks them with a vision-language model, and builds a sparse 3D estimate of the target through depth back-projection and multi-view fusion. This design cuts scene indexing time by more than two orders of magnitude and slashes storage costs while still delivering competitive results on object-goal navigation benchmarks. The work shows that direct semantic reasoning over image memory can replace dense reconstruction pipelines for robot navigation without requiring task-specific training.
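
A minimal sketch of that query-time loop, in Python (all names here are illustrative, not the authors' code): retrieval and VLM re-ranking are passed in as callables because the paper's components are external models, and the median fusion rule is an editorial assumption, since the text specifies only "multi-view fusion".

    from dataclasses import dataclass
    from typing import Callable, List, Tuple
    import numpy as np

    @dataclass
    class Keyframe:
        T_cw: np.ndarray                             # 4x4 camera-to-world pose
        masked_depth: List[Tuple[int, int, float]]   # (u, v, depth_m) pixels inside the target mask

    def backproject(u: int, v: int, depth_m: float,
                    K: np.ndarray, T_cw: np.ndarray) -> np.ndarray:
        """Lift one masked depth pixel into world coordinates (pinhole model)."""
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera ray through the pixel
        p_cam = depth_m * ray                           # 3D point in the camera frame
        return (T_cw @ np.append(p_cam, 1.0))[:3]       # 3D point in the world frame

    def localize(query: str, memory: List[Keyframe],
                 retrieve: Callable, rerank: Callable, K: np.ndarray) -> np.ndarray:
        views = rerank(query, retrieve(query, memory))  # top-K keyframes, VLM-filtered
        points = np.stack([backproject(u, v, d, K, kf.T_cw)
                           for kf in views
                           for (u, v, d) in kf.masked_depth])
        return np.median(points, axis=0)                # fused 3D goal estimate (assumed median)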

Core claim

Object localization reduces to retrieving and re-ranking a small set of posed RGB-D keyframes from a visual memory, followed by on-demand sparse multi-view depth fusion, and this process produces usable 3D target locations for navigation without ever constructing a global point cloud, voxel grid, or scene graph.

What carries the argument

A visual memory of posed RGB-D keyframes together with vision-language model re-ranking and sparse multi-view depth back-projection that produces an on-demand 3D estimate of the queried target.
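
For concreteness, the geometry behind that estimate is standard pinhole back-projection; the abstract does not give the paper's exact fusion rule, so the simple average below is an editorial placeholder. A pixel $(u_i, v_i)$ with depth $d_i$ in view $i$, with intrinsics $K$ and camera-to-world rotation $R_i$ and translation $t_i$, lifts to a world point, and the per-instance estimate aggregates the retrieved views:

    % Standard pinhole back-projection per view; the aggregator (a mean) is
    % assumed, since the paper states only "multi-view fusion".
    \[
      \mathbf{X}_i = R_i \, d_i \, K^{-1} \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} + t_i,
      \qquad
      \hat{\mathbf{X}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{X}_i .
    \]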

If this is right

  • Scene indexing completes over two orders of magnitude faster than reconstruction pipelines.
  • Storage requirements drop substantially because only keyframes are kept instead of dense 3D data.
  • Object-goal navigation performance remains strong across multiple benchmarks with no task-specific training.
  • Direct reasoning over 2D image memory can substitute for dense 3D reconstruction in object-centric robot tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Robots could operate in much larger or changing environments where full reconstruction quickly becomes impractical.
  • The same memory could support other embodied queries such as finding relations between multiple objects without extra mapping.
  • Incremental addition of new keyframes might allow the system to adapt online without rebuilding any global structure.

Load-bearing premise

That vision-language model re-ranking of candidate views plus sparse multi-view depth fusion will reliably produce accurate 3D target locations without a global scene representation or task-specific training.

What would settle it

A direct comparison on the same navigation benchmarks against a standard reconstruction-based pipeline: significantly higher localization error or lower success rate for the map-free method would refute the claim; parity or better would support it.

Figures

Figures reproduced from arXiv: 2603.20530 by Allison Lau, Boyang Sun, Jianwen Cao, Marc Pollefeys, Rui Zhou, Xander Yap.

Figure 1. Image-based target localization without dense 3D reconstruction.
Figure 2. Method overview. Given a query and posed RGB-D keyframes: (1) Retrieval: SigLIP2 embeddings indexed with FAISS retrieve top-K candidates (a sketch of this retrieval pattern follows the figure list). (2) VLM re-rank: a VLM filters false positives (red) and promotes true matches (green). (3) 3D localization: SAM 3 segments the target; masked depth is back-projected, predictions are grouped into object instances, and per-instance multi-view fusion produces a 3D goal estimate.
Figure 3. Top-1 retrieved images for fine-grained and context-dependent queries across small indoor (HM3D, MP3D), large indoor (LaMAR-CAB [59]), and outdoor (LaMAR-LIN [59]) scenes, demonstrating retrieval across scenes of varying scale.
Figure 4. Qualitative comparison with HOV-SG [17]. Our method retrieves the correct view with segmentation mask and produces a fused 3D point cloud, capturing fine-grained details that HOV-SG overlooks.
Figure 5. Real-world robot navigation. Spot navigates to queried objects in an iPad-scanned indoor scene. (a) Third-person view of the robot at the goal. (b) Top-1 retrieved image with segmentation mask (green). (c) Spot's onboard camera view upon arrival. Panel queries, shown without and with VLM re-rank: iPad on the dining table; iPhone on the table; lit lamp.
Figure 6. VLM re-ranking ablation. SigLIP2 retrieves semantically similar but incorrect views (red); VLM re-ranking corrects all three queries (green).
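
The retrieval step in Figure 2 pairs SigLIP2 embeddings with a FAISS index [24]. A minimal sketch of that index-and-search pattern, assuming L2-normalized embeddings and a flat inner-product index (the paper does not state which FAISS index type is used):

    import numpy as np
    import faiss  # similarity-search library, reference [24]

    def build_index(keyframe_embeddings: np.ndarray) -> faiss.IndexFlatIP:
        """Index L2-normalized keyframe embeddings; inner product = cosine similarity."""
        index = faiss.IndexFlatIP(keyframe_embeddings.shape[1])
        index.add(keyframe_embeddings.astype(np.float32))
        return index

    def retrieve_top_k(index: faiss.IndexFlatIP, query_emb: np.ndarray, k: int = 10):
        """Return (similarity scores, keyframe ids) for the top-K candidate views."""
        scores, ids = index.search(query_emb.astype(np.float32).reshape(1, -1), k)
        return scores[0], ids[0]
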
read the original abstract

Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory--without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: https://ruizhou-cn.github.io/memory-over-maps/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a map-free pipeline for 3D object localization that stores only posed RGB-D keyframes as lightweight visual memory. At query time it retrieves candidate views, re-ranks them with a pre-trained vision-language model, and produces a sparse 3D target coordinate via depth back-projection and multi-view fusion. The method is claimed to cut scene-indexing time by over two orders of magnitude and to reduce storage cost, while achieving strong performance on object-goal navigation benchmarks without task-specific training or any global 3D reconstruction.

Significance. If the localization accuracy holds under realistic conditions, the approach would substantially lower the preprocessing and memory burden of embodied navigation pipelines, allowing robots to operate directly from image-based memory rather than maintaining dense maps or scene graphs. The absence of task-specific training and the use of off-the-shelf VLMs are notable strengths that could improve scalability.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by any quantitative metrics, success rates, localization error distributions, or direct comparisons to reconstruction-based baselines. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metric-accurate 3D coordinates.
  2. [Method / Experiments] Method and Experiments: no ablation or failure-case analysis is provided for the two load-bearing assumptions—(1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. The skeptic note correctly identifies that low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.
  3. [Experiments] Experiments: the manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.
minor comments (2)
  1. [Abstract] The project page link is given but the manuscript does not indicate whether code, keyframes, or evaluation scripts will be released, which would be valuable for reproducibility.
  2. [Method] Notation for the multi-view fusion step (e.g., how depths are weighted or outliers rejected) should be formalized with an equation rather than left as prose; one illustrative form is sketched below.
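
For illustration only, one form such an equation could take (an editorial sketch, not the authors' notation): a confidence-weighted mean over back-projected points that survive a median-based outlier gate,

    % Hypothetical fusion rule sketching what the referee requests; the weights
    % w_i and threshold tau are illustrative, not taken from the paper.
    \[
      \hat{\mathbf{X}} = \frac{\sum_{i \in \mathcal{I}} w_i \, \mathbf{X}_i}
                              {\sum_{i \in \mathcal{I}} w_i},
      \qquad
      \mathcal{I} = \bigl\{\, i : \lVert \mathbf{X}_i - \operatorname{med}_j \mathbf{X}_j \rVert \le \tau \,\bigr\},
    \]

where $\mathbf{X}_i$ is the back-projected point from view $i$, $w_i$ a per-view confidence (e.g. a retrieval or segmentation score), and $\tau$ an outlier-rejection radius.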

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the quantitative evidence and analysis in the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by any quantitative metrics, success rates, localization error distributions, or direct comparisons to reconstruction-based baselines. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metric-accurate 3D coordinates.

    Authors: We agree that the abstract and experiments would benefit from more explicit quantitative support. The current manuscript reports navigation success rates on object-goal navigation benchmarks with comparisons to reconstruction-based methods, but we acknowledge that aggregate success rates alone do not fully isolate the accuracy of the 3D localization step. In the revised version, we will add specific success rates, localization error distributions (e.g., mean and median Euclidean errors), and direct numerical comparisons to baselines to better substantiate the claims about metric accuracy. revision: yes

  2. Referee: [Method / Experiments] Method and Experiments: no ablation or failure-case analysis is provided for the two load-bearing assumptions—(1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. The skeptic note correctly identifies that low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.

    Authors: We concur that dedicated ablations and failure-case analysis would strengthen the paper. The manuscript currently emphasizes end-to-end navigation performance, but we will add an ablation study examining the contribution of VLM re-ranking (including metrics on selected view quality such as parallax and visibility) and the multi-view fusion step. We will also include a discussion of failure modes under conditions like low overlap, occlusions, and specular surfaces, with qualitative examples where possible. revision: yes

  3. Referee: [Experiments] Experiments: the manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.

    Authors: We accept this point. To allow isolation of localization performance, the revised experiments section will report per-query 3D localization errors, including mean Euclidean distance to ground-truth target positions across queries, along with error distributions. This will complement the existing navigation success rates and clarify the sources of any failures. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes an engineering pipeline that stores posed RGB-D keyframes as visual memory, retrieves candidates, re-ranks them using an external pre-trained VLM, and computes target locations via standard depth back-projection followed by multi-view averaging. No equations are presented that define outputs in terms of themselves, no parameters are fitted to a data subset and then relabeled as predictions, and no load-bearing claims reduce to self-citations or author-imported uniqueness theorems. The central steps rely on independently established geometric operations and off-the-shelf models whose correctness is established outside the paper, leaving the approach checkable against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed; the approach assumes that standard VLM capabilities and geometric fusion work as described.

pith-pipeline@v0.9.0 · 5561 in / 1094 out tokens · 31533 ms · 2026-05-15T07:43:02.260162+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  1. [1] Habitat: A platform for embodied AI research
      M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9339–9347.

  2. [2] Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation
      S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23171–23181.

  3. [3] Do as I can, not as I say: Grounding language in robotic affordances
      M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., arXiv preprint arXiv:2204.01691, 2022.

  4. [4] OpenMask3D: Open-vocabulary 3D instance segmentation
      A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, arXiv preprint arXiv:2306.13631, 2023.

  5. [5] Locate 3D: Real-world object localization via self-supervised learning in 3D
      S. Arnaud, P. McVay, A. Martin, A. Majumdar, K. M. Jatavallabhula, P. Thomas, R. Partsey, D. Dugas, A. Gejji, A. Sax, et al., arXiv preprint arXiv:2504.14151, 2025.

  6. [6] Object goal navigation using goal-oriented semantic exploration
      D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, Advances in Neural Information Processing Systems, vol. 33, pp. 4247–4258, 2020.

  7. [7] VLFM: Vision-language frontier maps for zero-shot semantic navigation
      N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 42–48.

  8. [8] Voxgraph: Globally consistent, volumetric mapping using signed distance function submaps
      V. Reijgwart, A. Millane, H. Oleynikova, R. Siegwart, C. Cadena, and J. Nieto, IEEE Robotics and Automation Letters, vol. 5, no. 1, pp. 227–234, 2019.

  9. [9] Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning
      H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1366–1373.

  10. [10] OctoMap: An efficient probabilistic 3D mapping framework based on octrees
      A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, Autonomous Robots, vol. 34, no. 3, pp. 189–206, 2013.

  11. [11] Kimera: An open-source library for real-time metric-semantic localization and mapping
      A. Rosinol, M. Abate, Y. Chang, and L. Carlone, in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1689–1696.

  12. [12] Kimera-Multi: A system for distributed multi-robot metric-semantic simultaneous localization and mapping
      Y. Chang, Y. Tian, J. P. How, and L. Carlone, in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 11210–11218.

  13. [13] 3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans
      A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, arXiv preprint arXiv:2002.06289, 2020.

  14. [14] Hydra: A real-time spatial perception system for 3D scene graph construction and optimization
      N. Hughes, Y. Chang, and L. Carlone, arXiv preprint arXiv:2201.13360, 2022.

  15. [15] Clio: Real-time task-driven open-set 3D scene graphs
      D. Maggio, Y. Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone, IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8921–8928, 2024.

  16. [16] Visual language maps for robot navigation
      C. Huang, O. Mees, A. Zeng, and W. Burgard, in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 10608–10615.

  17. [17] Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation
      A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard, in First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.

  18. [18] ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning
      Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, et al., in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 5021–5028.

  19. [19] NetVLAD: CNN architecture for weakly supervised place recognition
      R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.

  20. [20] SuperGlue: Learning feature matching with graph neural networks
      P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4938–4947.

  21. [21] Depth Anything 3: Recovering the visual space from any views
      H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, arXiv preprint arXiv:2511.10647, 2025.

  22. [22] Visual instruction tuning
      H. Liu, C. Li, Q. Wu, and Y. J. Lee, Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023.

  23. [23] SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features
      M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, et al., arXiv preprint arXiv:2502.14786, 2025.

  24. [24] Billion-scale similarity search with GPUs
      J. Johnson, M. Douze, and H. Jégou, IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.

  25. [25] Language-driven semantic segmentation
      B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, arXiv preprint arXiv:2201.03546, 2022.

  26. [26] Open-vocabulary functional 3D scene graphs for real-world indoor spaces
      C. Zhang, A. Delitzas, F. Wang, R. Zhang, X. Ji, M. Pollefeys, and F. Engelmann, in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19401–19413.

  27. [27] KeySG: Hierarchical keyframe-based 3D scene graphs
      A. Werby, D. Rotondi, F. Scaparro, and K. O. Arras, arXiv preprint arXiv:2510.01049, 2025.

  28. [28] DynaMem: Online dynamic spatio-semantic memory for open world mobile manipulation
      P. Liu, Z. Guo, M. Warke, S. Chintala, C. Paxton, N. M. M. Shafiullah, and L. Pinto, in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 13346–13355.

  29. [29] LagMemo: Language 3D Gaussian splatting memory for multi-modal open-vocabulary multi-goal visual navigation
      H. Zhou, X. Wang, H. Li, F. Sun, S. Guo, G. Qi, J. Xu, and H. Zhao, arXiv preprint arXiv:2510.24118, 2025.

  30. [30] Learning transferable visual models from natural language supervision
      A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.

  31. [31] Scaling up visual and vision-language representation learning with noisy text supervision
      C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, in International Conference on Machine Learning, PMLR, 2021, pp. 4904–4916.

  32. [32] FLAVA: A foundational language and vision alignment model
      A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15638–15650.

  33. [33] Qwen2.5-VL technical report
      S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, arXiv preprint arXiv:2502.13923, 2025.

  34. [34] DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames
      E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, arXiv preprint arXiv:1911.00357, 2019.

  35. [35] Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale
      R. Ramrakhya, E. Undersander, D. Batra, and A. Das, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5173–5183.

  36. [36] Offline visual representation learning for embodied navigation
      K. Yadav, R. Ramrakhya, A. Majumdar, V.-P. Berges, S. Kuhar, D. Batra, A. Baevski, and O. Maksymets, in Workshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023.

  37. [37] PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav
      R. Ramrakhya, D. Batra, E. Wijmans, and A. Das, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17896–17906.

  38. [38] ZSON: Zero-shot object-goal navigation using multimodal goal embeddings
      A. Majumdar, G. Aggarwal, B. Devnani, J. Hoffman, and D. Batra, Advances in Neural Information Processing Systems, vol. 35, pp. 32340–32352, 2022.

  39. [39] Prioritized semantic learning for zero-shot instance navigation
      X. Sun, L. Liu, H. Zhi, R. Qiu, and J. Liang, in European Conference on Computer Vision, Springer, 2024, pp. 161–178.

  40. [40] GOAT: Go to any thing
      M. Chang, T. Gervet, M. Khanna, S. Yenamandra, D. Shah, S. Y. Min, K. Shah, C. Paxton, S. Gupta, D. Batra, et al., arXiv preprint arXiv:2311.06430, 2023.

  41. [41] Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks
      J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang, arXiv preprint arXiv:2412.06224, 2024.

  42. [42] ESC: Exploration with soft commonsense constraints for zero-shot object navigation
      K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, in International Conference on Machine Learning, PMLR, 2023, pp. 42829–42842.

  43. [43] OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models
      Y. Kuang, H. Lin, and M. Jiang, in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 338–351.

  44. [44] UniGoal: Towards universal zero-shot goal-oriented navigation
      H. Yin, X. Xu, L. Zhao, Z. Wang, J. Zhou, and J. Lu, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 19057–19066.

  45. [45] Tango: Training-free embodied AI agents for open-world tasks
      F. Ziliotto, T. Campari, L. Serafini, and L. Ballan, in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24603–24613.

  46. [46] InstructNav: Zero-shot system for generic instruction navigation in unexplored environment
      Y. Long, W. Cai, H. Wang, G. Zhan, and H. Dong, arXiv preprint arXiv:2406.04882, 2024.

  47. [47] TriHelper: Zero-shot object navigation with dynamic assistance
      L. Zhang, Q. Zhang, H. Wang, E. Xiao, Z. Jiang, H. Chen, and R. Xu, in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 10035–10042.

  48. [48] SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation
      H. Yin, X. Xu, Z. Wu, J. Zhou, and J. Lu, Advances in Neural Information Processing Systems, vol. 37, pp. 5285–5307, 2024.

  49. [49] SAM 3: Segment anything with concepts
      N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, et al., arXiv preprint arXiv:2511.16719, 2025.

  50. [50] GOAT-Bench: A benchmark for multi-modal lifelong navigation
      M. Khanna, R. Ramrakhya, G. Chhablani, S. Yenamandra, T. Gervet, M. Chang, Z. Kira, D. S. Chaplot, D. Batra, and R. Mottaghi, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16373–16383.

  51. [51] Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI
      S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., arXiv preprint arXiv:2109.08238, 2021.

  52. [52] Matterport3D: Learning from RGB-D data in indoor environments
      A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, arXiv preprint arXiv:1709.06158, 2017.

  53. [53] HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation
      N. Yokoyama, R. Ramrakhya, A. Das, D. Batra, and S. Ha, in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 5543–5550.

  54. [54] L3MVN: Leveraging large language models for visual target navigation
      B. Yu, H. Kasaei, and M. Cao, in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 3554–3560.

  55. [55] OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
      K. Yadav, A. Majumdar, R. Ramrakhya, N. Yokoyama, A. Baevski, Z. Kira, O. Maksymets, and D. Batra, arXiv preprint arXiv:2303.07798, 2023.

  56. [56] Move to understand a 3D scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation
      Z. Zhu, X. Wang, Y. Li, Z. Zhang, X. Ma, Y. Chen, B. Jia, W. Liang, Q. Yu, Z. Deng, et al., in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8120–8132.

  57. [57] On evaluation of embodied navigation agents
      P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., arXiv preprint arXiv:1807.06757, 2018.

  58. [58] SUN RGB-D: A RGB-D scene understanding benchmark suite
      S. Song, S. P. Lichtenberg, and J. Xiao, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.

  59. [59] LaMAR: Benchmarking localization and mapping for augmented reality
      P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, V. Larsson, O. Miksik, and M. Pollefeys, in European Conference on Computer Vision, Springer, 2022, pp. 686–704.