Memory Over Maps: 3D Object Localization Without Reconstruction
Pith reviewed 2026-05-15 07:43 UTC · model grok-4.3
The pith
Object localization for robots succeeds by storing only posed RGB-D images and fusing sparse views on demand, without any global 3D map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Object localization reduces to retrieving and re-ranking a small set of posed RGB-D keyframes from a visual memory, followed by on-demand sparse multi-view depth fusion, and this process produces usable 3D target locations for navigation without ever constructing a global point cloud, voxel grid, or scene graph.
What carries the argument
A visual memory of posed RGB-D keyframes together with vision-language model re-ranking and sparse multi-view depth back-projection that produces an on-demand 3D estimate of the queried target.
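The geometric half of this machinery can be sketched in a few lines, assuming a standard pinhole intrinsics matrix K and 4x4 camera-to-world poses; the paper's exact fusion rule is not specified here, so the median-based fusion below is an illustrative robust choice, not the authors' implementation.

```python
import numpy as np

def backproject(u, v, depth, K, T_wc):
    """Back-project pixel (u, v) with metric depth into world coordinates.

    K: 3x3 pinhole intrinsics; T_wc: 4x4 camera-to-world pose.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel -> metric point in the camera frame (homogeneous), then to world.
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    return (T_wc @ p_cam)[:3]

def fuse_views(points):
    """Fuse per-view 3D estimates of the same target into one location.

    Coordinate-wise median: a simple robust fusion that tolerates a
    minority of outlier depth readings better than plain averaging.
    """
    return np.median(np.asarray(points, dtype=float), axis=0)
```

For example, the principal point at depth 2 m with an identity pose back-projects to (0, 0, 2), and fusing three per-view estimates where one depth reading is wildly off still lands near the consensus.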
If this is right
- Scene indexing completes over two orders of magnitude faster than reconstruction pipelines.
- Storage requirements drop substantially because only keyframes are kept instead of dense 3D data.
- Object-goal navigation performance remains strong across multiple benchmarks with no task-specific training.
- Direct reasoning over 2D image memory can substitute for dense 3D reconstruction in object-centric robot tasks.
Where Pith is reading between the lines
- Robots could operate in much larger or changing environments where full reconstruction quickly becomes impractical.
- The same memory could support other embodied queries such as finding relations between multiple objects without extra mapping.
- Incremental addition of new keyframes might allow the system to adapt online without rebuilding any global structure.
Load-bearing premise
That vision-language model re-ranking of candidate views plus sparse multi-view depth fusion will reliably produce accurate 3D target locations without a global scene representation or task-specific training.
What would settle it
Direct comparison on the same navigation benchmarks showing that the map-free method produces significantly higher localization error or lower success rate than a standard reconstruction-based pipeline.
Original abstract
Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory--without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: https://ruizhou-cn.github.io/memory-over-maps/
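The retrieval step described in the abstract can be sketched as nearest-neighbor search over keyframe embeddings. This assumes the memory is indexed by L2-normalized image embeddings (e.g., from a CLIP-style encoder) — an assumption for illustration, since the abstract does not name the encoder:

```python
import numpy as np

def retrieve_topk(query_emb, keyframe_embs, k=5):
    """Return indices of the k keyframes most similar to the query, best first.

    With L2-normalized embeddings, the dot product equals cosine similarity.
    """
    sims = keyframe_embs @ query_emb
    return np.argsort(-sims)[:k]
```

In the described pipeline, the vision-language model then re-ranks only these k candidates, keeping the expensive model off the full memory; at scale, the brute-force dot product would typically be replaced by an approximate index.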
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a map-free pipeline for 3D object localization that stores only posed RGB-D keyframes as a lightweight visual memory. At query time it retrieves candidate views, re-ranks them with a pre-trained vision-language model, and produces a sparse 3D target estimate via depth back-projection and multi-view fusion. The method is claimed to cut scene-indexing time by over two orders of magnitude, to reduce storage cost, and to achieve strong performance on object-goal navigation benchmarks without task-specific training or any global 3D reconstruction.
Significance. If the localization accuracy holds under realistic conditions, the approach would substantially lower the preprocessing and memory burden of embodied navigation pipelines, allowing robots to operate directly from image-based memory rather than maintaining dense maps or scene graphs. The absence of task-specific training and the use of off-the-shelf VLMs are notable strengths that could improve scalability.
Major comments (3)
- [Abstract / Experiments] The central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by quantitative evidence: no success rates, localization error distributions, or direct comparisons to reconstruction-based baselines are given. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metrically accurate 3D coordinates.
- [Method / Experiments] No ablation or failure-case analysis is provided for the two load-bearing assumptions: (1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. Low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.
- [Experiments] The manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to the ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.
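The per-query metric the referee asks for is straightforward to compute. A minimal sketch, where the 0.5 m success threshold is a hypothetical choice for illustration, not a number from the paper:

```python
import numpy as np

def localization_errors(pred, gt, success_at=0.5):
    """Per-query Euclidean error between predicted and ground-truth
    3D target positions, plus summary statistics.

    pred, gt: (N, 3) arrays; success_at: distance threshold in meters.
    """
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float),
                         axis=1)
    return {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "success_rate": float((err <= success_at).mean()),
    }
```

Reporting the full error distribution alongside navigation success would let readers separate localization failures from planner or controller failures.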
Minor comments (2)
- [Abstract] The project page link is given but the manuscript does not indicate whether code, keyframes, or evaluation scripts will be released, which would be valuable for reproducibility.
- [Method] Notation for the multi-view fusion step (e.g., how depths are weighted or outliers rejected) should be formalized with an equation rather than left as prose.
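As an illustration of what such a formalization could look like (a sketch of one plausible rule, not the paper's actual equation), a weighted fusion with median-based outlier rejection could be written as:

```latex
\hat{\mathbf{x}}
  = \frac{\sum_{i \in \mathcal{I}} w_i\, \mathbf{x}_i}
         {\sum_{i \in \mathcal{I}} w_i},
\qquad
\mathcal{I} = \bigl\{\, i : \lVert \mathbf{x}_i - \tilde{\mathbf{x}} \rVert_2 \le \tau \,\bigr\},
```

where $\mathbf{x}_i$ is the back-projected estimate from view $i$, $\tilde{\mathbf{x}}$ the coordinate-wise median of all estimates, $\tau$ an outlier-rejection radius, and $w_i$ a per-view weight (for instance, inverse depth variance). All symbols here are illustrative assumptions.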
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the quantitative evidence and analysis in the manuscript. We address each major comment below and will revise the paper accordingly.
Point-by-point responses
Referee: [Abstract / Experiments] The central claim that the method 'achieves strong performance across multiple benchmarks' is unsupported by quantitative evidence: no success rates, localization error distributions, or direct comparisons to reconstruction-based baselines are given. Aggregate navigation success alone does not verify that the sparse multi-view fusion produces metrically accurate 3D coordinates.
Authors: We agree that the abstract and experiments would benefit from more explicit quantitative support. The current manuscript reports navigation success rates on object-goal navigation benchmarks with comparisons to reconstruction-based methods, but we acknowledge that aggregate success rates alone do not fully isolate the accuracy of the 3D localization step. In the revised version, we will add specific success rates, localization error distributions (e.g., mean and median Euclidean errors), and direct numerical comparisons to baselines to better substantiate the claims about metric accuracy. revision: yes
Referee: [Method / Experiments] No ablation or failure-case analysis is provided for the two load-bearing assumptions: (1) that VLM re-ranking reliably selects views with sufficient parallax and target visibility, and (2) that simple averaging of noisy depth values converges to usable accuracy. Low overlap, partial occlusions, or specular surfaces can break either step, yet only aggregate results are reported.
Authors: We concur that dedicated ablations and failure-case analysis would strengthen the paper. The manuscript currently emphasizes end-to-end navigation performance, but we will add an ablation study examining the contribution of VLM re-ranking (including metrics on selected view quality such as parallax and visibility) and the multi-view fusion step. We will also include a discussion of failure modes under conditions like low overlap, occlusions, and specular surfaces, with qualitative examples where possible. revision: yes
Referee: [Experiments] The manuscript reports only downstream navigation success rates rather than per-query 3D localization error (e.g., mean Euclidean distance to the ground-truth target position). This makes it impossible to isolate whether navigation failures stem from localization inaccuracy or from other pipeline components.
Authors: We accept this point. To allow isolation of localization performance, the revised experiments section will report per-query 3D localization errors, including mean Euclidean distance to ground-truth target positions across queries, along with error distributions. This will complement the existing navigation success rates and clarify the sources of any failures. revision: yes
Circularity Check
No significant circularity detected in the derivation chain.
Full rationale
The paper proposes an engineering pipeline that stores posed RGB-D keyframes as visual memory, retrieves candidates, re-ranks them using an external pre-trained VLM, and computes target locations via standard depth back-projection followed by multi-view averaging. No equations define outputs in terms of themselves, no parameters are fitted to a data subset and then relabeled as predictions, and no load-bearing claims reduce to self-citations or author-imported uniqueness theorems. The central steps rely on independently established geometric operations and on off-the-shelf models whose correctness is external to the paper, so the approach is checked against external benchmarks rather than against its own outputs.