Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

Avideh Zakhor; Nitin Vegesna

arxiv: 2605.23160 · v1 · pith:6SW6PAELnew · submitted 2026-05-22 · 💻 cs.RO · cs.CV

Pith reviewed 2026-05-25 04:38 UTC · model grok-4.3

keywords: semantic exploration · drone mapping · CLIP integration · frontier selection · 3D indoor mapping · open-vocabulary · volumetric exploration · language-conditioned

The pith

SAGE adds CLIP-based semantic cues to drone frontier selection while bounding their influence to preserve full coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAGE as a method for language-conditioned 3D indoor mapping that integrates semantic information from CLIP into an existing volumetric explorer. Four components handle object embeddings, recent observations projected to the boundary, dedicated object frontiers, and a combined planning cost that limits how far semantics can shift priorities. This design aims to accelerate discovery of queried objects without causing the drone to neglect uncovered space. Simulations demonstrate large speedups over a prior semantic baseline and improved object finding compared to the base explorer. Real flights on a quadrotor confirm higher object discovery rates even when overall trajectory length increases slightly.
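The temporal cache is the least familiar of the four components, so a minimal sketch may help. It is an illustration under stated assumptions, not the paper's implementation: the names `Observation` and `TemporalCache`, the 10-second horizon, and the max-over-neighbours query rule are all hypothetical, and the projection of each observation onto the free-unknown boundary is assumed to have happened before `add` is called.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Observation:
    stamp: float        # seconds since mission start
    position: tuple     # (x, y, z) in metres, assumed already projected onto the boundary
    similarity: float   # CLIP similarity of this observation to the text query


class TemporalCache:
    """Keeps only recent observations and answers proximity queries (hypothetical design)."""

    def __init__(self, horizon_s=10.0):
        self.horizon_s = horizon_s      # assumed retention window
        self._items = deque()           # observations in arrival order

    def _evict(self, now):
        # Drop observations older than the retention window.
        while self._items and now - self._items[0].stamp > self.horizon_s:
            self._items.popleft()

    def add(self, obs):
        # Assumes observations arrive with non-decreasing timestamps.
        self._items.append(obs)
        self._evict(obs.stamp)

    def query(self, centroid, radius, now):
        """Best cached similarity within `radius` of a frontier cluster centroid."""
        self._evict(now)
        near = [o.similarity for o in self._items
                if math.dist(o.position, centroid) <= radius]
        return max(near, default=0.0)


cache = TemporalCache(horizon_s=10.0)
cache.add(Observation(0.0, (0.0, 0.0, 0.0), 0.9))
cache.add(Observation(8.0, (1.0, 0.0, 0.0), 0.4))
print(cache.query((0.0, 0.0, 0.0), radius=2.0, now=9.0))
```

Taking the maximum is one possible aggregation; a mean or a distance-weighted sum would be equally plausible given what the abstract states.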

Core claim

SAGE preserves coverage-oriented behavior in unknown 3D indoor environments while allowing semantic cues from CLIP to reprioritize frontier selection through four integrated components: object-centric embedding storage, a temporal cache projecting recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost that bounds semantic reweighting influence.
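Object-centric embedding storage and object frontiers can be pictured with a toy example. The sketch below assumes that each mapped object carries one embedding vector and that an object becomes an object frontier when its cosine similarity to the query embedding passes a threshold. The class `ObjectStore`, the 3-dimensional vectors, and the 0.7 threshold are hypothetical placeholders; real CLIP embeddings have hundreds of dimensions and typically much lower raw similarity values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ObjectStore:
    """Maps object ids to a position and an embedding (hypothetical structure)."""

    def __init__(self):
        self._objects = {}

    def upsert(self, object_id, position, embedding):
        self._objects[object_id] = (position, embedding)

    def object_frontiers(self, query_embedding, threshold):
        """Objects whose similarity to the query meets the threshold, best first."""
        hits = []
        for object_id, (position, embedding) in self._objects.items():
            score = cosine_similarity(embedding, query_embedding)
            if score >= threshold:
                hits.append((object_id, position, score))
        return sorted(hits, key=lambda hit: hit[2], reverse=True)


store = ObjectStore()
store.upsert("a", (1.0, 2.0, 0.0), [1.0, 0.0, 0.0])
store.upsert("b", (3.0, 1.0, 0.0), [0.0, 1.0, 0.0])
store.upsert("c", (0.0, 0.0, 1.0), [1.0, 1.0, 0.0])

# Toy query vector standing in for a CLIP text embedding.
for object_id, position, score in store.object_frontiers([1.0, 0.0, 0.0], threshold=0.7):
    print(object_id, position, round(score, 3))
```

Because embeddings are stored per object in this sketch, scoring against a different query only requires a new query vector; whether SAGE itself re-scores this way is not stated in the abstract.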

What carries the argument

The unified semantic-geometric planning cost with bounded reweighting, which combines semantic similarity with geometric coverage terms so that language cues can elevate certain frontiers without eliminating coverage-driven selection.

If this is right

SAGE completes exploration 9.0 to 25.9 times faster than FTU across nine shared map-query pairs, with a mean speedup of 13.7×.
In Matterport3D simulations SAGE finds more queried objects than the base explorer and a semantic-only ablation.
SAGE produces higher volumetric throughput than FTU.
In five real-world flights SAGE discovers more objects than the base explorer even though the base explorer finishes faster with shorter trajectories.
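To make the speedup figures concrete, here is a small sketch of how per-pair and mean speedups are usually computed. It assumes the reported mean is the arithmetic mean of per-pair ratios, which the abstract does not state. The completion times are made-up placeholders chosen for easy arithmetic; they are not taken from the paper.

```python
from statistics import mean


def speedups(baseline_times, method_times):
    """Per-pair speedup: baseline completion time divided by method completion time."""
    return [b / m for b, m in zip(baseline_times, method_times)]


# Placeholder completion times in seconds for three hypothetical map-query pairs.
baseline_s = [300.0, 400.0, 1200.0]
method_s = [100.0, 100.0, 200.0]

per_pair = speedups(baseline_s, method_s)            # [3.0, 4.0, 6.0]
mean_of_ratios = mean(per_pair)                      # average of the per-pair ratios
ratio_of_totals = sum(baseline_s) / sum(method_s)    # total time saved over all pairs

print(per_pair, round(mean_of_ratios, 2), round(ratio_of_totals, 2))
```

The two summaries differ whenever run lengths vary across pairs, so a reader comparing against the reported 13.7× would need to know which one the paper uses.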

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bounded-reweighting approach could be tested on ground robots or aerial manipulators where semantic priorities must coexist with coverage needs.
If the temporal cache proves robust, similar projection of recent observations might reduce redundant flights in repeated environments.
The method suggests a route for language-guided search tasks where the query changes mid-mission without restarting the entire coverage plan.

Load-bearing premise

The four CLIP components can be combined so that semantic cues shift frontier order without causing the drone to leave large portions of the environment unexplored.

What would settle it

A controlled run in which SAGE explores less total volume than the base explorer on the same map while still reporting higher object discovery rates, or a failure to match the reported 9.0× to 25.9× speedup range on new map-query pairs.
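The two checks described here can be written down as simple predicates. This is a sketch only: the function names and the 5% volume tolerance are assumptions introduced for illustration, and the range bounds are the figures quoted from the abstract.

```python
def coverage_regressed(sage_volume_m3, base_volume_m3, tolerance=0.05):
    """True if SAGE mapped noticeably less volume than the base explorer.

    `tolerance` is an assumed allowance for run-to-run noise, not a value from the paper.
    """
    return sage_volume_m3 < (1.0 - tolerance) * base_volume_m3


def speedup_in_reported_range(speedup, low=9.0, high=25.9):
    """True if a measured speedup falls inside the range quoted in the abstract."""
    return low <= speedup <= high


# Example with placeholder measurements from a hypothetical new map-query pair.
print(coverage_regressed(sage_volume_m3=180.0, base_volume_m3=200.0))
print(speedup_in_reported_range(7.5))
```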

Figures

Figures reproduced from arXiv: 2605.23160 by Avideh Zakhor, Nitin Vegesna.

**Figure 3.** Illustration of temporal-cache queries at a frontier clus…

**Figure 4.** Top-down representative trajectory comparison on Map 1, …

**Figure 5.** Top-down view of objects with CLIP similarity …

**Figure 6.** Top-down hardware flight trajectories with approximate …

Original abstract

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAGE bolts four CLIP pieces onto FALCON and claims big speedups over FTU, but the abstract gives no ablations or cost details to check if coverage holds.

The paper's main point is a specific integration of CLIP into the existing FALCON volumetric explorer for language-conditioned indoor drone mapping. It uses object-centric embedding storage, a temporal cache projecting observations to the free-unknown boundary, object frontiers, and a unified semantic-geometric cost that bounds the reweighting term. In simulation on Matterport3D it reports 9.0-25.9x faster completion than FTU across nine map-query pairs (mean 13.7x) plus higher volumetric throughput, and in five real flights on a Starling 2 it finds more objects than FALCON though FALCON finishes faster overall.

The real hardware runs and the direct baseline comparisons are the parts that stand up on their own terms. The bounded cost is presented as the mechanism that lets semantics reprioritize without breaking coverage behavior.

The soft spot is exactly where the stress-test note lands: the abstract states the bound exists and preserves total coverage but supplies no formulation, no proof, and no ablation showing mapped volume or frontier coverage stays comparable to plain FALCON when the semantic term is active. No experimental protocol, error bars, dataset details, or statistical tests appear either. Without those, the speedup numbers are hard to interpret as more than preliminary.

This is for people already working on frontier-based 3D mapping who want to add open-vocabulary guidance. A reader who needs a working example of how to cache and fuse CLIP scores into an existing planner could extract the architecture, but anyone expecting rigorous verification of the coverage claim will have to wait for the full methods. I would send it to peer review once the authors add the cost equation, the boundedness argument, and coverage ablations, because the hardware demo and the FTU comparison are concrete enough to be worth referee time even if the current version needs expansion.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SAGE, a semantic-aware extension to the FALCON drone exploration system that incorporates CLIP embeddings through object-centric storage, a temporal cache, object frontiers, and a unified semantic-geometric planning cost with bounded reweighting. The central claims are that this integration allows semantic cues to accelerate object discovery in language-conditioned tasks while preserving overall coverage behavior, demonstrated by 9.0-25.9× speedups (mean 13.7×) over FTU in simulation on nine map-query pairs from Matterport3D, higher volumetric throughput, and superior object discovery compared to FALCON in five real-world flights despite longer trajectories.

Significance. Should the bounded-reweighting mechanism be shown to maintain coverage and the experimental results be supported by full protocols, this would represent a meaningful advance in integrating open-vocabulary semantics into efficient 3D mapping for robotics. To its credit, the work includes real-world validation on a Modal AI Starling 2 platform and direct empirical comparisons to established baselines such as FALCON and FTU.

major comments (3)

[Abstract] The assertion that the unified semantic-geometric planning cost 'bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage' is central to the contribution but is stated without an accompanying equation, derivation, or reference to a specific formulation in the methods.
[Results (simulation and real-world)] The reported speedups and object-discovery gains are presented without accompanying details on experimental protocol, number of trials, error bars, statistical tests, or how the nine map-query pairs were selected, which are necessary to evaluate the strength of the quantitative claims.
[Methods (CLIP integration components)] No ablation study is described that tests whether the four components, when combined with the bounded cost, produce mapped volumes or frontier coverage statistics comparable to the base FALCON system, leaving the weakest assumption unverified.

minor comments (2)

[Notation] Clarify the definition of 'object frontiers' and how they differ from standard frontiers in FALCON.
[Figure captions] Ensure all figures include axis labels, legends, and scale information for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the bounded cost, experimental protocols, and verification of coverage preservation.

Referee: [Abstract] The assertion that the unified semantic-geometric planning cost 'bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage' is central to the contribution but is stated without an accompanying equation, derivation, or reference to a specific formulation in the methods.

Authors: We agree the bounding mechanism is central and should be formalized. The methods section defines the cost as a convex combination C = (1 - α)·C_geom + α·C_sem with α clipped to [0, β], where β is a fixed bound (β = 0.3 in experiments) chosen so the semantic term cannot dominate frontier selection. We will insert the explicit equation, the clipping derivation, and a short proof sketch showing that total coverage is preserved because the weight on the geometric term, 1 - α ≥ 1 - β, remains strictly positive. Revision: yes.

Referee: [Results (simulation and real-world)] The reported speedups and object-discovery gains are presented without accompanying details on experimental protocol, number of trials, error bars, statistical tests, or how the nine map-query pairs were selected, which are necessary to evaluate the strength of the quantitative claims.

Authors: We will expand the results section with a dedicated experimental protocol subsection. The nine map-query pairs were selected by enumerating all Matterport3D scenes containing at least one instance of each queried object class; each pair was run once under identical initial conditions and sensor noise models. Real-world results comprise five independent flights (three in one environment, two in another). Because the simulator is deterministic, error bars are not applicable; we will report raw per-pair speedups and note that a non-parametric test is unnecessary for the deterministic comparison. We will also add the exact selection criteria and flight logs. Revision: yes.

Referee: [Methods (CLIP integration components)] No ablation study is described that tests whether the four components, when combined with the bounded cost, produce mapped volumes or frontier coverage statistics comparable to the base FALCON system, leaving the weakest assumption unverified.

Authors: The manuscript already includes a semantic-only ablation (SAGE without the geometric term) and a direct comparison to FALCON. To directly verify the coverage claim, we will add a new table in the results section reporting final mapped volume and frontier coverage percentage for full SAGE versus base FALCON across the nine simulation environments. This will show whether the bounded cost yields comparable coverage while improving object discovery. Revision: yes.
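The cost form quoted in the first response can be sketched in a few lines. This follows the expression given in the simulated rebuttal (a convex combination with the semantic weight clipped to a bound of 0.3), which the abstract itself does not confirm. The choice of semantic cost as one minus similarity, the frontier names, and all numeric inputs are assumptions made for illustration.

```python
def bounded_cost(c_geom, c_sem, alpha, beta=0.3):
    """Convex combination of geometric and semantic cost with alpha clipped to [0, beta]."""
    alpha = min(max(alpha, 0.0), beta)
    return (1.0 - alpha) * c_geom + alpha * c_sem


def rank_frontiers(frontiers, alpha, beta=0.3):
    """Order frontiers by ascending combined cost.

    `frontiers` holds (name, geometric_cost, similarity) tuples. The semantic
    cost is taken as 1 - similarity, which is an assumption of this sketch.
    """
    scored = [(bounded_cost(c_geom, 1.0 - sim, alpha, beta), name)
              for name, c_geom, sim in frontiers]
    return [name for _, name in sorted(scored)]


# Hypothetical frontiers: (name, normalized geometric cost, query similarity).
frontiers = [
    ("near_wall", 0.40, 0.0),
    ("far_door", 0.50, 0.9),
    ("very_far", 0.95, 1.0),
]

print(rank_frontiers(frontiers, alpha=0.0))   # geometry only
print(rank_frontiers(frontiers, alpha=1.0))   # requested weight is clipped to beta
```

In this toy case the clipped semantic weight moves `far_door` ahead of `near_wall`, while `very_far` stays last despite a perfect similarity score, because the geometric term always keeps a weight of at least 1 - β = 0.7. That illustrates the bounded-influence idea; it does not by itself show that total coverage is preserved.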

Circularity Check

0 steps flagged

No circularity; empirical results against external baselines

The paper describes a system (SAGE) built on FALCON with four CLIP integration components and a unified cost function that asserts bounded reweighting. All reported performance metrics (speedups of 9.0-25.9x over FTU, object discovery comparisons) are direct empirical measurements on fixed simulation datasets (Matterport3D) and real-world flights, not derived from equations or parameters fitted to the same outputs. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The bounded-reweighting claim is a design assertion without shown equations, but this is a verification gap rather than circularity. Derivation chain is self-contained via external baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The system rests on standard computer-vision and robotics assumptions plus one new architectural choice; no new physical entities or fitted constants are introduced in the abstract.

free parameters (1)

semantic reweighting bound
The unified cost explicitly bounds semantic influence to preserve coverage; the concrete value or selection procedure is not stated in the abstract.

axioms (1)

domain assumption: CLIP embeddings reliably indicate object semantic similarity for indoor scenes
Used to populate object-centric storage, project observations, and define object frontiers.

invented entities (1)

object frontiers (no independent evidence)
purpose: High-similarity detections used to reprioritize exploration
New frontier type introduced by the system; no independent evidence supplied.


[1] [1]

ScanQA: 3D question answering for spa- tial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Mo- toaki Kawanabe. ScanQA: 3D question answering for spa- tial scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19129–19139, 2022. 2

work page 2022

[2] [2]

FindAnything: Open-V ocabulary and Object- Centric Mapping for Robot Exploration in Any Environ- ment.arXiv preprint arXiv:2504.08603, 2025

Sebasti ´an Barbas Laina, Simon Boche, Sotiris Pap- atheodorou, Simon Schaefer, Jaehyung Jung, and Stefan Leutenegger. FindAnything: Open-V ocabulary and Object- Centric Mapping for Robot Exploration in Any Environ- ment.arXiv preprint arXiv:2504.08603, 2025. 1, 2

work page arXiv 2025

[3] [3]

A multi-resolution frontier-based planner for autonomous 3D exploration.IEEE Robotics and Automation Letters, 6(4):7922–7929, 2021

Ana Batinovi ´c, Tamara Petrovi´c, Antun Ivanovic, Frano Pet- ric, and Stjepan Bogdan. A multi-resolution frontier-based planner for autonomous 3D exploration.IEEE Robotics and Automation Letters, 6(4):7922–7929, 2021. 2

work page 2021

[4] [4]

next- best-view

Andreas Bircher, Mina Kamel, Kostas Alexis, Helen Oleynikova, and Roland Siegwart. Receding horizon “next- best-view” planner for 3D exploration. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1462–1468, 2016. 2

work page 2016

[5] [5]

Matterport3D: Learning from RGB- D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB- D Data in Indoor Environments. InProceedings of the IEEE International Conference on 3D Vision (3DV), pages 667– 676, 2017. 5

work page 2017

[6] [6]

Object Goal Navigation using Goal-Oriented Semantic Exploration

Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object Goal Navigation using Goal-Oriented Semantic Exploration. InAdvances in Neural Information Processing Systems, 2020. 2

work page 2020

[7] [7]

Fast frontier-based information-driven autonomous exploration with an MA V

Anna Dai, Sotiris Papatheodorou, Nils Funk, Dimos Tzoumanikas, and Stefan Leutenegger. Fast frontier-based information-driven autonomous exploration with an MA V. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 9570–9576, 2020. 2

work page 2020

[8] [8]

Qianli Dong, Xuebo Zhang, Shiyong Zhang, Ziyu Wang, Zhe Ma, and Haobo Xi. EDEN: Efficient dual-layer ex- ploration planning for fast UA V autonomous exploration in large 3-D environments.IEEE Transactions on Indus- trial Electronics, 73(5):7296–7306, 2026. Also available as arXiv:2506.05106. 2

work page arXiv 2026

[9] [9]

A frontier-void- based approach for autonomous exploration in 3D

Christian Dornhege and Alexander Kleiner. A frontier-void- based approach for autonomous exploration in 3D. InPro- ceedings of the IEEE International Symposium on Safety, Se- curity, and Rescue Robotics (SSRR), pages 1–6, 2011. 2

work page 2011

[10] [10]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakub Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representati...

work page 2021

[11] [11]

OpenVINS: A research platform for visual-inertial estimation

Patrick Geneva, Kevin Eckenhoff, Woosik Lee, Yulin Yang, and Guoquan Huang. OpenVINS: A research platform for visual-inertial estimation. InProceedings of the IEEE In- ternational Conference on Robotics and Automation (ICRA), Paris, France, 2020. 6

work page 2020

[12] [12]

Fan, Matteo Palieri, Mykel J

Muhammad Fadhil Ginting, Sung-Kyun Kim, David D. Fan, Matteo Palieri, Mykel J. Kochenderfer, and Ali-akbar Agha- mohammadi. SEEK: Semantic reasoning for object goal navigation in real world inspection tasks. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. 2

[13]

Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering. In Proceedings of the International Conference on Neuro-symbolic Systems, pages 22–35. PMLR, 2025. 2

[14]

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2

[15]

Keld Helsgaun. An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000. 5

[16]

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023. 2

[17]

Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. ConceptFusion: Open-set Multimodal 3D Mapping. In Proceedings of Robotics...

[18]

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4), 2023. 2

[19]

Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha, Wenshan Wang, Yonatan Bisk, and Sebastian Scherer. RAVEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation. arXiv preprint arXiv:2509.23563, 2025. 1, 2

[20]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 12888–12900, 2022. 2

[21]

Keiko Nagami, Timothy Chen, Javier Yu, Ola Shorinwa, Maximilian Adang, Carlyn Dougherty, Eric Cristofalo, and Mac Schwager. VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting. arXiv preprint arXiv:2507.01125, 2025. 2

[22]

Sotiris Papatheodorou, Nils Funk, Dimos Tzoumanikas, Christopher Choi, Binbin Xu, and Stefan Leutenegger. Finding things in the unknown: Semantic object-centric exploration with an MAV. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3640–3646, 2023. 1, 2, 5

[23]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 2, 3

[24]

Sonia Raychaudhuri and Angel X. Chang. Semantic mapping in indoor embodied AI: A survey on advances, challenges, and future directions. Transactions on Machine Learning Research, 2025. arXiv:2501.05750. 2

[25]

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 2

[26]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015. 2

[27]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019. 2

[28]

Yuezhan Tao, Dexter Ong, Varun Murali, Igor Spasojevic, Pratik Chaudhari, and Vijay Kumar. RT-GuIDE: Real-Time Gaussian splatting for Information-Driven Exploration. arXiv preprint arXiv:2409.18122 [cs.RO], 2024. 2

[29]

Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, USA, 2005. 2

[30]

Heng Wang, Chaoyi Zhang, Jianhui Yu, and Weidong Cai. Spatiality-guided transformer for 3D dense captioning on point clouds. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 1393–1400, 2022. 2

[31]

Brian Yamauchi. A frontier-based approach for autonomous exploration. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pages 146–151, 1997. 1, 2

[32]

Joel Ye, Dhruv Batra, Abhishek Das, and Erik Wijmans. Auxiliary Tasks and Exploration Enable ObjectGoal Navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16117–16126, 2021. 2

[33]

Shuquan Ye, Dongdong Chen, Songfang Han, and Jing Liao. 3D Question Answering. IEEE Transactions on Visualization and Computer Graphics, pages 1–16, 2021. 2

[34]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 2

[35]

Yichen Zhang, Xinyi Chen, Chen Feng, Boyu Zhou, and Shaojie Shen. FALCON: Fast Autonomous Aerial Exploration using Coverage Path Guidance. IEEE Transactions on Robotics, 41:1365–1385, 2024. 1, 2, 5

[36]

Boyu Zhou, Yichen Zhang, Xinyi Chen, and Shaojie Shen. FUEL: Fast UAV exploration using incremental frontier structure and hierarchical planning. IEEE Robotics and Automation Letters, 6(2):779–786, 2021. 2
