pith. sign in

arxiv: 2605.19958 · v1 · pith:7P3WJSNUnew · submitted 2026-05-19 · 💻 cs.RO

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

Pith reviewed 2026-05-20 04:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords cross-floor navigationtraversability-aware planningzero-shot object navigationvolumetric mappingembodied explorationhierarchical planneropen-vocabulary searchmulti-floor indoor environments
0
0 comments X

The pith

TravExplorer couples zero-shot semantic guidance with traversability-aware 3-D planning to search for open-vocabulary targets across multiple floors in unseen buildings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TravExplorer as a framework for cross-floor embodied exploration that overcomes single-floor and planar assumptions in existing zero-shot object navigation systems. It builds and maintains a unified volumetric map from online sensor data that separates occupied structures from reachable support surfaces such as floors, stairs, and landings. Traversable frontiers are extracted from connected support surfaces, and a hierarchical planner uses semantic and geometric memories to tour object hypotheses while generating executable cross-floor motions. A lightweight guidance module reduces latency by aligning instance maps with spatial value maps. Large-scale simulations and real-world tests on a quadruped robot show advantages over standard baselines in both single-floor and multi-floor indoor settings without prior maps.

Core claim

TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces including floors, stairs, and landings. It aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching, then applies a hierarchical planner for target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, generating motions through foothold-guided 3-D search and vertically constrained local trajectory optimization.

What carries the argument

The unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces, together with the hierarchical planner that performs target-aware frontier touring and foothold-guided 3-D search.

If this is right

  • Robots gain the ability to traverse stairs and landings in multi-level indoor spaces using only online perception.
  • Open-vocabulary target search becomes feasible across single-floor and cross-floor environments without prior maps or human intervention.
  • Hierarchical planning over semantic and geometric memories produces consistent gains over representative ObjectNav baselines in thousands of simulated episodes.
  • FOV-aware active perception resolves incomplete observations that arise during vertical movement between floors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same traversability distinction could be tested for navigation on outdoor slopes or ramps with elevation changes.
  • Replacing the current segmentation module with higher-accuracy vision-language models might further lower semantic-reasoning delays.
  • The foothold-guided local search could be evaluated on legged platforms in warehouses containing mezzanines and catwalks.
  • Scaling the volumetric map to taller structures would test whether frontier extraction remains efficient as the number of floors increases.

Load-bearing premise

Online sensor data alone suffices to reliably separate occupied structures from robot-reachable support surfaces such as stairs and landings in real buildings.

What would settle it

A sequence of real-building trials in which the robot repeatedly misclassifies stairs as obstacles or fails to reach connected landings from sensor data would show the volumetric map cannot support the claimed cross-floor capability.

Figures

Figures reproduced from arXiv: 2605.19958 by Han Zheng, Haoran Liu, Jinghao Wang, Ming Yang, Tong Qin, Yudong Huang, Zhe Chen.

Figure 1
Figure 1. Figure 1: Motivating scenario of TravExplorer. Given an open-vocabulary object [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: System overview of TravExplorer. The quadruped robot receives RGB-D and odometry observations and incrementally maintains volumetric occupancy, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 3-D traversable frontier representation. Blue, gray, and inflated voxels [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Traversability-aware spatial value map on a staircase. Image-text [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-floor exploration process. (a) The robot initializes the map and frontier candidates from early RGB-D observations. (b) TravExplorer expands the [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance comparison on HM3D across single-floor, multi-floor, and [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative mapping results in six representative indoor scenes. Each column shows the original scene geometry, the reconstructed traversability map, and [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison of frontier representations. (a) TravExplorer [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Mapping ablation on exploration efficiency. TravExplorer reaches [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Real-world deployment results of TravExplorer in four representative indoor environments. The first row shows deployment in a multi-floor villa, where [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗
read the original abstract

Zero-shot Object Navigation (ZSON) has shown promise for open-vocabulary target search in unseen environments, yet most existing systems remain tied to planar representations and single-floor assumptions. These assumptions become inadequate in real buildings, where navigation involves floors, stairs, landings, and vertically overlapping spaces. This article presents TravExplorer, a cross-floor embodied exploration framework that couples zero-shot semantic guidance with traversability-aware 3-D planning. TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces, including floors, stairs, and landings. A FOV-aware active perception strategy further resolves incomplete observations during cross-floor traversal. To reduce semantic-reasoning latency, a lightweight guidance module aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching. Based on these geometric and semantic memories, a hierarchical planner performs target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, and generates executable cross-floor motions through foothold-guided 3-D search and vertically constrained local trajectory optimization. Experiments over 4,195 simulated episodes on HM3D and MP3D demonstrate consistent advantages over representative ObjectNav baselines. Fifty real-world trials on a Unitree Go2 further validate open-vocabulary target search across single-floor and cross-floor indoor environments without prior maps or human intervention. The code will be released at https://github.com/wuyi2121/TravExplorer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes TravExplorer, a cross-floor embodied exploration framework for zero-shot object navigation in unseen multi-floor indoor environments. It couples open-vocabulary semantic guidance with traversability-aware 3-D planning by maintaining a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces (floors, stairs, landings) and extracts traversable frontiers. Additional components include FOV-aware active perception, a lightweight guidance module aligning probabilistic instance maps with spatial value maps, and a hierarchical planner performing target-aware frontier touring with foothold-guided 3-D search and vertically constrained trajectory optimization. The work reports performance advantages over ObjectNav baselines in 4,195 simulated episodes on HM3D and MP3D, plus validation via 50 real-world trials on a Unitree Go2 robot without prior maps.

Significance. If the traversability mapping and planning components prove reliable, the framework would address a clear limitation in existing ZSON systems by enabling practical navigation across floors and vertical spaces in real buildings. The integration of geometric memory with open-vocabulary semantics and the commitment to code release are strengths that support reproducibility and extension by the community.

major comments (3)
  1. The experimental results section reports consistent advantages over baselines across 4,195 simulated episodes but provides no error bars, ablation studies on the volumetric map or planner components, or details on episode selection and success criteria. This weakens the ability to evaluate the internal validity of the quantitative claims.
  2. The real-world validation with 50 Unitree Go2 trials is presented as evidence for cross-floor open-vocabulary search, yet no per-component metrics are given on traversability classification accuracy (e.g., precision/recall for reachable vs. occupied voxels) under real sensor noise and partial observations. This is load-bearing for the central claim that the unified volumetric map enables reliable frontier extraction without prior maps.
  3. In the method description of the unified volumetric map and frontier extraction, the approach to distinguishing robot-reachable support surfaces (including stairs and landings) from occupied structures relies on online sensor data alone; however, the handling of depth/LiDAR noise, occlusions, and varying geometries is not specified in sufficient detail to assess robustness.
minor comments (2)
  1. The abstract states 'consistent advantages' without naming the specific metrics (e.g., success rate, SPL, or navigation efficiency) used for comparison with ObjectNav baselines.
  2. Figure captions and method diagrams would benefit from clearer labeling of the hierarchical planner stages and how traversable frontiers connect across floors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned for the next version.

read point-by-point responses
  1. Referee: The experimental results section reports consistent advantages over baselines across 4,195 simulated episodes but provides no error bars, ablation studies on the volumetric map or planner components, or details on episode selection and success criteria. This weakens the ability to evaluate the internal validity of the quantitative claims.

    Authors: We agree that error bars, ablation studies, and clearer experimental details would strengthen the evaluation. In the revised manuscript we will add standard error bars to the success rate and SPL results. We will also include ablation studies that remove the traversability mapping and the hierarchical planner in turn. Episode selection details will be expanded to note random sampling from the HM3D and MP3D validation splits using the standard ObjectNav criteria of reaching within 0.1 m of the target within 500 steps. revision: yes

  2. Referee: The real-world validation with 50 Unitree Go2 trials is presented as evidence for cross-floor open-vocabulary search, yet no per-component metrics are given on traversability classification accuracy (e.g., precision/recall for reachable vs. occupied voxels) under real sensor noise and partial observations. This is load-bearing for the central claim that the unified volumetric map enables reliable frontier extraction without prior maps.

    Authors: We acknowledge that voxel-level precision/recall metrics on traversability would provide stronger quantitative support. Our real-world evaluation centered on end-to-end task success across the 50 trials. Obtaining accurate ground-truth voxel labels for reachable versus occupied space in unmapped real environments proved impractical during the study. In revision we will explicitly discuss this limitation and note that the observed successful stair and landing traversals offer qualitative evidence of map utility, while suggesting such granular metrics as future work. revision: partial

  3. Referee: In the method description of the unified volumetric map and frontier extraction, the approach to distinguishing robot-reachable support surfaces (including stairs and landings) from occupied structures relies on online sensor data alone; however, the handling of depth/LiDAR noise, occlusions, and varying geometries is not specified in sufficient detail to assess robustness.

    Authors: We thank the referee for highlighting this gap. The revised method section will add explicit descriptions of noise mitigation via probabilistic occupancy updates and voxel filtering, multi-view temporal integration to address occlusions, and geometric heuristics (surface normal consistency and height-continuity checks) used to classify support surfaces. These mechanisms are intended to remain robust across varying stair geometries and partial observations. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical system validated against external benchmarks

full rationale

The paper describes an engineering framework for cross-floor navigation using a unified volumetric map, traversability-aware planning, and hierarchical target-aware touring. Claims rest on 4,195 simulated episodes across HM3D/MP3D and 50 real-world Unitree Go2 trials, framed as direct empirical comparisons to ObjectNav baselines. No equations, fitted parameters, or derivations appear that reduce outputs to inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing uniqueness theorems. The central mapping and planning components are presented as independent contributions tested on standard datasets, making the reported advantages self-contained against those external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full manuscript required to populate the ledger.

pith-pipeline@v0.9.0 · 5815 in / 1086 out tokens · 35826 ms · 2026-05-20T04:59:13.062377+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 2 internal anchors

  1. [1]

    CoWs on PASTURE: Baselines and benchmarks for language-driven zero-shot object navigation,

    S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “CoWs on PASTURE: Baselines and benchmarks for language-driven zero-shot object navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 23 171– 23 181

  2. [2]

    L3MVN: Leveraging large language models for visual target navigation,

    B. Yu, H. Kasaei, and M. Cao, “L3MVN: Leveraging large language models for visual target navigation,” in2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 3554– 3560

  3. [3]

    VLFM: Vision- language frontier maps for zero-shot semantic navigation,

    N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “VLFM: Vision- language frontier maps for zero-shot semantic navigation,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 42–48

  4. [4]

    SG-Nav: Online 3d scene graph prompting for llm-based zero-shot object navigation,

    H. Yin, X. Xu, Z. Wu, J. Zhou, and J. Lu, “SG-Nav: Online 3d scene graph prompting for llm-based zero-shot object navigation,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37, 2024

  5. [5]

    ApexNav: An adaptive exploration strategy for zero-shot object navigation with target- centric semantic fusion,

    M. Zhang, Y. Du, C. Wu, J. Zhou, Z. Qi, J. Ma, and B. Zhou, “ApexNav: An adaptive exploration strategy for zero-shot object navigation with target- centric semantic fusion,”CoRR, vol. abs/2504.14478, 2025

  6. [6]

    STRIVE: Structured representation integrating VLM reasoning for efficient object navigation,

    H. Zhu, Z. Li, Z. Liu, W. Wang, J. Zhang, J. Francis, and J. Oh, “STRIVE: Structured representation integrating VLM reasoning for efficient object navigation,”CoRR, vol. abs/2505.06729, 2025

  7. [7]

    SysNav: Multi-level systematic cooperation enables real-world, cross-embodiment object navigation,

    H. Zhu, Z. Li, Z. Liu, K. Guo, Z. Lin, Y. Cai, G. Chen, C. Lv, W. Wang, J. Oh, and J. Zhang, “SysNav: Multi-level systematic cooperation enables real-world, cross-embodiment object navigation,”CoRR, vol. abs/2603.06914, 2026

  8. [8]

    USS-Nav: Unified spatio-semantic scene graph for lightweight UA V zero-shot object navigation,

    W. Gai, Y. Gao, Y. Zhou, Y. Xie, Z. Liu, Y. Wu, X. Zhou, F. Gao, and Z. Meng, “USS-Nav: Unified spatio-semantic scene graph for lightweight UA V zero-shot object navigation,”CoRR, vol. abs/2602.00708, 2026

  9. [9]

    Multi-floor zero-shot object navigation policy,

    L. Zhang, H. Wang, E. Xiao, X. Zhang, Q. Zhang, Z. Jiang, and R. Xu, “Multi-floor zero-shot object navigation policy,”CoRR, vol. abs/2409.10906, 2024

  10. [10]

    Stairway to success: An online floor-aware zero-shot object-goal navigation framework via LLM-driven coarse-to-fine exploration,

    Z. Gong, R. Li, T. Hu, R. Qiu, L. Kong, L. Zhang, G. Zhao, Y. Ding, and J. Liang, “Stairway to success: An online floor-aware zero-shot object-goal navigation framework via LLM-driven coarse-to-fine exploration,”IEEE Robotics and Automation Letters, 2026

  11. [11]

    Efficient global navigational planning in 3-D structures based on point cloud tomography,

    B. Yang, J. Cheng, B. Xue, J. Jiao, and M. Liu, “Efficient global navigational planning in 3-D structures based on point cloud tomography,” IEEE/ASME Transactions on Mechatronics, vol. 30, no. 1, pp. 321–332, 2025

  12. [12]

    A frontier-based approach for autonomous exploration,

    B. Yamauchi, “A frontier-based approach for autonomous exploration,” inProceedings of the 1997 IEEE International Symposium on Com- putational Intelligence in Robotics and Automation (CIRA), 1997, pp. 146–151

  13. [13]

    Efficient dense frontier detection for 2-D graph SLAM based on occupancy grid submaps,

    J. Orsulic, D. Miklic, and Z. Kovacic, “Efficient dense frontier detection for 2-D graph SLAM based on occupancy grid submaps,”IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3569–3576, 2019

  14. [14]

    FAEL: Fast autonomous exploration for large-scale environments with a mobile robot,

    J. Huang, B. Zhou, Z. Fan, Y. Zhu, Y. Jie, L. Li, and H. Cheng, “FAEL: Fast autonomous exploration for large-scale environments with a mobile robot,”IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1667– 1674, 2023

  15. [15]

    Rapid autonomous exploration of large-scale environments for ground robots based on region partitioning,

    Z. Wen, X. Liu, G. Lu, and J. Liu, “Rapid autonomous exploration of large-scale environments for ground robots based on region partitioning,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 13 067–13 073

  16. [16]

    FUEL: Fast UA V exploration using incremental frontier structure and hierarchical planning,

    B. Zhou, Y. Zhang, X. Chen, and S. Shen, “FUEL: Fast UA V exploration using incremental frontier structure and hierarchical planning,”IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 779–786, 2021

  17. [17]

    FALCON: Fast autonomous aerial exploration using coverage path guidance,

    Y. Zhang, X. Chen, C. M. Feng, B. Zhou, and S. Shen, “FALCON: Fast autonomous aerial exploration using coverage path guidance,”IEEE Transactions on Robotics, vol. 41, pp. 1365–1385, 2025

  18. [18]

    3D active metric-semantic SLAM,

    Y. Tao, X. Liu, I. Spasojevic, S. Agarwal, and V. Kumar, “3D active metric-semantic SLAM,”IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2989–2996, 2024

  19. [19]

    EPIC: A lightweight LiDAR- based UA V exploration framework for large-scale scenarios,

    S. Geng, Z. Ning, F. Zhang, and B. Zhou, “EPIC: A lightweight LiDAR- based UA V exploration framework for large-scale scenarios,”IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 5090–5097, 2025

  20. [20]

    FLARE: Fast large-scale autonomous exploration guided by unknown regions,

    X. Liu, M. Lin, S. Li, G. Xu, Z. Wang, H. Wu, and Y. Liu, “FLARE: Fast large-scale autonomous exploration guided by unknown regions,”IEEE Robotics and Automation Letters, vol. 10, no. 11, pp. 12 197–12 204, 2025

  21. [21]

    SMUG planner: A safe multi-goal planner for mobile robots in challenging environments,

    C. Chen, J. Frey, P. Arm, and M. Hutter, “SMUG planner: A safe multi-goal planner for mobile robots in challenging environments,”IEEE Robotics and Automation Letters, vol. 8, no. 11, pp. 7170–7177, 2023

  22. [22]

    LRAE: Large- region-aware safe and fast autonomous exploration of ground robots for uneven terrains,

    Q. Bi, X. Zhang, S. Zhang, R. Wang, L. Li, and J. Yuan, “LRAE: Large- region-aware safe and fast autonomous exploration of ground robots for uneven terrains,”IEEE Robotics and Automation Letters, vol. 9, no. 12, pp. 11 186–11 193, 2024

  23. [23]

    Portable planner for enhancing ground robots exploration performance in unstructured environments,

    Y. Jia, W. Tang, H. Sun, J. Yang, B. Liu, and C. Wang, “Portable planner for enhancing ground robots exploration performance in unstructured environments,”IEEE Robotics and Automation Letters, vol. 9, no. 11, pp. 9295–9302, 2024

  24. [24]

    AAGE: Air- assisted ground robotic autonomous exploration in large-scale unknown environments,

    L. Zheng, M. Wei, R. Mei, K. Xu, J. Huang, and H. Cheng, “AAGE: Air- assisted ground robotic autonomous exploration in large-scale unknown environments,”IEEE Transactions on Robotics, vol. 41, pp. 1918–1937, 2025

  25. [25]

    Probabilistic terrain mapping for mobile robots with uncertain localization,

    P. Fankhauser, M. Bloesch, and M. Hutter, “Probabilistic terrain mapping for mobile robots with uncertain localization,”IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3019–3026, 2018

  26. [26]

    Elevation mapping for locomotion and navigation using GPU,

    T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, and M. Hutter, “Elevation mapping for locomotion and navigation using GPU,” in2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 2273–2280

  27. [27]

    Perceptive locomotion through nonlinear model-predictive control,

    R. Grandia, F. Jenelten, S. Yang, F. Farshidian, and M. Hutter, “Perceptive locomotion through nonlinear model-predictive control,”IEEE Transac- tions on Robotics, vol. 39, no. 5, pp. 3402–3421, 2023

  28. [28]

    An efficient trajectory planner for car-like robots on uneven terrain,

    L. Xu, K. Chai, Z. Han, H. Liu, C. Xu, Y. Cao, and F. Gao, “An efficient trajectory planner for car-like robots on uneven terrain,” in2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 2853–2860

  29. [29]

    Towards efficient trajectory generation for ground robots beyond 2D environment,

    J. Wang, L. Xu, H. Fu, Z. Meng, C. Xu, Y. Cao, X. Lyu, and F. Gao, “Towards efficient trajectory generation for ground robots beyond 2D environment,”CoRR, vol. abs/2302.03323, 2023

  30. [30]

    Efficient trajectory generation based on traversable planes in 3D complex archi- tectural spaces,

    M. Zhang, Z. Tian, Y. Xia, C. Xu, F. Gao, and Y. Cao, “Efficient trajectory generation based on traversable planes in 3D complex archi- tectural spaces,” in2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 14 513–14 519

  31. [31]

    Real- time multilevel terrain-aware path planning for ground mobile robots in large-scale rough terrains,

    Y. Li, K. Chen, Y. Wang, W. Zhang, J. Wang, H. Chen, and Y. Liu, “Real- time multilevel terrain-aware path planning for ground mobile robots in large-scale rough terrains,”IEEE Transactions on Robotics, vol. 41, pp. 4159–4179, 2025

  32. [32]

    TiFA: A terrain- informed navigation framework for articulated tracked robots in rescue missions,

    Y. Wang, W. Zhang, Y. Li, J. Wang, and H. Chen, “TiFA: A terrain- informed navigation framework for articulated tracked robots in rescue missions,”The International Journal of Robotics Research, 2025, on- lineFirst

  33. [33]

    Object goal navigation using goal-oriented semantic exploration,

    D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 4247–4258

  34. [34]

    PONI: Potential functions for objectgoal navigation with interaction-free learning,

    S. K. Ramakrishnan, D. S. Chaplot, Z. Al-Halah, J. Malik, and K. Grauman, “PONI: Potential functions for objectgoal navigation with interaction-free learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18 868– 18 878

  35. [35]

    ZSON: Zero-shot object-goal navigation using multimodal goal embed- dings,

    A. Majumdar, G. Aggarwal, B. Devnani, J. Hoffman, and D. Batra, “ZSON: Zero-shot object-goal navigation using multimodal goal embed- dings,” inAdvances in Neural Information Processing Systems, vol. 35, 2022

  36. [36]

    FOM-Nav: Frontier-object maps for object goal navigation,

    T. Chabal, S. Chen, J. Ponce, and C. Schmid, “FOM-Nav: Frontier-object maps for object goal navigation,”CoRR, vol. abs/2512.01009, 2025

  37. [37]

    OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

    E. Padilla-Cerdio, B. Sun, M. Pollefeys, and H. Blum, “OpenFrontier: General navigation with visual-language grounded frontiers,”CoRR, vol. abs/2603.05377, 2026

  38. [38]

    Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

    D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans, “ObjectNav revisited: On evaluation 15 of embodied agents navigating to objects,”CoRR, vol. abs/2006.13171, 2020

  39. [39]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 19 730–19 742

  40. [40]

    A two-stage optimized next-view planning framework for 3-D unknown environment exploration, and structural reconstruction,

    Z. Meng, H. Qin, Z. Chen, X. Chen, H. Sun, F. Lin, and M. H. A. Jr., “A two-stage optimized next-view planning framework for 3-D unknown environment exploration, and structural reconstruction,”IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1680–1687, 2017

  41. [41]

    Ego-planner: An esdf-free gradient-based local planner for quadrotors,

    X. Zhou, Z. Wang, H. Ye, C. Xu, and F. Gao, “Ego-planner: An esdf-free gradient-based local planner for quadrotors,”IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 478–485, 2021

  42. [42]

    Habitat navigation challenge 2023,

    Habitat Team, “Habitat navigation challenge 2023,” https://github.com/ facebookresearch/habitat-challenge, 2023, accessed: 2026-05-17

  43. [43]

    Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3d environments for embodied AI,

    S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3d environments for embodied AI,” inAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  44. [44]

    Matterport3D: Learning from RGB-D data in indoor environments,

    A. X. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3D: Learning from RGB-D data in indoor environments,” in2017 International Conference on 3D Vision (3DV), 2017, pp. 667–676

  45. [45]

    ESC: Exploration with soft commonsense constraints for zero-shot object navigation,

    K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, “ESC: Exploration with soft commonsense constraints for zero-shot object navigation,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 42 829–42 842

  46. [46]

    OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models,

    Y. Kuang, H. Lin, and M. Jiang, “OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models,” in Findings of the Association for Computational Linguistics: NAACL 2024. Association for Computational Linguistics, 2024, pp. 338–351

  47. [47]

    TriHelper: Zero-shot object navigation with dynamic assistance,

    L. Zhang, Q. Zhang, H. Wang, E. Xiao, Z. Jiang, H. Chen, and R. Xu, “TriHelper: Zero-shot object navigation with dynamic assistance,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

  48. [48]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V. Alwala, H. Khedr, A. Huanget al., “SAM 3: Segment anything with concepts,”CoRR, vol. abs/2511.16719, 2025

  49. [49]

    An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems,

    K. Helsgaun, “An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems,” Roskilde University, Roskilde, Denmark, Technical Report, 2017. [Online]. Available: http://www.akira.ruc.dk/ ∼keld/ research/LKH-3/LKH-3 REPORT.pdf

  50. [50]

    InstructNav: Zero-shot system for generic instruction navigation in unexplored environment,

    Y. Long, W. Cai, H. Wang, G. Zhan, and H. Dong, “InstructNav: Zero-shot system for generic instruction navigation in unexplored environment,” in Proceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 2049–2060

  51. [51]

    Fast-lio2: Fast direct lidar- inertial odometry,

    W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, “Fast-lio2: Fast direct lidar- inertial odometry,”IEEE Transactions on Robotics, vol. 38, no. 4, pp. 2053–2073, 2022

  52. [52]

    Intel RealSense Depth Camera D435i,

    Intel Corporation, “Intel RealSense Depth Camera D435i,” https://www.intel.com/content/www/us/en/products/sku/190004/ intel-realsense-depth-camera-d435i.html, 2026, accessed: 2026-05- 17