
arxiv: 2604.21363 · v2 · pith:4VEGZFNTnew · submitted 2026-04-23 · 💻 cs.RO

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Pith reviewed 2026-05-20 23:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language Navigation · Embodied AI · Hierarchical Architecture · Memory Graph · Context-Aware Exploration · Robot Deployment · Real-Time Systems

The pith

Hierarchical fast and deep layers with a compact memory graph let vision-language navigation run efficiently on real robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a vision-language navigation system built for physical robots that must operate under tight limits on computation, memory, and latency. It assigns immediate perception and movement to a fast layer and reserves a slower deep layer for higher-level reasoning that consults a vision-language model. The two layers share a compact memory graph that is updated incrementally and split into subgraphs, which the model processes a few at a time rather than all at once. The choice of where to explore next is cast as a Weighted Traveling Repairman Problem (WTRP), a routing formulation that weighs the reasoning output against the physical layout of candidate regions. Experiments on both simulated and physical platforms report gains in success rate and path efficiency over earlier methods while staying within real-time bounds on modest hardware.

Core claim

The system decomposes navigation into an asynchronous fast perception-action layer and a deep reasoning layer that progressively consumes subgraphs from an incrementally constructed compact memory graph, with exploration posed as a Weighted Traveling Repairman Problem that incorporates both spatial distribution and reasoning outcomes, thereby delivering stronger long-horizon performance without sacrificing real-time execution on resource-limited robots.
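To make the exploration objective concrete, here is a minimal sketch of a weighted traveling repairman objective: choose the visiting order that minimizes the sum of each region's weight times its arrival time. It is not the paper's solver or its exact objective. The function names, the unit travel speed, the straight-line distances, and the idea that weights stand in for reasoning scores are all assumptions made for illustration, and the exhaustive search is only practical for a handful of candidate regions.

```python
from itertools import permutations
from math import dist


def weighted_latency(start, order, points, weights):
    """Sum of weight * arrival time along `order`, assuming unit speed."""
    elapsed, position, total = 0.0, start, 0.0
    for index in order:
        elapsed += dist(position, points[index])  # straight-line travel (assumption)
        total += weights[index] * elapsed
        position = points[index]
    return total


def solve_wtrp_bruteforce(start, points, weights):
    """Try every visiting order; feasible only for a few candidate regions."""
    best = min(
        permutations(range(len(points))),
        key=lambda order: weighted_latency(start, order, points, weights),
    )
    return list(best), weighted_latency(start, best, points, weights)


if __name__ == "__main__":
    # Hypothetical candidate regions; weights play the role of reasoning scores.
    regions = [(1.0, 0.0), (-2.0, 0.0), (0.0, 3.0)]
    scores = [1.0, 10.0, 1.0]
    print(solve_wtrp_bruteforce((0.0, 0.0), regions, scores))
```

In this toy instance the nearest region is not visited first: the heavily weighted region two units away dominates the cost, which is the behavior that separates a weighted repairman objective from nearest-frontier exploration.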

What carries the argument

The hierarchical cognition architecture that runs a fast perception-action layer and a deep reasoning layer asynchronously, connected by a shared memory layer that incrementally builds and decomposes a compact memory graph for the vision-language model.
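The two-rate design described above can be sketched with two threads that share a lock-protected store. This is an illustrative sketch only: `SharedMemory`, `fast_layer`, `deep_layer`, the loop periods, and the use of a snapshot length as a stand-in for a slow vision-language model call are all hypothetical and are not taken from the paper's implementation.

```python
import threading
import time


class SharedMemory:
    """Hypothetical shared memory layer: a lock-protected list of observations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._observations = []

    def add(self, observation):
        with self._lock:
            self._observations.append(observation)

    def snapshot(self):
        with self._lock:
            return list(self._observations)  # copy, so readers never hold the lock long


def fast_layer(memory, n_steps, period_s, done):
    """High-rate loop: record one observation per step, then signal completion."""
    for step in range(n_steps):
        memory.add(f"obs-{step}")
        time.sleep(period_s)
    done.set()


def deep_layer(memory, period_s, done, decisions):
    """Low-rate loop: reason over a snapshot without blocking the fast loop."""
    while True:
        finished = done.is_set()      # read the flag before taking the snapshot
        seen = memory.snapshot()
        decisions.append(len(seen))   # stand-in for a slow reasoning call
        if finished:
            break
        done.wait(period_s)           # sleep one period, or wake early on completion


def run(n_steps=20, fast_period_s=0.001, deep_period_s=0.005):
    memory, done, decisions = SharedMemory(), threading.Event(), []
    threads = [
        threading.Thread(target=fast_layer, args=(memory, n_steps, fast_period_s, done)),
        threading.Thread(target=deep_layer, args=(memory, deep_period_s, done, decisions)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return memory.snapshot(), decisions
```

The deep loop works on a copied snapshot, so a slow reasoning call never stalls the fast loop; the price is that each decision may be based on slightly stale memory.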

If this is right

  • Navigation success and path efficiency increase relative to prior vision-language navigation methods in both simulation and physical tests.
  • Real-time operation continues on hardware with strict constraints on compute, memory, and energy.
  • Long-horizon instructions are handled by feeding only relevant subgraphs to the model rather than requiring the full environment at every step.
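One way to picture feeding only relevant subgraphs to the model is to split a graph into small connected chunks and rank them. The sketch below uses breadth-first chunking with a size cap and ranks chunks by a per-node relevance score. The paper's actual decomposition and prioritization rules are not described on this page, so `decompose`, `prioritize`, the size cap, and the relevance scores are assumptions for illustration.

```python
from collections import deque


def decompose(adjacency, max_nodes):
    """Split a graph into connected chunks of at most `max_nodes` nodes via BFS."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    unassigned = set(adjacency)
    chunks = []
    for seed in adjacency:  # dict order keeps the result deterministic
        if seed not in unassigned:
            continue
        chunk, queue = [], deque([seed])
        unassigned.discard(seed)
        while queue:
            node = queue.popleft()
            chunk.append(node)
            for neighbor in adjacency[node]:
                # Enqueue only while the chunk plus pending nodes stays under the cap.
                if neighbor in unassigned and len(chunk) + len(queue) < max_nodes:
                    unassigned.discard(neighbor)
                    queue.append(neighbor)
        chunks.append(chunk)
    return chunks


def prioritize(chunks, relevance):
    """Order chunks by their most relevant node, highest first (stable sort)."""
    return sorted(chunks, key=lambda chunk: -max(relevance.get(n, 0.0) for n in chunk))


if __name__ == "__main__":
    # Hypothetical memory graph of rooms and hypothetical relevance scores.
    rooms = {
        "hall": ["kitchen", "bedroom"],
        "kitchen": ["hall", "pantry"],
        "bedroom": ["hall", "bath"],
        "pantry": ["kitchen"],
        "bath": ["bedroom"],
    }
    print(prioritize(decompose(rooms, max_nodes=3), {"bath": 0.9, "pantry": 0.2}))
```

Each chunk can then be serialized into a prompt on its own, so the prompt size is bounded by the cap rather than by the size of the explored environment.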

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split-layer design with an evolving compact graph could be tested on other continuous-control tasks such as mobile manipulation.
  • Formulating exploration as a Weighted Traveling Repairman Problem invites direct comparisons with classical routing algorithms in future spatial-planning work.

Load-bearing premise

The fast and deep asynchronous layers plus the compact memory graph can keep enough context for long-horizon navigation without losing key details or creating timing mismatches in changing real-world conditions.

What would settle it

A sequence of real-world trials in which the robot repeatedly loses track of earlier landmarks or collides because the memory graph falls out of sync with the current scene would show the context-maintenance claim does not hold.

Figures

Figures reproduced from arXiv: 2604.21363 by Chen Wang, Denan Liang, Kuan Xu, Lihua Xie, Ruimeng Liu, Shenghai Yuan, Tongxing Jin, Yizhuo Yang.

Figure 1. We develop a deployable vision–language navigation system that …
Figure 2. Overview of the proposed system architecture. The framework decouples perception, memory, and reasoning into three layers, enabling real-time …
Figure 3. The memory graph is decomposed into subgraphs and prioritized …
Figure 4. Failure case analysis of our system on the MP3D and HM3D datasets.
Figure 5. We deploy our system on a quadruped robot, which performs high-level …
Original abstract

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-and-language navigation (VLN), existing approaches often face a trade-off between reasoning capability and deployment efficiency on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and strong high-level reasoning on real-world robots. The system is decomposed into a fast perception-action layer and a deep reasoning layer running asynchronously at different time scales, with a shared memory layer enabling efficient interaction between them. To support long-horizon reasoning, we incrementally construct a compact memory graph and progressively feed decomposed subgraphs into a vision-language model (VLM). Furthermore, we formulate exploration as a Weighted Traveling Repairman Problem (WTRP) by jointly considering reasoning outcomes and the spatial distribution of candidate regions. Extensive experiments in simulation and real-world environments demonstrate improved navigation success and efficiency over existing VLN approaches while maintaining real-time performance on resource-constrained hardware. Code and additional real-world experiments are available at https://github.com/xukuanHIT/HiCo-Nav.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents HiCo-Nav, a deployable embodied VLN system that decomposes perception, reasoning, and planning into a fast perception-action layer and a deep reasoning layer running asynchronously at different timescales with a shared memory layer. Long-horizon reasoning is supported by incrementally constructing a compact memory graph whose decomposed subgraphs are progressively fed to a VLM; exploration is formulated as a Weighted Traveling Repairman Problem (WTRP) that jointly incorporates reasoning outcomes and spatial layout of candidate regions. The central claim is that this architecture yields improved navigation success and efficiency over prior VLN methods in both simulation and real-world settings while sustaining real-time operation on resource-constrained hardware.

Significance. If the experimental results are shown to be robust, the work would meaningfully advance practical embodied VLN by demonstrating that high-level VLM reasoning can be reconciled with strict real-time and hardware constraints through hierarchical asynchronous design and compact memory management.

major comments (3)
  1. [§4] §4 (Experiments) and Table 2: the headline claim of superior real-world success and efficiency rests on quantitative comparisons, yet the reported metrics lack error bars, statistical significance tests, and explicit exclusion criteria for failed trials; without these it is impossible to confirm that the observed gains are attributable to the hierarchical architecture rather than implementation details.
  2. [§3.3] §3.3 (Compact Memory Graph): the incremental compaction and subgraph decomposition process is described as preserving context for long-horizon reasoning, but no ablation quantifies information loss (e.g., spatial-temporal detail retention rate or failure cases on long trajectories); if compaction discards critical layout information, the WTRP-driven exploration and VLM reasoning claims would not hold.
  3. [§3.2] §3.2 (Asynchronous Layers): the fast and deep layers operate at different timescales with shared memory, yet the manuscript provides no timing analysis or measurements of desynchronization under dynamic environmental changes; timing conflicts would directly undermine the claimed real-time deployability and context maintenance.
minor comments (2)
  1. [Abstract] The abstract states performance improvements without citing the concrete success-rate or SPL deltas relative to the strongest baseline; adding these numbers would improve readability.
  2. [§3.4] Notation for the WTRP objective (Eq. 7) introduces weights derived from reasoning outcomes; clarify whether these weights are recomputed online or fixed per episode.
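As an illustration of the error bars requested in major comment 1, a success rate over a small number of trials can be reported with a Wilson score interval, which behaves better than the normal approximation when trial counts are low. This is a generic statistical sketch, not the authors' analysis; the function name is hypothetical and the trial counts in the usage example are made up.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    low, high = wilson_interval(successes=15, trials=20)
    print(f"success rate 0.75, 95% interval [{low:.3f}, {high:.3f}]")
```

With only twenty trials the interval spans several tens of percentage points, which is why small real-world comparisons need intervals or significance tests before a gain can be attributed to the architecture.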

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

  1. Referee: [§4] §4 (Experiments) and Table 2: the headline claim of superior real-world success and efficiency rests on quantitative comparisons, yet the reported metrics lack error bars, statistical significance tests, and explicit exclusion criteria for failed trials; without these it is impossible to confirm that the observed gains are attributable to the hierarchical architecture rather than implementation details.

    Authors: We agree that incorporating statistical analysis would enhance the credibility of our experimental results. In the revised manuscript, we will add error bars to the metrics in Table 2 and other figures, conduct statistical significance tests between our method and the baselines, and provide explicit criteria for trial exclusion (e.g., due to sensor failures or exceeding time limits). These additions will help confirm that the performance gains are due to the hierarchical design rather than other factors.
    Revision: yes

  2. Referee: [§3.3] §3.3 (Compact Memory Graph): the incremental compaction and subgraph decomposition process is described as preserving context for long-horizon reasoning, but no ablation quantifies information loss (e.g., spatial-temporal detail retention rate or failure cases on long trajectories); if compaction discards critical layout information, the WTRP-driven exploration and VLM reasoning claims would not hold.

    Authors: We acknowledge the importance of quantifying any information loss in the memory graph construction. We will perform and include an ablation study in the revised version that evaluates the retention of spatial and temporal details after compaction and examines failure modes on extended trajectories. This will provide evidence that the compact memory graph maintains the necessary information for effective WTRP-based exploration and VLM reasoning.
    Revision: yes

  3. Referee: [§3.2] §3.2 (Asynchronous Layers): the fast and deep layers operate at different timescales with shared memory, yet the manuscript provides no timing analysis or measurements of desynchronization under dynamic environmental changes; timing conflicts would directly undermine the claimed real-time deployability and context maintenance.

    Authors: We appreciate this observation regarding the need for timing analysis. In the revision, we will add a detailed timing breakdown of the asynchronous layers, including measurements of their execution frequencies and any observed desynchronization in dynamic scenarios. We will also discuss how the shared memory layer helps in maintaining context and ensuring real-time operation despite the different timescales.
    Revision: yes
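One simple way to quantify the desynchronization raised in the third comment is to count how many fast-layer memory updates land while each deep-layer reasoning call is in flight, since those updates are invisible to the resulting decision. The sketch below assumes timestamped logs are available; the log format, the millisecond units, and the function name are hypothetical and do not describe the authors' instrumentation.

```python
from bisect import bisect_right


def missed_updates(update_times_ms, reasoning_calls_ms):
    """Count memory updates that arrive during each reasoning call.

    update_times_ms: sorted timestamps of fast-layer memory updates.
    reasoning_calls_ms: (start, end) pairs; the snapshot is taken at `start`
    and the decision is delivered at `end`.
    """
    counts = []
    for start, end in reasoning_calls_ms:
        if end < start:
            raise ValueError("reasoning call ends before it starts")
        # Updates with start < t <= end were not in the snapshot.
        counts.append(bisect_right(update_times_ms, end) - bisect_right(update_times_ms, start))
    return counts


if __name__ == "__main__":
    # Hypothetical log: one update every 100 ms, two reasoning calls.
    updates = list(range(0, 1000, 100))
    print(missed_updates(updates, [(50, 450), (500, 720)]))
```

Reporting the distribution of this count alongside per-layer loop frequencies would show how stale the deep layer's view typically is when its decision reaches the fast layer.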

Circularity Check

0 steps flagged

No circularity: modeling choice and system architecture are independent of target metrics


The paper presents a hierarchical VLN architecture with asynchronous fast/deep layers and an incrementally built compact memory graph, then formulates exploration as a WTRP that incorporates reasoning outcomes and spatial layout. This is explicitly described as a joint modeling decision rather than a derived quantity fitted to or defined by the success/efficiency metrics it is later evaluated on. No equations, fitted parameters, or self-citations are shown reducing the central claims (real-world success, efficiency, real-time deployability) back to themselves by construction. The derivation chain remains self-contained against external benchmarks and experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard robotics assumptions about real-time asynchronous execution and graph-based spatial memory; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Asynchronous fast perception-action and deep reasoning layers with shared memory can support long-horizon VLN without major synchronization loss.
    Invoked when describing the two-layer decomposition and memory interaction.

pith-pipeline@v0.9.0 · 5775 in / 1255 out tokens · 54039 ms · 2026-05-20T23:50:06.637296+00:00 · methodology


