pith. sign in

arxiv: 2606.10577 · v1 · pith:F5URKEP6new · submitted 2026-06-09 · 💻 cs.RO

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

Pith reviewed 2026-06-27 13:33 UTC · model grok-4.3

classification 💻 cs.RO
keywords zero-shot VLN-CEtool-calling harnessvision-language modelspixel-based action selectionon-demand depth queriesagentic memoryR2R-CE benchmarkcontinuous navigation
0
0 comments X

The pith

AgenticNav lets vision-language models navigate continuous spaces by calling tools for pixel actions, depth, and selective memory recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes zero-shot vision-and-language navigation in continuous environments as an agentic interface problem. Instead of training waypoint predictors or accumulating long histories, it supplies the VLM with three callable tools: direct pixel selection for actions, on-demand depth queries at chosen pixels, and a compact map paired with a recall tool for selective past observations. This design removes the action-space limits of predictors and the irrelevant context of full histories. On the R2R-CE benchmark the approach reaches new state-of-the-art results among zero-shot methods that use the same VLM backbone, with additional real-world tests showing improved generalization. A sympathetic reader would care because the work demonstrates that exposing granular, on-demand tools can unlock better embodied behavior from existing models without further training.

Core claim

AgenticNav is a lightweight navigation harness that exposes action, depth, and memory as callable tools to a vision-language model. The action tool converts a chosen pixel in an RGB image into executable motion, the depth tool returns metric distance for any requested pixel, and the memory tool supplies a compact trajectory map together with selective recall of past visual observations. This interface replaces waypoint predictors and long textual histories. On the R2R-CE benchmark the method sets new state-of-the-art performance among zero-shot approaches using the same VLM backbone and shows stronger real-world generalization than prior methods.

What carries the argument

The agentic tool-calling harness that turns VLM outputs into direct pixel-based actions, on-demand depth values, and selective recall from a compact map image.

If this is right

  • Direct pixel-based action selection removes the restriction that waypoint predictors place on the model's action space.
  • On-demand depth queries supply metric information only where the model requests it, avoiding unnecessary depth input.
  • The compact map plus selective recall tool keeps memory relevant and prevents prompt overload from long histories.
  • The overall harness yields higher success rates on R2R-CE than previous zero-shot methods that share the same VLM.
  • Real-world robot trials confirm better zero-shot transfer than waypoint-based baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tool-exposure pattern could be applied to other embodied tasks such as manipulation or exploration where VLMs currently rely on learned predictors.
  • Selective recall from a compact map offers a general way to manage context length in any long-horizon agentic system that uses VLMs.
  • If pixel-level action selection proves robust, it may reduce the need to train separate low-level controllers for many navigation domains.
  • The design invites experiments that add further tools, such as semantic object queries, to the same harness.

Load-bearing premise

The vision-language model can interpret the new tools and use them to choose effective navigation actions without the crutches of waypoint predictors or full history accumulation.

What would settle it

An ablation on R2R-CE in which the same VLM backbone is run once with the full tool harness and once with the action tool replaced by a standard waypoint predictor; if the tool version no longer outperforms prior zero-shot baselines, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2606.10577 by Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin, Yijian Li.

Figure 1
Figure 1. Figure 1: Overview of AgenticNav. We reformulate zero-shot VLN-CE as a tool-calling interface [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: AgenticNav method overview. The VLM queries depth, optionally recalls past node [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Real-world tool-use demonstrations. 4.3 Ablations Tool components [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces AgenticNav, a lightweight tool-calling harness that reformulates zero-shot VLN-CE as an agentic interface between a VLM and the environment. It replaces learned waypoint predictors with a pixel-selection action tool that converts chosen RGB pixels to motion, adds an on-demand pixel-depth query tool, and implements memory via a compact map image paired with a selective-recall tool. The central empirical claim is that this design yields new state-of-the-art performance on the R2R-CE benchmark among zero-shot methods that use the identical VLM backbone, with supporting real-world validation and ablations that attribute gains to the individual tools.

Significance. If the reported gains are reproducible and statistically supported, the work could meaningfully advance zero-shot navigation by showing that direct, callable tools can outperform auxiliary learned modules while preserving the zero-shot setting. The selective-recall memory mechanism addresses a common context-overload problem in long-horizon VLN and may generalize to other agentic VLM applications.

major comments (3)
  1. [Abstract] Abstract: the claim that AgenticNav 'establishes new state-of-the-art (SOTA) performance' on R2R-CE is presented without any numerical results (success rate, SPL, NE, etc.), error bars, dataset splits, or references to tables/figures that would allow verification of the magnitude or statistical significance of the improvement over prior zero-shot baselines using the same VLM.
  2. [Abstract] Abstract: the action tool is described as allowing the VLM to 'directly select a target pixel in RGB observations, converting it into executable motion,' yet no concrete mechanism, camera model, or motion primitive is specified; without this, it is impossible to evaluate whether the tool actually expands the action space beyond the limitations attributed to waypoint predictors.
  3. [Abstract] Abstract: the ablations are asserted to show that 'our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute,' but no quantitative ablation results, baseline definitions, or section references are supplied, leaving the load-bearing empirical support for the tool-design choices unverified.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the specific VLM backbone used for the reported comparisons.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments focused on the abstract. We agree that the abstract should be revised to include key quantitative results, section references, and brief clarifications to make the claims verifiable. We address each point below and will incorporate the changes in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that AgenticNav 'establishes new state-of-the-art (SOTA) performance' on R2R-CE is presented without any numerical results (success rate, SPL, NE, etc.), error bars, dataset splits, or references to tables/figures that would allow verification of the magnitude or statistical significance of the improvement over prior zero-shot baselines using the same VLM.

    Authors: We agree that the abstract would benefit from explicit metrics and references. In the revision we will update the abstract to state the success rate, SPL, and NE achieved on the R2R-CE val-unseen split, note that results are reported with standard deviations across runs, and add a direct reference to Table 1 and Section 4.1, where the full comparison against prior zero-shot methods that use the identical VLM backbone is presented. revision: yes

  2. Referee: [Abstract] Abstract: the action tool is described as allowing the VLM to 'directly select a target pixel in RGB observations, converting it into executable motion,' yet no concrete mechanism, camera model, or motion primitive is specified; without this, it is impossible to evaluate whether the tool actually expands the action space beyond the limitations attributed to waypoint predictors.

    Authors: The abstract intentionally remains high-level. The concrete implementation—projection of the selected pixel through the camera intrinsics, on-demand depth query for metric scaling, and the resulting motion primitive—is fully specified in Section 3.2. We will add a short parenthetical phrase in the abstract (“via camera projection and depth-guided motion primitive”) to indicate the mechanism without lengthening the text excessively, while retaining the pointer to the method section. revision: partial

  3. Referee: [Abstract] Abstract: the ablations are asserted to show that 'our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute,' but no quantitative ablation results, baseline definitions, or section references are supplied, leaving the load-bearing empirical support for the tool-design choices unverified.

    Authors: We acknowledge the abstract’s reference to ablations is too terse. The quantitative results, baseline definitions (learned waypoint predictors from prior work), and per-tool contributions are reported in Section 4.3 and Table 3. In the revised abstract we will insert a concise clause (“Ablations in Section 4.3 show the action tool outperforms waypoint predictors, with further gains from the depth tool and agentic memory”) together with the table reference. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with benchmark evaluation

full rationale

The paper describes an agentic tool-calling harness for zero-shot VLN-CE, exposing pixel-action, depth, and selective-recall map tools to a VLM. Central claims rest on R2R-CE benchmark results and ablations showing tool utility over waypoint predictors. No equations, fitted parameters, derivations, or self-citation chains are present that reduce any result to inputs by construction. Evaluation is externally falsifiable via standard benchmarks and does not invoke uniqueness theorems or prior author work as load-bearing premises. This is a standard empirical systems contribution with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim rests on the domain assumption that VLMs can leverage the described tool interface zero-shot, plus three newly introduced tool mechanisms whose effectiveness is evidenced only by the reported benchmark gains.

axioms (1)
  • domain assumption Large vision-language models can effectively utilize tool-calling interfaces for navigation tasks without additional training or fine-tuning.
    This assumption underpins the entire zero-shot design and is invoked when the abstract states that the VLM uses the exposed tools directly.
invented entities (3)
  • Pixel-selection action tool no independent evidence
    purpose: Converts VLM-chosen pixels in RGB observations into executable robot motion, replacing waypoint predictors.
    New tool introduced by the harness; no independent evidence outside the reported performance.
  • On-demand pixel-depth tool no independent evidence
    purpose: Supplies precise metric distances from depth input only for selected pixels.
    New tool introduced by the harness; no independent evidence outside the reported performance.
  • Compact map image with recall tool no independent evidence
    purpose: Summarizes trajectory history and enables selective retrieval of past observations without long context.
    New memory mechanism introduced by the harness; no independent evidence outside the reported performance.

pith-pipeline@v0.9.1-grok · 5831 in / 1501 out tokens · 30272 ms · 2026-06-27T13:33:55.023918+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

  1. [1]

    Anderson, Q

    P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. S ¨underhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded naviga- tion instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  2. [2]

    Krantz, E

    J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InProceedings of the European Conference on Computer Vision (ECCV), 2020

  3. [3]

    Y . Qiao, W. Lyu, H. Wang, Z. Wang, Z. Li, Y . Zhang, M. Tan, and Q. Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025

  4. [4]

    X. Shi, Z. Li, W. Lyu, J. Xia, F. Dayoub, Y . Qiao, and Q. Wu. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

  5. [5]

    G. Dai, S. Wang, Z. Wang, G.-S. Xie, Y . Yang, J. Pan, Q. Sun, and X. Shu. History to fu- ture: Evolving agent with experience and thought for zero-shot vision-and-language naviga- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2026

  6. [6]

    Fried, R

    D. Fried, R. Hu, V . Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navi- gation. InAdvances in Neural Information Processing Systems (NeurIPS), 2018

  7. [7]

    C.-Y . Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira. The regretful agent: Heuristic-aided navigation through progress estimation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  8. [8]

    F. Zhu, Y . Zhu, X. Chang, and X. Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  9. [9]

    W. Hao, C. Li, X. Li, L. Carin, and J. Gao. Towards learning a generic agent for vision-and- language navigation via pre-training. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  10. [10]

    Guhur, M

    P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid. Airbert: In-domain pretraining for vision-and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  11. [11]

    Z. Wang, J. Li, Y . Hong, Y . Wang, Q. Wu, M. Bansal, S. Gould, H. Tan, and Y . Qiao. Scaling data generation in vision-and-language navigation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2023

  12. [12]

    Krantz, A

    J. Krantz, A. Gokaslan, D. Batra, S. Lee, and O. Maksymets. Waypoint models for instruction- guided navigation in continuous environments. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  13. [13]

    Y . Hong, Z. Wang, Q. Wu, and S. Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  14. [14]

    Chen, P.-L

    S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev. Think global, act local: Dual- scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 9

  15. [15]

    D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao. Bevbert: Multimodal map pre-training for language-guided navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  16. [16]

    D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments.IEEE Trans- actions on Pattern Analysis and Machine Intelligence (TPAMI), 47(7), 2025

  17. [17]

    Z. Wang, X. Li, J. Yang, Y . Liu, and S. Jiang. Gridmm: Grid memory map for vision-and- language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  18. [18]

    Zheng, S

    D. Zheng, S. Huang, L. Zhao, Y . Zhong, and L. Wang. Towards learning a generalist model for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  19. [19]

    Zhang, K

    J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and H. Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. InProceedings of Robotics: Science and Systems (RSS), 2024

  20. [20]

    Cheng, Y

    A.-C. Cheng, Y . Ji, Z. Yang, Z. Gongye, X. Zou, J. Kautz, E. Biyik, H. Yin, S. Liu, and X. Wang. Navila: Legged robot vision-language-action model for navigation. InProceedings of Robotics: Science and Systems (RSS), 2025

  21. [21]

    Ichter, A

    B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Ir- pan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. T. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Lu...

  22. [22]

    C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y . Su. LLM-Planner: Few- shot grounded planning for embodied agents with large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023

  23. [23]

    P. Chen, X. Sun, H. Zhi, R. Zeng, T. H. Li, G. Liu, M. Tan, and C. Gan.A 2Nav: Action-aware zero-shot robot navigation by exploiting vision-and-language ability of foundation models,

  24. [24]

    arXiv preprint arXiv:2308.07997

  25. [25]

    G. Zhou, Y . Hong, and Q. Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, 2024

  26. [26]

    Y . Long, X. Li, W. Cai, and H. Dong. Discuss before moving: Visual language navigation via multi-expert discussions. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

  27. [27]

    J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . Wong. MapGPT: Map-guided prompting with adaptive path planning for vision-and-language navigation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Pa- pers. Association for Computational Linguistics, 2024

  28. [28]

    Y . Long, W. Cai, H. Wang, G. Zhan, and H. Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. InProceedings of The 8th Conference on Robot Learning (CoRL), volume 270 ofProceedings of Machine Learning Research. PMLR, 2025. 10

  29. [29]

    Z. Zhan, L. Yu, S. Yu, and G. Tan. MC-GPT: Empowering vision-and-language navigation with memory map and reasoning chains, 2024. arXiv preprint arXiv:2405.10620

  30. [30]

    Huang, F

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter. Inner monologue: Embodied reasoning through planning with language models. InProceed- ings of The 6th Conference on Robot Learning (CoRL), volume 205 ofProceedings of Ma...

  31. [31]

    Liang, W

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

  32. [32]

    D. Shah, B. Osinski, B. Ichter, and S. Levine. Lm-nav: Robotic navigation with large pre- trained models of language, vision, and action. InProceedings of The 6th Conference on Robot Learning (CoRL), volume 205 ofProceedings of Machine Learning Research. PMLR, 2023

  33. [33]

    Rajvanshi, K

    A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez. Saynav: Grounding large language models for dynamic planning to navigation in new environments. InProceedings of the International Conference on Automated Planning and Scheduling (ICAPS), volume 34, 2024

  34. [34]

    Driess, F

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-e: An embodied multimodal language model. InProceedings of the 40th International Confere...

  35. [35]

    Chandaka, G

    B. Chandaka, G. X. Wang, H. Chen, H. Che, A. J. Zhai, and S. Wang. Human-like navigation in a world built for humans. InProceedings of the 9th Conference on Robot Learning (CoRL), 2025. 11