Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation

Laurent Itti; Wanrong Zheng; Yunhao Ge

arXiv: 2604.26946 · v1 · submitted 2026-04-29 · cs.CV · cs.RO

Pith reviewed 2026-05-07 10:25 UTC · model grok-4.3

keywords: vision-and-language navigation · zero-shot navigation · multimodal large language models · hierarchical planning · global-local navigation · trajectory auditing · R2R-CE · RxR-CE

The pith

A three-step protocol of forward landmarks, current alignment, and backward audit lets multimodal models navigate unknown spaces more reliably without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current zero-shot vision-and-language navigation agents driven by multimodal large language models often drift off course, halt early, or fail to finish because each decision uses only the immediate view. This paper introduces Three-Step Nav, a hierarchical planner that first extracts global landmarks to sketch a coarse route, then aligns the present observation to the next sub-goal for local guidance, and finally audits the full past trajectory to correct accumulated errors before stopping. The method drops into existing pipelines with no fine-tuning or gradient updates. If correct, it would raise success rates on the R2R-CE and RxR-CE benchmarks to new levels for purely zero-shot agents.

Core claim

Three-Step Nav counters drift and early stopping in MLLM-based VLN by enforcing a three-view protocol: look forward to extract global landmarks and sketch a coarse plan, look now to align the current view with the next sub-goal for fine guidance, and look backward to audit the entire trajectory and correct drift before deciding to stop. Requiring no gradient updates or task-specific fine-tuning, the planner achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets.
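
To make the control flow of the three-view protocol concrete, here is a minimal Python sketch. It is not the authors' implementation: the `MLLM` callable, the prompt wording, the action vocabulary (`FORWARD`, `LEFT`, `RIGHT`, `REACHED`, `STOP`), and the fallback rules are all assumptions chosen for illustration. The only thing taken from the paper's description is the ordering: plan once from landmarks, align each view to the next sub-goal, and let a backward audit gate every stop.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MLLM = Callable[[str], str]  # hypothetical interface: prompt text in, reply text out
MOVES = {"FORWARD", "LEFT", "RIGHT"}  # assumed action vocabulary


@dataclass
class NavState:
    instruction: str
    landmarks: List[str] = field(default_factory=list)   # coarse plan from "look forward"
    next_idx: int = 0                                    # index of the next sub-goal
    trajectory: List[str] = field(default_factory=list)  # log consumed by "look backward"


def look_forward(mllm: MLLM, state: NavState) -> None:
    """Step 1: extract ordered global landmarks as a coarse plan."""
    reply = mllm(f"[FORWARD] Instruction: {state.instruction}\nList landmarks, one per line.")
    state.landmarks = [ln.strip() for ln in reply.splitlines() if ln.strip()]


def look_now(mllm: MLLM, state: NavState, view: str) -> str:
    """Step 2: align the current view with the next sub-goal and propose an action."""
    if state.next_idx >= len(state.landmarks):
        return "STOP"  # no sub-goals left, so propose stopping
    goal = state.landmarks[state.next_idx]
    reply = mllm(
        f"[NOW] Sub-goal: {goal}\nView: {view}\nReply FORWARD, LEFT, RIGHT or REACHED."
    ).strip().upper()
    if reply == "REACHED":
        state.next_idx += 1
        return "STOP" if state.next_idx >= len(state.landmarks) else "FORWARD"
    return reply if reply in MOVES else "FORWARD"  # assumed fallback for unparseable replies


def look_backward(mllm: MLLM, state: NavState) -> str:
    """Step 3: audit the whole trajectory; confirm STOP or return a corrective move."""
    log = "\n".join(state.trajectory)
    reply = mllm(
        f"[BACKWARD] Instruction: {state.instruction}\nTrajectory:\n{log}\n"
        "Reply STOP if the route is complete, else a corrective move."
    ).strip().upper()
    return reply if reply in MOVES else "STOP"


def run_episode(mllm: MLLM, instruction: str, get_view: Callable[[], str],
                max_steps: int = 20) -> List[str]:
    state = NavState(instruction)
    look_forward(mllm, state)
    actions: List[str] = []
    for _ in range(max_steps):
        view = get_view()
        action = look_now(mllm, state, view)
        if action == "STOP":
            action = look_backward(mllm, state)  # the audit gates every proposed stop
        actions.append(action)
        state.trajectory.append(f"{view} -> {action}")
        if action == "STOP":
            break
    return actions
```

With a scripted stand-in for the model that replies `"hallway\nkitchen"`, `FORWARD`, `REACHED`, `REACHED`, `LEFT`, `STOP` in that order, `run_episode` returns `["FORWARD", "FORWARD", "LEFT", "STOP"]`: the first proposed stop is overridden by the audit's corrective `LEFT`, and only the second is confirmed.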

What carries the argument

The three-view protocol that separates global landmark extraction, local sub-goal alignment, and trajectory auditing inside repeated multimodal large language model calls.

If this is right

Existing VLN pipelines can adopt the planner with only minimal added overhead.
Zero-shot success rates improve on R2R-CE and RxR-CE without any model updates.
Agents become less prone to premature stopping and cumulative course drift.
The same underlying MLLM can handle both coarse global planning and fine local correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Explicit multi-perspective checks inside a single model may compensate for single-step reasoning limits that appear in many sequential tasks.
The backward audit step could be adapted to other embodied domains such as object manipulation or long-horizon exploration where error builds over time.
If the three-view pattern works here, similar staged global-local-audit loops might raise reliability in other zero-shot instruction-following settings.

Load-bearing premise

The multimodal large language model can reliably extract global landmarks, align observations to sub-goals, and audit trajectories without any fine-tuning or task-specific adaptation.

What would settle it

Deploying the Three-Step Nav agent on the R2R-CE or RxR-CE test split and measuring a success rate no higher than the strongest prior zero-shot baseline would falsify the performance claim.
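
The falsification condition above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, no counts from the paper are used, and the stricter significance check is an added assumption rather than something the paper specifies. It uses a pooled two-proportion z-test, which treats the two sets of episodes as independent samples; because both agents are normally run on the same episodes, a paired test such as McNemar's would arguably be the better choice.

```python
import math
from typing import Tuple


def falsifies_claim(sr_new: float, sr_baseline: float) -> bool:
    """Literal reading: the claim fails if the new SR is no higher than the baseline SR."""
    return sr_new <= sr_baseline


def one_sided_z_test(succ_a: int, n_a: int, succ_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test for H1: rate_a > rate_b. Returns (z, p_value)."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        return 0.0, 0.5  # degenerate case: all successes or all failures in both groups
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return z, p_value


def gain_is_significant(succ_new: int, n_new: int, succ_base: int, n_base: int,
                        alpha: float = 0.05) -> bool:
    """Stricter reading (an assumption): the gain must also be statistically significant."""
    _, p_value = one_sided_z_test(succ_new, n_new, succ_base, n_base)
    return p_value < alpha
```

The two readings can disagree: a small positive gap passes `falsifies_claim` yet may fail `gain_is_significant` when the number of episodes is small.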

Figures

Figures reproduced from arXiv: 2604.26946 by Laurent Itti, Wanrong Zheng, Yunhao Ge.

**Figure 1.** (a) Prior LLM-core planners rely only on …

**Figure 2.** Illustration of the overall pipeline of the proposed methodology. We have three modules: …

**Figure 3.** One successful example in the R2R-CE dataset. (a) …

Original abstract

Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" to audit the entire trajectory and correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Three-Step Nav offers a clean three-view prompting fix for drift and early stopping in zero-shot MLLM navigation, but the SOTA numbers need experiments showing that the MLLM actually handles each of the three calls reliably.

The main point is that this paper adds a backward audit step to MLLM-based VLN planning. The forward view pulls global landmarks for a coarse route, the current view lines up the next sub-goal, and the backward view checks the stored trajectory for drift before the agent stops. That three-step loop is the actual new piece, and it runs with no fine-tuning or extra parameters, just off-the-shelf prompting dropped into existing pipelines.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Three-Step Nav, a hierarchical zero-shot planner for vision-and-language navigation that uses an off-the-shelf multimodal large language model in three sequential views: forward to extract global landmarks and sketch a coarse plan, current to align the observation with the next sub-goal, and backward to audit the stored trajectory for drift correction. The approach requires no gradient updates or task-specific fine-tuning and is claimed to achieve state-of-the-art success rates on the R2R-CE and RxR-CE datasets.

Significance. If the central empirical claim holds, the work would be significant for demonstrating that a lightweight, training-free hierarchical protocol can mitigate common MLLM failure modes (drift, premature stopping) in VLN, offering a drop-in improvement to existing pipelines with low overhead and full code release for reproducibility.

major comments (2)

[§3] Method, three-step protocol description: the SOTA claim rests on the assumption that a single off-the-shelf MLLM can reliably (1) extract usable global landmarks, (2) produce accurate sub-goal alignments, and (3) detect and correct drift via trajectory audit. No quantitative metrics, error rates, or ablation results are reported for these three MLLM calls on R2R-CE or RxR-CE, so it is impossible to verify that the hierarchical planner outperforms a plain MLLM baseline rather than collapsing to it.
[§4] Experiments, results tables: the abstract and the experiments claim state-of-the-art zero-shot SR on R2R-CE and RxR-CE without giving the exact baselines (e.g., prior MLLM VLN methods), the full metric suite (SR, SPL, NE, etc.), the number of evaluation episodes, or the variance across seeds. This prevents assessment of whether the reported gains are robust or statistically significant.
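
For readers unfamiliar with the metric suite named in the second comment, the sketch below computes navigation error (NE), success rate (SR), and success weighted by path length (SPL) under their standard definitions. The `Episode` record and its field names are hypothetical, and the 3 m success radius is the threshold commonly used for R2R-style benchmarks, assumed here rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    final_dist_to_goal: float  # metres between the stop position and the goal
    path_length: float         # metres the agent actually travelled
    shortest_path: float       # metres along the shortest start-to-goal path (assumed > 0)


def evaluate(episodes: List[Episode], success_radius: float = 3.0) -> Dict[str, float]:
    """Return mean NE (metres), SR, and SPL over a list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    success = [ep.final_dist_to_goal <= success_radius for ep in episodes]
    ne = sum(ep.final_dist_to_goal for ep in episodes) / n
    sr = sum(success) / n
    # SPL: each success is weighted by shortest_path / max(path_length, shortest_path),
    # so a successful but wandering trajectory earns less than a direct one.
    spl = sum(
        ep.shortest_path / max(ep.path_length, ep.shortest_path)
        for ep, ok in zip(episodes, success) if ok
    ) / n
    return {"NE": ne, "SR": sr, "SPL": spl}
```

Reporting SPL next to SR matters for this paper in particular: a backward audit that adds corrective moves could raise SR while lengthening paths, and SPL is the metric that would expose that trade-off.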

minor comments (2)

The GitHub link is provided but the repository description does not explicitly list the exact prompt templates used for each of the three MLLM calls; including them would improve reproducibility.
Notation for the stored trajectory and sub-goal list is introduced without a clear diagram or pseudocode listing the data structures passed between the three steps.
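
As an illustration of what the second minor comment asks for, the following sketch shows one possible shape for the records passed between the three steps. Every type, field, and method name here is hypothetical; the paper's actual data structures are not described in the material reviewed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    REACHED = "reached"


@dataclass
class SubGoal:
    landmark: str                     # produced by the forward step
    status: Status = Status.PENDING   # updated by the current-view step


@dataclass(frozen=True)
class Step:
    t: int             # step index within the episode
    action: str        # action that was executed
    view_caption: str  # short text summary of the observation


@dataclass
class Memory:
    subgoals: List[SubGoal] = field(default_factory=list)
    steps: List[Step] = field(default_factory=list)

    def next_subgoal(self) -> Optional[SubGoal]:
        """First sub-goal not yet reached, or None when the plan is exhausted."""
        return next((g for g in self.subgoals if g.status is Status.PENDING), None)

    def record(self, action: str, view_caption: str) -> None:
        """Append one executed step to the stored trajectory."""
        self.steps.append(Step(len(self.steps), action, view_caption))

    def audit_log(self) -> str:
        """Text rendering of plan and trajectory, as input for the backward audit."""
        goals = ", ".join(f"{g.landmark}[{g.status.value}]" for g in self.subgoals)
        moves = "\n".join(f"{s.t}: {s.action} | {s.view_caption}" for s in self.steps)
        return f"Sub-goals: {goals}\n{moves}"
```

Keeping the sub-goal list and the trajectory in one object makes it explicit that the audit reads both what was planned and what was done.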

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional empirical details will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested analyses and clarifications while preserving the core contributions.

Referee: [§3] Method, three-step protocol description: the SOTA claim rests on the assumption that a single off-the-shelf MLLM can reliably (1) extract usable global landmarks, (2) produce accurate sub-goal alignments, and (3) detect and correct drift via trajectory audit. No quantitative metrics, error rates, or ablation results are reported for these three MLLM calls on R2R-CE or RxR-CE, so it is impossible to verify that the hierarchical planner outperforms a plain MLLM baseline rather than collapsing to it.

Authors: We agree that the absence of per-component metrics makes it harder to isolate the contribution of each view. The manuscript emphasizes end-to-end results because the protocol is designed as an integrated pipeline, but we acknowledge this leaves the individual reliability assumptions unquantified. In the revision we will add a dedicated ablation subsection in §3 reporting: landmark extraction precision on a manually annotated subset of R2R-CE episodes, sub-goal alignment accuracy for the current-view step, drift-correction success rate for the backward audit, and full end-to-end performance when each of the three views is removed. These results will be compared against a plain single-view MLLM baseline to demonstrate that the hierarchical structure yields measurable gains rather than collapsing to the baseline.
Revision: yes

Referee: [§4] Experiments, results tables: the abstract and the experiments claim state-of-the-art zero-shot SR on R2R-CE and RxR-CE without giving the exact baselines (e.g., prior MLLM VLN methods), the full metric suite (SR, SPL, NE, etc.), the number of evaluation episodes, or the variance across seeds. This prevents assessment of whether the reported gains are robust or statistically significant.

Authors: We apologize for the incomplete experimental reporting. The evaluations followed the standard R2R-CE and RxR-CE validation protocols (full validation splits, roughly 1,000 or more episodes per setting) and compared against recent zero-shot MLLM VLN baselines. In the revised §4 we will: (i) explicitly enumerate all baselines with citations, (ii) report the complete metric suite (SR, SPL, NE, and any others used), (iii) state the exact episode counts, and (iv) include mean ± standard deviation across seeds where multiple runs were performed, or clearly note single-run results. We will also add a brief discussion of the statistical significance of the observed improvements.
Revision: yes
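
The reporting format promised in item (iv) can be sketched in a few lines. The helper below is hypothetical and uses no results from the paper; it reports mean ± sample standard deviation when several seeds are available and flags single-run results explicitly.

```python
import statistics
from typing import List


def format_metric(name: str, values: List[float]) -> str:
    """Format one metric across seeds as 'mean ± std (n=k)', or flag a single run."""
    if not values:
        raise ValueError("need at least one value")
    mean = statistics.mean(values)
    if len(values) == 1:
        return f"{name}: {mean:.1f} (single run)"
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return f"{name}: {mean:.1f} ± {std:.1f} (n={len(values)})"
```

For example, `format_metric("SR", [40.0, 42.0, 44.0])` yields `SR: 42.0 ± 2.0 (n=3)`, while a one-element list yields the single-run form. With only a handful of seeds the standard deviation is itself a noisy estimate, so the seed count belongs next to every number.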

Circularity Check

0 steps flagged

No circularity: prompting strategy is independent of fitted inputs or self-referential derivations

The paper describes a three-step prompting protocol (look forward for global landmarks, look now for sub-goal alignment, look backward for drift auditing) applied to an off-the-shelf MLLM inside existing VLN pipelines. No equations, parameter fitting, uniqueness theorems, or self-citations appear in the provided text as load-bearing elements of any derivation. The SOTA zero-shot claim is presented as an empirical outcome on R2R-CE and RxR-CE rather than a logical consequence that reduces to the method's own inputs by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical parameters or entities; it builds on existing MLLM capabilities with a structured prompting strategy.

axioms (1)

Domain assumption: multimodal large language models can accurately interpret visual scenes for navigation planning.
This assumption is central to the three-step protocol, which relies on MLLM outputs for landmarks and alignment.
