NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

Chao Gao; Dongjie Huo; Dong Zhang; Guyue Zhou; Junhui Wang; Yan Qiao

arxiv: 2605.16979 · v1 · pith:X4FJKXA3new · submitted 2026-05-16 · 💻 cs.RO


Dongjie Huo, Junhui Wang, Chao Gao, Yan Qiao, Dong Zhang, Guyue Zhou

Pith reviewed 2026-05-19 20:08 UTC · model grok-4.3

classification 💻 cs.RO

keywords mobile robot navigation · natural language constraints · costmap-based planning · zero-shot navigation · social navigation · LLM grounding · behavioral conventions · vision-LiDAR fusion


The pith

NORM-Nav converts natural language behavioral constraints into multi-layer costmaps that standard planners can use for more human-like robot paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make mobile robots follow not only collision-free routes but also local behavioral conventions described in ordinary language when moving through shared human spaces. Conventional costmap navigation emphasizes geometric feasibility and often overlooks such conventions, which can produce socially inappropriate behavior. NORM-Nav addresses this by letting an LLM parse each natural language instruction into structured constraints, then grounding those constraints with real-time vision and LiDAR perception to build layered costmaps. The costmaps carry geometric, semantic, directional, and velocity cues and plug directly into existing grid-based planners. Simulation and real-world experiments indicate higher task success rates and trajectories closer to human references than representative baselines.

Core claim

NORM-Nav is a zero-shot framework that parses natural language instructions with an LLM, grounds the resulting constraints in real-time vision-LiDAR perception, and encodes them as multi-layer costmaps representing geometric, semantic, directional, and velocity cues; these costmaps are compatible with ordinary grid-based planners and produce higher task success rates together with trajectories closer to human references than representative baselines.
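The "structured constraints" step can be made concrete with a small sketch. The schema below (`Constraint`, its field names, and the `kind` values) is hypothetical and not taken from the paper; the LLM call is omitted and replaced by a hard-coded JSON string, so this only illustrates validating an assumed output format.

```python
import json
from dataclasses import dataclass

# Hypothetical constraint kinds, mirroring the four costmap layers named above.
ALLOWED_KINDS = {"geometric", "semantic", "directional", "velocity"}


@dataclass(frozen=True)
class Constraint:
    kind: str     # which costmap layer the rule targets
    target: str   # object or region the rule refers to, e.g. "corridor"
    rule: str     # short machine-readable rule name, e.g. "keep_right"
    value: float  # strength or limit; its meaning depends on `kind`


def parse_constraints(llm_json: str) -> list[Constraint]:
    """Validate the JSON text an LLM is assumed to return and build Constraint objects."""
    constraints = []
    for item in json.loads(llm_json):
        if item.get("kind") not in ALLOWED_KINDS:
            raise ValueError(f"unknown constraint kind: {item.get('kind')!r}")
        constraints.append(
            Constraint(
                kind=item["kind"],
                target=str(item["target"]),
                rule=str(item["rule"]),
                value=float(item["value"]),
            )
        )
    return constraints


# Stand-in for an LLM response to "stay to the right and slow down near the door".
example_response = """
[
  {"kind": "directional", "target": "corridor", "rule": "keep_right", "value": 1.0},
  {"kind": "velocity", "target": "door", "rule": "max_speed", "value": 0.3}
]
"""
parsed = parse_constraints(example_response)
```

Rejecting unknown kinds early is a design choice of this sketch: a malformed LLM output fails loudly before it can reach the costmap.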

What carries the argument

Multi-layer costmaps that encode geometric, semantic, directional, and velocity cues derived from LLM-parsed natural language constraints and fed to unmodified grid planners.
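How the layers are merged before planning is not stated on this page, so the following is only a sketch under an assumed rule: a weighted sum of soft costs, with any lethal cell kept lethal. The function name `fuse_layers`, the weights, and the value 254 (borrowed from the common 0 to 254 costmap convention) are assumptions for illustration.

```python
LETHAL = 254  # assumed lethal-obstacle cost, following the common 0-254 costmap convention


def fuse_layers(layers, weights):
    """Combine same-sized 2D cost grids into one grid.

    Assumed rule: weighted sum of the soft costs, capped just below LETHAL,
    while any cell that is lethal in at least one layer stays lethal.
    """
    rows, cols = len(layers[0]), len(layers[0][0])
    fused = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if any(layer[r][c] >= LETHAL for layer in layers):
                fused[r][c] = LETHAL
                continue
            total = sum(w * layer[r][c] for layer, w in zip(layers, weights))
            fused[r][c] = min(int(round(total)), LETHAL - 1)
    return fused


# Toy 2x3 layers with made-up values.
geometric = [[0, 0, 254], [0, 0, 0]]
semantic = [[0, 100, 0], [0, 0, 0]]
directional = [[50, 50, 50], [0, 0, 0]]
fused = fuse_layers([geometric, semantic, directional], weights=[1.0, 1.0, 0.5])
```

Because the result is a single ordinary cost grid, a grid planner would not need to know how many layers contributed to it.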

If this is right

Existing planners can be reused without code changes when new behavioral rules are introduced.
Robots can adopt fresh social conventions simply by receiving natural language instructions at deployment time.
Navigation becomes more acceptable in human environments because paths respect local customs by construction.
Success rates rise because constraint violations that previously caused task failure are now penalized in the costmap.
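The planner-reuse point in the list above can be illustrated with a toy example: the same plain Dijkstra search is run on two cost grids, and only the grid changes. This is a sketch, not the planner used in the paper; the `plan` function, the corridor layout, and the cost values are invented for illustration, and a "keep right" preference is assumed to be encoded as higher cost on the robot's left.

```python
import heapq


def plan(costmap, start, goal, lethal=254):
    """Plain 4-connected Dijkstra over a 2D cost grid; returns a list of (row, col) cells."""
    rows, cols = len(costmap), len(costmap[0])
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        dist, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if costmap[nr][nc] >= lethal:
                continue
            new_dist = dist + 1.0 + costmap[nr][nc]  # unit step cost plus cell cost
            if new_dist < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_dist
                parent[(nr, nc)] = cell
                heapq.heappush(queue, (new_dist, (nr, nc)))
    if goal not in best:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# A 6x3 corridor; the robot drives from row 5 to row 0, so column 2 is on its right.
corridor = [[0] * 3 for _ in range(6)]
keep_right = [[10, 5, 0] for _ in range(6)]  # made-up costs: left and centre are penalized

plain_path = plan(corridor, start=(5, 1), goal=(0, 1))
right_path = plan(keep_right, start=(5, 1), goal=(0, 1))
```

With the empty grid the path runs straight up the centre column; with the penalized grid the same function routes through the right-hand column, which is the sense in which a behavioral rule can be added without touching planner code.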

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same parsing-and-grounding pipeline could be applied to language-guided manipulation or multi-robot coordination tasks.
Handling conflicting or time-varying instructions would require extensions to the costmap merging step.
Scaling the approach to larger environments might benefit from more robust grounding that tolerates sensor noise.
Real-world deployment would benefit from user studies measuring how well the generated paths match actual human expectations.

Load-bearing premise

The LLM can correctly turn natural language instructions into structured constraints and the real-time vision-LiDAR system can ground those constraints accurately enough to produce costmaps that make the planner behave as intended.

What would settle it

An experiment in which a directional instruction such as 'stay to the right' is issued yet the robot's paths violate the rule at rates equal to or higher than those produced by a baseline planner without language input.
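The test described above needs a violation rate that can be compared between planners. One simple way to define it, assumed here and not taken from the paper, is the fraction of trajectory samples that lie on the left of a corridor centerline. The function name and the sample trajectories below are made up for illustration.

```python
def keep_right_violation_rate(path, centerline_x):
    """Fraction of path samples on the left of the corridor centerline.

    Assumes the robot travels in the +y direction, so "right" means x > centerline_x.
    """
    if not path:
        raise ValueError("path must contain at least one sample")
    violations = sum(1 for x, _ in path if x < centerline_x)
    return violations / len(path)


# Made-up (x, y) samples in metres, not data from the paper.
baseline_path = [(-0.4, 0.0), (-0.2, 1.0), (0.1, 2.0), (-0.3, 3.0)]
constrained_path = [(0.3, 0.0), (0.4, 1.0), (0.5, 2.0), (0.4, 3.0)]

baseline_rate = keep_right_violation_rate(baseline_path, centerline_x=0.0)
constrained_rate = keep_right_violation_rate(constrained_path, centerline_x=0.0)

# The claim would be undermined if the constrained rate were not lower.
rule_is_supported = constrained_rate < baseline_rate
```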

Figures

Figures reproduced from arXiv: 2605.16979 by Chao Gao, Dongjie Huo, Dong Zhang, Guyue Zhou, Junhui Wang, Yan Qiao.

**Figure 2.** The architecture of the proposed method for zero-shot navigation under natural language behavioral constraints. The system integrates LLMs with …

**Figure 3.** Example of parsing online and offline behavioral constraints into …

**Figure 4.** Illustration of the directional constraint layer.

**Figure 5.** The experimental robot platform: (a) front view and (b) side view.

**Figure 6.** Simulation results on three representative navigation tasks. The proposed method produces stable trajectories that closely follow human-operated …

**Figure 7.** Real-world demonstrations of behavior-constrained navigation. The proposed method successfully follows natural language instructions without …

**Figure 8.** Effect of interpolation parameter α on trajectory generation. Larger α values lead to stronger compliance with directional preferences, while smaller values produce more conservative paths.

Abstract

Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision--LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.
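The abstract lists velocity cues as one of the costmap layers but does not say how they reach the controller. A minimal sketch, assuming the velocity layer stores a per-cell speed cap in m/s and that `None` marks an unconstrained cell, might look like this. Both helper functions and all values are hypothetical.

```python
def speed_limit_at(velocity_layer, cell, default_limit):
    """Look up the speed cap (m/s) stored for a grid cell; None means no constraint."""
    row, col = cell
    limit = velocity_layer[row][col]
    return default_limit if limit is None else min(limit, default_limit)


def clamp_command(linear_speed, limit):
    """Clamp a commanded linear speed to the range [-limit, limit]."""
    return max(-limit, min(linear_speed, limit))


# Made-up 2x2 layer: only cell (0, 1), e.g. near a door, carries a 0.3 m/s cap.
velocity_layer = [[None, 0.3], [None, None]]

near_door = clamp_command(0.8, speed_limit_at(velocity_layer, (0, 1), default_limit=1.0))
open_space = clamp_command(0.8, speed_limit_at(velocity_layer, (1, 0), default_limit=1.0))
```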

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NORM-Nav, a zero-shot framework for mobile robot navigation in human-centered environments. It uses an LLM to parse natural language behavioral constraints, grounds them via real-time vision-LiDAR perception, encodes the constraints as multi-layer costmaps (geometric, semantic, directional, and velocity cues), and feeds these directly into standard grid-based planners. Simulation and real-world experiments are reported to show higher task success rates and trajectories closer to human references than representative baselines.

Significance. If the empirical results hold under detailed scrutiny, the work is significant for practical deployment of robots that must respect local behavioral conventions without retraining. The zero-shot framing, direct compatibility with existing planners, and multi-layer costmap representation of language-derived cues address a clear gap between high-level instructions and low-level geometric navigation. The approach is internally consistent and directly testable.

major comments (2)

[§5] §5 (Experiments): The central claim of improved success rates and closer-to-human trajectories rests on quantitative comparisons, yet the manuscript provides limited detail on exact success percentages, trajectory deviation metrics (e.g., DTW or Hausdorff distance), baseline implementations, and statistical significance testing. These specifics are load-bearing for evaluating whether the multi-layer costmap encoding actually delivers measurable gains.
[§3.2] §3.2 (Constraint Encoding): The description of how LLM-structured constraints are mapped to the four costmap layers (geometric, semantic, directional, velocity) lacks explicit equations or pseudocode for the cost functions. Without these, it is difficult to verify that the encoding preserves intended behavior and remains compatible with standard planners without introducing unintended biases.
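For reference, the two trajectory deviation metrics the first comment names as examples can be computed as follows. This is a generic sketch of the discrete Hausdorff distance and unconstrained DTW on 2D point lists; it does not reproduce the paper's evaluation code, and the sample trajectories are made up.

```python
import math


def hausdorff(path_a, path_b):
    """Symmetric discrete Hausdorff distance between two lists of (x, y) points."""
    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(path_a, path_b), directed(path_b, path_a))


def dtw(path_a, path_b):
    """Dynamic time warping cost with Euclidean point distance (no window constraint)."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])
            table[i][j] = step + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


# Made-up trajectories: the robot path is the human path shifted by 1 m in y.
human = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
robot = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]

hausdorff_distance = hausdorff(human, robot)
dtw_cost = dtw(human, robot)
```

Hausdorff reports the worst-case gap between the two curves, while DTW accumulates distance along an alignment, so the two can rank methods differently and are worth reporting together.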

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise statement of the exact success-rate improvements and trajectory metrics achieved, rather than qualitative phrasing.
[§3] Notation for the multi-layer costmap fusion (e.g., how layers are combined before planner input) should be defined explicitly in a single equation or table for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The comments highlight opportunities to improve clarity and completeness, which we address point by point below. We have prepared revisions that directly respond to each concern while preserving the core contributions of the work.


Referee: [§5] §5 (Experiments): The central claim of improved success rates and closer-to-human trajectories rests on quantitative comparisons, yet the manuscript provides limited detail on exact success percentages, trajectory deviation metrics (e.g., DTW or Hausdorff distance), baseline implementations, and statistical significance testing. These specifics are load-bearing for evaluating whether the multi-layer costmap encoding actually delivers measurable gains.

Authors: We agree that greater quantitative transparency will strengthen the evaluation. In the revised manuscript we will expand Section 5 with a table reporting exact success rates for NORM-Nav and all baselines, explicit values for the trajectory deviation metrics (DTW and Hausdorff distance) computed against human reference trajectories, a detailed description of baseline implementations (including any re-implementation choices and parameter settings), and the results of statistical significance tests (paired t-tests with reported p-values). These additions will make the performance gains attributable to the multi-layer costmap encoding fully verifiable.

Revision: yes

Referee: [§3.2] §3.2 (Constraint Encoding): The description of how LLM-structured constraints are mapped to the four costmap layers (geometric, semantic, directional, velocity) lacks explicit equations or pseudocode for the cost functions. Without these, it is difficult to verify that the encoding preserves intended behavior and remains compatible with standard planners without introducing unintended biases.

Authors: We thank the referee for this observation. To improve reproducibility and allow readers to verify compatibility with standard planners, we will revise Section 3.2 to include explicit equations for each of the four cost functions (geometric, semantic, directional, and velocity) and a concise pseudocode block that shows the mapping from the LLM-structured constraint output to the corresponding costmap layers. These additions will demonstrate that the encodings are designed to avoid unintended biases while remaining directly usable by grid-based planners.

Revision: yes
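The paired t-test mentioned in the first response compares two methods on the same trials. The sketch below computes only the t statistic and degrees of freedom with the standard library; the per-trial error values are invented for illustration and are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-trial trajectory errors for a baseline and a proposed method.
baseline_errors = [5, 7, 6, 8]
method_errors = [4, 5, 5, 6]
t_value, dof = paired_t_statistic(baseline_errors, method_errors)
```

Turning `t_value` into a p-value requires the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_rel` returns both the statistic and the p-value if SciPy is available.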

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents an engineering integration of an LLM for parsing natural language behavioral constraints, real-time vision-LiDAR perception for grounding those constraints into geometric/semantic/directional/velocity layers, and multi-layer costmaps fed to standard grid-based planners. No equations, fitted parameters, or self-referential definitions appear in the described pipeline; the zero-shot framing relies on external LLM capabilities and perception modules whose outputs are independently verifiable. Experimental results in simulation and real-world settings provide direct, falsifiable support for improved success rates and human-likeness without any reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on assumptions about LLM parsing accuracy and perception grounding that are invoked to support the zero-shot claim but receive no independent validation in the provided abstract.

axioms (2)

domain assumption: Large language models can accurately parse natural language instructions into structured behavioral constraints suitable for robotics.
This assumption underpins the parsing step described in the abstract.
domain assumption: Real-time vision and LiDAR data can reliably ground parsed constraints to the current environment for costmap construction.
Required for the grounding and encoding steps.
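The grounding assumption can be pictured with a toy example: labelled points with metric positions are written into a semantic grid layer. This is only a sketch under stated assumptions. The detection format `(label, x, y)`, the per-label cost table, and both function names are hypothetical, and the perception stack that would produce such detections is not modelled.

```python
def world_to_cell(x, y, origin_x, origin_y, resolution):
    """Convert world coordinates (m) to (row, col); origin is the grid's lower-left corner."""
    col = int((x - origin_x) // resolution)
    row = int((y - origin_y) // resolution)
    return row, col


def ground_semantic_layer(detections, costs, shape, origin, resolution):
    """Write a per-label cost into every cell hit by a labelled point."""
    rows, cols = shape
    layer = [[0] * cols for _ in range(rows)]
    for label, x, y in detections:
        if label not in costs:
            continue  # labels without a parsed rule leave the layer unchanged
        row, col = world_to_cell(x, y, origin[0], origin[1], resolution)
        if 0 <= row < rows and 0 <= col < cols:
            layer[row][col] = max(layer[row][col], costs[label])
    return layer


# Made-up detections; the last grass point falls outside the 2x3 grid and is ignored.
detections = [("grass", 0.25, 0.75), ("sidewalk", 1.2, 0.1), ("grass", 5.0, 5.0)]
semantic_layer = ground_semantic_layer(
    detections, costs={"grass": 200}, shape=(2, 3), origin=(0.0, 0.0), resolution=0.5
)
```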

invented entities (1)

Multi-layer costmaps encoding geometric, semantic, directional, and velocity cues from language constraints (no independent evidence)
purpose: To represent behavioral constraints in a format directly usable by standard grid-based planners.
Introduced as the core integration mechanism in the framework.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Paper passage: "The interpolation function is expressed as C(u) = floor(c1 + (c2−c1)(u−u1)/(u2−u1))^α ..."
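The quoted interpolation formula is truncated and its bracketing is ambiguous, so the sketch below implements just one possible reading: the normalized position is raised to the power α, scaled between the two endpoint costs, and floored. That reading, the clamping of positions outside the interval, and the function name are all assumptions, and the code makes no attempt to reproduce the trend reported in Figure 8.

```python
import math


def interpolated_cost(u, u1, u2, c1, c2, alpha):
    """Cost at position u, interpolated between (u1, c1) and (u2, c2).

    Assumed reading: floor(c1 + (c2 - c1) * ((u - u1) / (u2 - u1)) ** alpha).
    With alpha = 1 this reduces to floored linear interpolation.
    """
    if u2 == u1:
        raise ValueError("u1 and u2 must differ")
    t = (u - u1) / (u2 - u1)
    t = min(max(t, 0.0), 1.0)  # assumption: clamp positions outside [u1, u2]
    return math.floor(c1 + (c2 - c1) * t ** alpha)


# Midpoint cost between c1 = 0 and c2 = 100 for three made-up alpha values.
linear_mid = interpolated_cost(0.5, 0.0, 1.0, 0, 100, alpha=1.0)
convex_mid = interpolated_cost(0.5, 0.0, 1.0, 0, 100, alpha=2.0)
concave_mid = interpolated_cost(0.5, 0.0, 1.0, 0, 100, alpha=0.5)
```

Under this reading α only reshapes the cost profile between the two endpoints; the endpoint costs themselves stay at c1 and c2.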

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

