PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

Junlin Long; Luke Borgnolo; Maxwell Twelftree; Xu Deng; Yang Zhao; Yiran Wang; Yue Yang; Zeyu Zhang

arxiv: 2606.01788 · v1 · pith:HQJE47LZnew · submitted 2026-06-01 · 💻 cs.CV

Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

Pith reviewed 2026-06-28 15:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords Platonic Topological Map · object goal navigation · vision-language navigation · semantic correspondence · training-free navigation · embodied navigation · self-supervised vision · blind matching

The pith

A training-free topological map from self-supervised vision alone grounds language goals via blind matching, unifying object-goal and vision-language navigation as interfaces to one semantic manifold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision and language encoders share an underlying semantic structure even when trained separately. It builds a Platonic Topological Map that combines geometric distances with semantic distances extracted from a self-supervised visual encoder. Language instructions are matched to map nodes through blind comparison with no paired vision-language data or cross-modal fine-tuning. If this holds, vision-only object navigation, cross-modal object navigation, and vision-language navigation become different access points to the same object-centric manifold. This matters for embodied agents because it removes the requirement for large paired datasets or explicit alignment training when moving between tasks and robot bodies.

Core claim

Extending the Platonic Representation Hypothesis to embodied navigation shows that vision-only ObjNav, cross-modal ObjNav, and VLN are three interfaces to the same object-centric semantic manifold. The Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder and grounds language goals through blind matching without any paired vision-language data, allowing the same map to support all three tasks and to transfer to real robots without cross-modal training.

What carries the argument

The Platonic Topological Map, which fuses geometric and semantic node distances from a self-supervised visual encoder and performs blind matching to ground language goals.
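
As an editorial illustration of how a fused edge weight could drive planning over such a map, here is a minimal sketch. The convex combination with weight `alpha`, the cosine-distance semantic term, the toy scene, and the names `fused_weight` and `shortest_path` are all assumptions made for this example; the paper's actual fusion formula is not given in the material reviewed here.

```python
import heapq
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def fused_weight(pos_a, pos_b, emb_a, emb_b, alpha=0.5):
    """Hypothetical fusion: convex mix of Euclidean and semantic distance."""
    geometric = math.dist(pos_a, pos_b)
    semantic = cosine_distance(emb_a, emb_b)
    return alpha * geometric + (1.0 - alpha) * semantic

def shortest_path(nodes, edges, start, goal, alpha=0.5):
    """Dijkstra over an undirected graph whose edges carry the fused weight.

    nodes: {name: (position, embedding)}; edges: iterable of (name, name).
    Returns the node sequence from start to goal, or None if unreachable.
    """
    adj = {name: [] for name in nodes}
    for a, b in edges:
        w = fused_weight(nodes[a][0], nodes[b][0], nodes[a][1], nodes[b][1], alpha)
        adj[a].append((b, w))
        adj[b].append((a, w))
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, w in adj[node]:
            new_cost = cost + w
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy scene (synthetic): name -> (2-D position in metres, 2-D unit embedding).
NODES = {
    "door":  ((0.0, 0.0), (1.0, 0.0)),
    "sofa":  ((1.0, 0.0), (0.8, 0.6)),
    "table": ((0.0, 3.0), (0.0, 1.0)),
    "lamp":  ((1.0, 1.0), (0.6, 0.8)),
}
EDGES = [("door", "sofa"), ("sofa", "lamp"), ("door", "table"), ("table", "lamp")]

path = shortest_path(NODES, EDGES, "door", "lamp")
print(path)
```

In this toy scene the route through the sofa is both geometrically shorter and semantically smoother than the route through the table, so both terms agree. Setting `alpha` to 1.0 or 0.0 isolates the geometric or the semantic term, which is the kind of ablation the rebuttal below says the full paper reports.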

If this is right

The same map supports vision-only object goal navigation on HM3D-IIN and OVON without language input.
The map also supports cross-modal object goal navigation and vision-language navigation on R2R-CE using the same blind matching step.
The framework transfers directly to physical robots such as the Unitree Go2 without additional cross-modal training.
No explicit supervision or paired data is required to switch between tasks or robot embodiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment is reliable, the same blind-matching approach could be tested on non-navigation tasks that require grounding language in visual scenes.
Topological maps built this way might reduce dependence on large vision-language models for other embodied problems.
The method invites direct comparison of success rates when the visual encoder is swapped for alternatives that lack semantic structure.
Success on multiple benchmarks suggests the manifold view may extend to additional sensory modalities beyond vision and language.

Load-bearing premise

A self-supervised visual encoder already produces features whose semantic structure aligns with language closely enough for blind matching to map nodes to succeed without any paired vision-language data.
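
To make the premise concrete, the sketch below shows one way blind matching can work in principle: compare the pairwise similarity structure inside each modality and search for the assignment under which the two structures agree. The squared-difference objective, the brute-force permutation search, and the synthetic embeddings are assumptions chosen for readability; the solver used in the paper is not described in the material reviewed here, and real encoders would agree only approximately.

```python
import itertools
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities within a single modality."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    units = [unit(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in units] for u in units]

def blind_match(vision_vectors, language_vectors):
    """Match language item i to vision item perm[i] without any paired data.

    Only within-modality similarities are compared, so the two embedding
    spaces may differ in dimension, basis, and scale. Brute force over
    permutations: feasible only for a handful of items.
    """
    sim_v = cosine_similarity_matrix(vision_vectors)
    sim_l = cosine_similarity_matrix(language_vectors)
    n = len(language_vectors)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(
            (sim_l[i][j] - sim_v[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

def on_circle(deg, scale=1.0):
    rad = math.radians(deg)
    return (scale * math.cos(rad), scale * math.sin(rad))

def embed3(deg, scale):
    """Place the same angular structure in a 3-D space with a different basis."""
    x, y = on_circle(deg, scale)
    return (0.0, x, y)

# Synthetic language embeddings (2-D), keyed by goal category.
LANGUAGE = {"chair": on_circle(0), "sofa": on_circle(30), "lamp": on_circle(100)}
# Synthetic vision embeddings (3-D) for three unlabeled map nodes: rotated by
# 45 degrees, rescaled, and listed in a different order (lamp, chair, sofa).
VISION = [embed3(145, 2.0), embed3(45, 0.5), embed3(75, 3.0)]

labels = list(LANGUAGE)
perm, cost = blind_match(VISION, [LANGUAGE[k] for k in labels])
assignment = {labels[i]: perm[i] for i in range(len(labels))}
print(assignment, cost)
```

The toy data is constructed so that the two similarity structures match exactly, which is the idealized case. Whether self-supervised features and language embeddings come close enough to this case in real scenes is precisely the load-bearing question.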

What would settle it

Apply the blind matching procedure on R2R-CE or OVON language goals against vision-built maps and measure success rate; if the rate equals or falls below random node selection, the claim that the alignment supports reliable navigation collapses.
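
A minimal sketch of that comparison, assuming a hypothetical per-episode log with the fields `map_nodes`, `goal_nodes`, and `success`. The four episodes below are synthetic placeholders, not results from the paper. The chance rate is the probability of hitting a goal node under uniform random node selection, and the tail probability is computed exactly for independent episodes with differing chance rates.

```python
def chance_rates(episodes):
    """Per-episode probability that a uniformly random node is a goal node."""
    return [ep["goal_nodes"] / ep["map_nodes"] for ep in episodes]

def tail_probability(probs, k):
    """P(at least k successes) for independent trials with the given probabilities."""
    dist = [1.0]  # dist[s] = P(exactly s successes so far)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1.0 - p)
            nxt[s + 1] += mass * p
        dist = nxt
    return sum(dist[k:])

def compare_to_chance(episodes):
    """Observed success rate, mean chance rate, and one-sided p-value against chance."""
    probs = chance_rates(episodes)
    successes = sum(1 for ep in episodes if ep["success"])
    return {
        "observed_rate": successes / len(episodes),
        "chance_rate": sum(probs) / len(probs),
        "p_value": tail_probability(probs, successes),
    }

# Synthetic placeholder episodes (hypothetical log format, illustrative values only).
EPISODES = [
    {"map_nodes": 4, "goal_nodes": 1, "success": True},
    {"map_nodes": 4, "goal_nodes": 1, "success": True},
    {"map_nodes": 4, "goal_nodes": 1, "success": False},
    {"map_nodes": 4, "goal_nodes": 1, "success": True},
]

result = compare_to_chance(EPISODES)
print(result)
```

A small p-value says only that matching beats random selection; it does not show that the margin is large enough to be useful for navigation, so the observed rate should be reported alongside it.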

Figures

Figures reproduced from arXiv: 2606.01788 by Junlin Long, Luke Borgnolo, Maxwell Twelftree, Xu Deng, Yang Zhao, Yiran Wang, Yue Yang, Zeyu Zhang.

**Figure 1.** Blind matching of vision and language in a navigation scene. Text and images are both abstractions of the same underlying world. Vision and language encoders f_v and f_l learn similar pairwise relations between concepts. We exploit these pairwise relations in a matching solver to recover valid correspondences between vision and language representations without requiring any paired data [57]. Two recent obse…

**Figure 2.** PlatonicNav pipeline. (a) Mapping: We construct the Platonic Topological Map as a semantic scene graph, where image segments are used as object nodes, and edges are weighted by both geometric distance and semantic distance computed from the vision embedding space. (b) Goal Selection: Given the natural-language instruction, we pairwise blind match language embeddings of goal object category and visual embedding of …

**Figure 3.** Vision-only ObjNav, VLN, and PlatonicNav trajectory comparison. Top-down trajectory maps of vision-only ObjNav (ObjectReact), VLN (ETPNav), and PlatonicNav with matched scenes and targets, corresponding to Step 1 and Step 2 of our thought experiment (Section 3.1). Trajectory similarity suggests that vision-only navigation implicitly encodes language-level semantic structure, motivating our investigation of…

**Figure 4.** Segment-based topological map. Image segments serve as graph nodes, and navigation is planned as a sequence of segment-level “hops” over a sparse graph. Figure adapted from [22]. A topological map represents an environment as a graph G = (V, E), where each node v ∈ V corresponds to an observation, landmark, object, or spatial region, and each edge e ∈ E encodes connectivity or traversal cost. Compared w…

**Figure 5.** Top-down trajectory map of vision-only ObjNav and PlatonicNav on HM3D-IIN. We visualize the navigation trajectories of vision-only ObjNav (e.g., ObjectReact [21]) and PlatonicNav with pure vision goal grounding. The trajectories are broadly similar, while PlatonicNav’s appear more direct than ObjectReact’s. Observing both similarity and difference between vision-only ObjNav’s tra…

**Figure 6.** Real-world robot platforms for evaluation. We deploy our method on a quadruped Unitree Go2 robot, providing robust perception and locomotion. These platforms demonstrate the applicability of Platonic Topological Maps in embodied systems. Evaluation Protocol. For both platforms, we construct topological maps from onboard sensory inputs and evaluate navigation performance under object-goal and language-condit…

**Figure 8.** ObjectNav Task 1, repeat phase. Qualitative visualization.

**Figure 9.** ObjectNav Task 2, teach phase. Qualitative visualization.

**Figure 10.** ObjectNav Task 2, repeat phase. Qualitative visualization.

**Figure 11.** ObjectNav Task 3, teach phase. Qualitative visualization.

**Figure 12.** ObjectNav Task 3, repeat phase. Qualitative visualization.

**Figure 13.** VLN teach phase. Qualitative visualization.

**Figure 14.** VLN repeat phase, go to the lamp. Qualitative visualization.

**Figure 15.** VLN repeat phase, find the plant. Qualitative visualization.

**Figure 16.** VLN repeat phase, go to the chair. Qualitative visualization.

**Figure 17.** VLN simulation task, bottom of stairs. Qualitative visualization. Instruction: "from here exit the living room turn left wait at the bottom of the stairs".

**Figure 18.** VLN simulation task, round rug near flowers. Qualitative visualization. Instruction: "from here head towards the stairs stop on the round rug next to the flowers".

**Figure 19.** VLN simulation task, chair near bar and table. Qualitative visualization. Instruction: "from here move ahead in between bar and table to the chair".

**Figure 20.** VLN simulation task, stairs before outside. Qualitative visualization.

**Figure 21.** VLN simulation task, large table. Qualitative visualization. Instruction: "from here turn left and go straight until you get to a large table".

**Figure 22.** VLN simulation task, tables and chairs. Qualitative visualization. Instruction: "from here turn left and go straight until you get to three tables with chairs turn left and wait near the couc" [truncated in source].

**Figure 23.** VLN simulation task, hallway to stairs. Qualitative visualization. Instruction: "from here turn left continue down the hallway until you get to the stairs wait there".

**Figure 24.** VLN simulation task, walk down stairs. Qualitative visualization.

**Figure 25.** VLN simulation task, first set of stairs. Qualitative visualization. Instruction: "from here walk down the first set of stairs wait there".

**Figure 26.** VLN simulation task, dining room island. Qualitative visualization. Instruction: "from here walk into the dining room area stop in front of the island".

**Figure 27.** VLN simulation task, kitchen and buffet. Qualitative visualization. Instruction: "from here walk into the kitchen around the dining table to the buffet stop and wait there".

**Figure 28.** VLN simulation task, fireplace. Qualitative visualization.

**Figure 29.** VLN simulation task, office desk. Qualitative visualization. Instruction: "from here walk towards the desk in the office area stop next to the desk".

**Figure 30.** ObjNav simulation task, refrigerator. Qualitative visualization.

**Figure 31.** ObjNav simulation task, TV stand. Qualitative visualization.

**Figure 32.** ObjNav simulation task, dining chair. Qualitative visualization.

**Figure 33.** ObjNav simulation task, desk. Qualitative visualization.

**Figure 34.** ObjNav simulation task, chair. Qualitative visualization.

**Figure 35.** ObjNav simulation task, sofa chair. Qualitative visualization.

**Figure 36.** ObjNav simulation task, photo. Qualitative visualization.

Panel rows recoverable from the extraction: ego-view, depth, and point map for the real-robot figures; ego-view, depth, and BEV for the simulation figures; each over steps t1 to t8.

External assets (Appendix I, as extracted alongside Figure 36): We list the existing assets used in PlatonicNav, together with their versions or identifiers and license terms. HM3D: v0.2; Matterport End User License Agreement for Academic Use of Model Data. HM3D-IIN: HM3D Instance ImageNav v3; MIT code; HM3D-derived data under Matterport HM3D terms. HM3D-OVON: official episodes; MIT-listed rel…

Original abstract

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PlatonicNav claims a training-free topological map from self-supervised vision can unify ObjNav and VLN via blind language matching, but the abstract gives no numbers or controls to check if the alignment actually works.

The main point is that this paper extends the Platonic Representation Hypothesis to navigation by building a Platonic Topological Map from a self-supervised visual encoder, fusing geometric and semantic distances, and grounding language goals through blind matching without any paired data or cross-modal training. It treats vision-only ObjNav, cross-modal ObjNav, and VLN as different interfaces to the same object-centric manifold.

What stands out as new is the training-free construction and the blind-matching step, which avoids the usual explicit supervision like CLIP. The experiments cover simulation benchmarks (HM3D-IIN, OVON, R2R-CE on MP3D) plus a real-robot deployment on Unitree Go2, and the code is released, which helps reproducibility.

The soft spots are noticeable. The abstract supplies no quantitative results, error bars, ablation studies, or encoder details, so there is no way to verify whether the blind matching succeeds reliably or whether the claimed generalization across tasks and embodiments holds. The central concern is that the whole approach rests on the self-supervised visual features already being sufficiently isomorphic to language embeddings for nearest-neighbor matching to work without tuning. If that alignment is weak or benchmark-specific, the central framing collapses. The dependence on external pre-trained representations is also a real burden, because their properties are not examined here.

This is aimed at researchers in embodied navigation who want training-free or zero-shot map-based methods. A reader focused on topological representations or the Platonic hypothesis might find the framing useful even if the results need more scrutiny.

It deserves peer review because the idea is coherent and the multi-benchmark plus real-robot setup is a reasonable starting point, even though the current write-up leaves the key assumption untested.

Referee Report

2 major / 1 minor

Summary. The paper extends the Platonic Representation Hypothesis to embodied navigation by recasting vision-only ObjNav, cross-modal ObjNav, and VLN as interfaces to the same object-centric semantic manifold. It introduces the training-free PlatonicNav framework, whose Platonic Topological Map is built from a self-supervised visual encoder, fuses geometric and semantic node distances, and grounds language goals via blind matching without any paired vision-language data or explicit cross-modal supervision. Experiments on HM3D-IIN, OVON, and R2R-CE (MP3D) plus real-robot deployment on Unitree Go2 are presented as evidence of generalization across tasks, modalities, and embodiments.

Significance. If the blind-matching results hold under the stated conditions, the work would be significant for showing that semantic alignment between independently trained vision and language representations can be exploited directly in navigation without additional cross-modal training or paired data. The public code release and website are strengths that support reproducibility.

major comments (2)

[Abstract] Abstract: the central claim that language grounding occurs via 'blind matching' without paired vision-language data rests on the unverified assumption that a self-supervised visual encoder already produces node features whose geometry is sufficiently isomorphic to language embeddings for reliable nearest-neighbor matching; no description of the encoder, any projection or normalization step, or controls isolating this alignment from benchmark-specific artifacts is supplied.
[Abstract] Abstract / Experiments: the generalization claim across three navigation tasks and a real robot is asserted, yet the abstract supplies no quantitative success rates, error bars, ablation results, or baseline comparisons; without these data it is impossible to determine whether the reported performance supports the 'three different interfaces to the same manifold' framing or is an artifact of the chosen benchmarks.
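
One control of the kind requested above could be a direct alignment probe that is independent of any navigation benchmark. The sketch below computes a mutual k-nearest-neighbor overlap between two embedding spaces over the same items, together with a shuffled-correspondence control. The metric choice, the function names, and the synthetic data are assumptions for illustration and are not taken from the paper.

```python
import math

def knn_indices(vectors, k):
    """For each vector, the set of its k most cosine-similar neighbors (self excluded)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    units = [unit(v) for v in vectors]
    result = []
    for i, u in enumerate(units):
        sims = [
            (sum(a * b for a, b in zip(u, w)), j)
            for j, w in enumerate(units)
            if j != i
        ]
        sims.sort(reverse=True)
        result.append({j for _, j in sims[:k]})
    return result

def mutual_knn_alignment(space_a, space_b, k):
    """Mean fraction of shared k-nearest neighbors; item i must be the same concept in both."""
    nn_a = knn_indices(space_a, k)
    nn_b = knn_indices(space_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(space_a)

def ring(angles_deg, offset=0.0, scale=1.0):
    """Synthetic 2-D embeddings placed at the given angles."""
    return [
        (scale * math.cos(math.radians(a + offset)),
         scale * math.sin(math.radians(a + offset)))
        for a in angles_deg
    ]

# Synthetic spaces: the second is a rotated, rescaled copy of the first.
space_vision = ring([0, 10, 90, 100])
space_language = ring([0, 10, 90, 100], offset=45.0, scale=3.0)
# Control: same points, but items 1 and 2 swapped, so correspondences are broken.
space_shuffled = ring([0, 90, 10, 100], offset=45.0, scale=3.0)

aligned = mutual_knn_alignment(space_vision, space_language, k=1)
shuffled = mutual_knn_alignment(space_vision, space_shuffled, k=1)
print(aligned, shuffled)
```

Unlike blind matching itself, this probe needs ground-truth correspondences, so it serves only as an offline diagnostic of how much shared structure the two encoders actually have.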

minor comments (1)

[Abstract] The term 'Platonic Topological Map' is introduced as a new entity but its precise construction (node definition, distance fusion formula) is not formalized in the abstract, which hinders immediate assessment of the 'training-free' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we respond point-by-point to the major comments, clarifying the role of the abstract versus the full paper and indicating where revisions will strengthen the presentation.

Referee: [Abstract] Abstract: the central claim that language grounding occurs via 'blind matching' without paired vision-language data rests on the unverified assumption that a self-supervised visual encoder already produces node features whose geometry is sufficiently isomorphic to language embeddings for reliable nearest-neighbor matching; no description of the encoder, any projection or normalization step, or controls isolating this alignment from benchmark-specific artifacts is supplied.

Authors: The abstract is a concise summary and therefore omits implementation specifics that appear in the Methods section, where the self-supervised visual encoder (including its architecture and any normalization) is fully specified and the Platonic Topological Map construction is detailed. The claim of blind matching without paired data or projection layers is not an unverified assumption; it is empirically tested by the consistent navigation performance across three distinct benchmarks and a real-robot embodiment, none of which involve cross-modal fine-tuning. These results across varied environments function as the control isolating semantic alignment from benchmark artifacts. We will revise the abstract to name the encoder family and note the absence of learned projections. revision: partial
Referee: [Abstract] Abstract / Experiments: the generalization claim across three navigation tasks and a real robot is asserted, yet the abstract supplies no quantitative success rates, error bars, ablation results, or baseline comparisons; without these data it is impossible to determine whether the reported performance supports the 'three different interfaces to the same manifold' framing or is an artifact of the chosen benchmarks.

Authors: We agree that the abstract would be strengthened by including representative quantitative results. The full manuscript already reports success rates, standard deviations, ablations on geometric versus semantic distances, and baseline comparisons on HM3D-IIN, OVON, and R2R-CE, plus real-robot metrics on Unitree Go2; these numbers directly support the unified-manifold interpretation because the same map and matching procedure succeed without task-specific training. We will incorporate key quantitative highlights (e.g., success rates on the primary benchmarks) into the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; framework applies external pre-trained encoder without internal reduction

The paper presents PlatonicNav as a training-free framework that builds a topological map from a self-supervised visual encoder and performs blind matching for language grounding. No equations, parameters, or steps are shown to reduce by construction to fitted inputs or self-citations within the paper itself. The claimed semantic alignment is imported from the external encoder's properties rather than derived or assumed via self-definition. The extension of the Platonic Representation Hypothesis is framed as an application to navigation tasks, with generalization demonstrated via external benchmarks (HM3D-IIN, OVON, R2R-CE) rather than internal fitting loops. This is the most common honest finding for papers that leverage pre-trained models without re-deriving their representations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based on the abstract alone, the central claim rests on the Platonic Representation Hypothesis applied to navigation and the assumption that blind matching suffices for language grounding. No explicit free parameters are named. The new map structure is introduced without independent evidence outside the reported experiments.

axioms (2)

domain assumption: Independently trained vision and language encoders share a common semantic structure (Platonic Representation Hypothesis).
Invoked as the foundation for treating different navigation tasks as interfaces to one manifold.
ad hoc to paper: Blind matching of language goals to vision-derived map nodes works without paired vision-language data.
This premise enables the training-free claim and is not derived from prior literature cited in the abstract.

invented entities (1)

Platonic Topological Map (no independent evidence)
purpose: Fuses geometric and semantic node distances from a self-supervised visual encoder to support blind language grounding.
New map representation introduced by the paper; no independent evidence provided in the abstract.

Reference graph

Works this paper leans on

78 extracted references · 4 canonical work pages · 2 internal anchors

[1]

BEVBert: Multimodal map pre-training for language-guided navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. BEVBert: Multimodal map pre-training for language-guided navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2022

2022
[2]

ETPNav: Evolving topological planning for vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. ETPNav: Evolving topological planning for vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

2023
[3]

1st place solutions for RxR-habitat vision-and-language navigation competition (CVPR 2022), 2022

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for RxR-habitat vision-and-language navigation competition (CVPR 2022), 2022

2022
[4]

Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, V

Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, V . Koltun, J. Kosecka, Jitendra Malik, Roozbeh Mottaghi, M. Savva, et al. On evaluation of embodied navigation agents.arXiv.org, 2018

2018
[5]

Reid, Stephen Gould, and Anton van den Hengel

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, I. Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017

2017
[6]

Qwen3-vl technical report.arXiv.org, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xiong-Hui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Rongyao Fang, Chang Gao, et al. Qwen3-vl technical report.arXiv.org, 2025

2025
[7]

Objectnav revisited: On evaluation of embodied agents navigating to objects, 2020

Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects, 2020

2020
[8]

Mairal, Piotr Bojanowski, and Armand Joulin

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv’e J’egou, J. Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE International Conference on Computer Vision, pages 9630–9640. IEEE, 2021

2021
[9]

Chang, Angela Dai, T

Angel X. Chang, Angela Dai, T. Funkhouser, Maciej Halber, M. Nießner, M. Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision, pages 667–676, 2017

2017
[10]

Gandhi, Abhinav Gupta, and Russ Salakhutdinov

Devendra Singh Chaplot, Dhiraj P. Gandhi, Abhinav Gupta, and Russ Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. InNeural Information Processing Systems, 2020

2020
[11]

Neural topological SLAM for visual navigation

Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. InComputer Vision and Pattern Recognition, pages 12872–12881. IEEE, 2020

2020
[12]

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, and Kwan-Yee K. Wong. Affordances- oriented planning using foundation models for continuous vision-language navigation. InAAAI Conference on Artificial Intelligence, 2024

2024
[13]

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K. Wong. MapGPT: Map-guided prompting with adaptive path planning for vision-and-language naviga- tion. InAnnual Meeting of the Association for Computational Linguistics, pages 9796–9810. Association for Computational Linguistics, 2024

2024
[14]

V’azquez, and S

Kevin Chen, Junshen Chen, Jo Chuang, M. V’azquez, and S. Savarese. Topological plan- ning with transformers for vision-and-language navigation. InComputer Vision and Pattern Recognition, pages 11276–11286, 2020

2020
[15]

Li, Mingkui Tan, and Chuang Gan

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. In Neural Information Processing Systems, volume 35, pages 38149–38161. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022. 10

2022
[16]

History aware multimodal transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. InNeural Information Processing Systems, 2021

2021
[17]

Think global, act local: Dual-scale graph transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. InComputer Vision and Pattern Recognition, pages 16516–16526. IEEE, 2022

2022
[18]

NaVILA: Legged robot vision-language-action model for navigation.Robotics, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged robot vision-language-action model for navigation.Robotics, 2024

2024
[19]

CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. InComputer Vision and Pattern Recognition, 2022

2022
[20]

Octonav: Towards generalist embodied navigation.arXiv.org, 2025

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: Towards generalist embodied navigation.arXiv.org, 2025

2025
[21]

Objectreact: Learning object-relative control for visual navigation

Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, and Ian Reid. Objectreact: Learning object-relative control for visual navigation. arXiv.org, 2025

2025
[22]

Hosseinzadeh, Lachlan Mares, Niko Sunderhauf, Feras Dayoub, and Ian Reid

Sourav Garg, Krishan Rana, M. Hosseinzadeh, Lachlan Mares, Niko Sunderhauf, Feras Dayoub, and Ian Reid. Robohop: Segment-based topological map representation for open-world visual navigation. InIEEE International Conference on Robotics and Automation, pages 4090–4097. IEEE, IEEE, 2024

2024
[23]

Cross-modal map learning for vision and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In Computer Vision and Pattern Recognition, pages 15439–15449. IEEE, 2022

2022
[24]

Revisiting the platonic representation hypothesis: An aristotelian view

Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the platonic representation hypothesis: An aristotelian view. arXiv.org, 2026

2026
[25]

ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Ramalingam Chellappa, et al. ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. In IEEE International Conference on Robotics and Automation, 2023

2023
[26]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In Computer Vision and Pattern Recognition, 2021

2021
[27]

Learning navigational visual representations with semantic map supervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. In IEEE International Conference on Computer Vision, pages 3032–3044. IEEE, 2023

2023
[28]

Visual language maps for robot navigation

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In IEEE International Conference on Robotics and Automation, 2022

2022
[29]

Mobilevla-r1: Reinforcing vision-language-action for mobile robots

Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, and Hao Tang. Mobilevla-r1: Reinforcing vision-language-action for mobile robots. arXiv preprint arXiv:2511.17889, 2025

2025
[30]

The platonic representation hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. International Conference on Machine Learning, 2024

2024
[31]

Conceptfusion: Open-set multimodal 3D mapping

Krishna Murthy Jatavallabhula, Ali Kuwajerwala, Qiao Gu, Mohd. Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Varma Keetha, A. Tewari, et al. Conceptfusion: Open-set multimodal 3D mapping. In Robotics: Science and Systems. Robotics: Science and Systems Foundation, 2023

2023
[32]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021

2021
[33]

Segment anything

A. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, A. Berg, Wan-Yen Lo, et al. Segment anything. In IEEE International Conference on Computer Vision, pages 3992–4003. IEEE, 2023

2023
[34]

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, and Alexei A. Efros. Back into Plato’s cave: Examining cross-modal representational convergence at scale. arXiv preprint arXiv:2604.18572, 2026

2026
[35]

Sim-2-sim transfer for vision-and-language navigation in continuous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 588–603. Springer Nature Switzerland, 2022

2022
[36]

Instance-specific image goal navigation: Training embodied agents to find object instances

Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, and Devendra Singh Chaplot. Instance-specific image goal navigation: Training embodied agents to find object instances, 2022

2022
[37]

Beyond the nav-graph: Vision and language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer International Publishing, 2020

2020
[38]

Beyond the nav-graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer, 2020

2020
[39]

Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Conference on Empirical Methods in Natural Language Processing, pages 4392–4412. Association for Computational Linguistics, 2020

2020
[40]

Nav-r1: Reasoning and navigation in embodied scenes

Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025

2025
[41]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment, 2024

2024
[42]

ZSON: Zero-shot object-goal navigation using multimodal goal embeddings

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. In Neural Information Processing Systems, pages 32340–32352. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022

2022
[43]

Large dual encoders are generalizable retrievers

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Conference on Empirical Methods in Natural Language Processing, 2021

2021
[44]

DINOv2: Learning robust visual features without supervision

M. Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res., 2023

2023
[45]

Openscene: 3D scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Max Jiang, Andrea Tagliasacchi, M. Pollefeys, and T. Funkhouser. Openscene: 3D scene understanding with open vocabularies. In Computer Vision and Pattern Recognition, 2022

2022
[46]

VLN-R1: Vision-language navigation via reinforcement fine-tuning

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. VLN-R1: Vision-language navigation via reinforcement fine-tuning, 2025

2025
[47]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, S. Agarwal, G. Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

2021
[48]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2019

2019
[49]

Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In NeurIPS Datasets and Benchmarks, 2021

2021
[50]

PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav

Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav. In Computer Vision and Pattern Recognition, pages 17896–17906. IEEE, 2023

2023
[51]

Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale

Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In Computer Vision and Pattern Recognition, pages 5163–5173. IEEE, 2022

2022
[52]

Language-aligned waypoint (LAW) supervision for vision-and-language navigation in continuous environments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (LAW) supervision for vision-and-language navigation in continuous environments. In Conference on Empirical Methods in Natural Language Processing, pages 4018–4028. Association for Computational Linguistics, 2021

2021
[53]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing, pages 3980–3990. Association for Computational Linguistics, 2019

2019
[54]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2010

2010
[55]

Semi-parametric topological memory for navigation

Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018

2018
[56]

Habitat: A platform for embodied AI research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In IEEE International Conference on Computer Vision, pages 9338–9346. IEEE, 2019

2019
[57]

It’s a (blind) match! towards vision-language correspondence without parallel data

Dominik Schnaus, Nikita Araslanov, and Daniel Cremers. It’s a (blind) match! towards vision-language correspondence without parallel data. In Computer Vision and Pattern Recognition, pages 24983–24992. IEEE, 2025

2025
[58]

Proximal policy optimization algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv.org, 2017

2017
[59]

Navigation with large language models: Semantic guesswork as a heuristic for planning

Dhruv Shah, Michael Equi, Blazej Osinski, Fei Xia, Brian Ichter, and Sergey Levine. Navigation with large language models: Semantic guesswork as a heuristic for planning. In Conference on Robot Learning, 2023

2023
[60]

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

2025
[61]

DREAMWALKER: Mental planning for continuous vision-language navigation

Hanqing Wang, Wei Liang, L. Gool, and Wenguan Wang. DREAMWALKER: Mental planning for continuous vision-language navigation. In IEEE International Conference on Computer Vision, pages 10839–10849. IEEE, 2023

2023
[62]

Lookahead exploration with neural radiance representation for continuous vision-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision-language navigation. In Computer Vision and Pattern Recognition, pages 13753–13762. IEEE, 2024

2024
[63]

Gridmm: Grid memory map for vision-and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In IEEE International Conference on Computer Vision, pages 15579–15590. IEEE, 2023

2023
[64]

Streamvln: Streaming vision-and-language navigation via slowfast context modeling

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv.org, 2025

2025
[65]

DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames

Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In International Conference on Learning Representations, 2019

2019
[66]

VLFM: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation, 2023

2023
[67]

HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation

Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5543–5550. IEEE, 2024

2024
[68]

Escaping Plato’s cave: JAM for aligning independently trained vision and language models

Lauren Hyoseo Yoon, Yisong Yue, and Been Kim. Escaping Plato’s cave: JAM for aligning independently trained vision and language models. arXiv.org, 2025

2025
[69]

L3MVN: Leveraging large language models for visual target navigation

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3MVN: Leveraging large language models for visual target navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3554–3560. IEEE, 2023

2023
[70]

Correctnav: Self-correction flywheel empowers vision-language-action navigation model

Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, and Hao Dong. Correctnav: Self-correction flywheel empowers vision-language-action navigation model, 2025

2025
[71]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE International Conference on Computer Vision, pages 11941–11952. IEEE, 2023

2023
[72]

Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv.org, 2024

2024
[73]

Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. In arXiv.org, 2024

2024
[74]

NaVid: Video-based VLM plans the next step for vision-and-language navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wanggui He. NaVid: Video-based VLM plans the next step for vision-and-language navigation. In Robotics: Science and Systems. Robotics: Science and Systems Foundation, 2024

2024
[75]

Fast segment anything

Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv.org, 2023

2023
[76]

NavGPT: Explicit reasoning in vision-and-language navigation with large language models

Gengze Zhou, Yicong Hong, and Qi Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In AAAI Conference on Artificial Intelligence, 2023

2023
[77]

ESC: Exploration with soft commonsense constraints for zero-shot object navigation

Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. ESC: Exploration with soft commonsense constraints for zero-shot object navigation. In International Conference on Machine Learning, 2023

2023
[78]

Move to understand a 3D scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation

Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3D scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In IEEE International Conference on Computer Vision, pages 8120–8132. IEEE, 2025

2025

[1]

BEVBert: Multimodal map pre-training for language-guided navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. BEVBert: Multimodal map pre-training for language-guided navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2022

2022

[2]

ETPNav: Evolving topological planning for vision-language navigation in continuous environments

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. ETPNav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

2023

[3]

1st place solutions for RxR-habitat vision-and-language navigation competition (CVPR 2022)

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for RxR-habitat vision-and-language navigation competition (CVPR 2022), 2022

2022

[4]

On evaluation of embodied navigation agents

Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, V. Koltun, J. Kosecka, Jitendra Malik, Roozbeh Mottaghi, M. Savva, et al. On evaluation of embodied navigation agents. arXiv.org, 2018

2018

[5]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, I. Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017

2017

[6]

Qwen3-vl technical report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xiong-Hui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Rongyao Fang, Chang Gao, et al. Qwen3-vl technical report. arXiv.org, 2025

2025

[7]

Objectnav revisited: On evaluation of embodied agents navigating to objects

Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects, 2020

2020

[8]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, J. Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE International Conference on Computer Vision, pages 9630–9640. IEEE, 2021

2021

[9]

Matterport3d: Learning from rgb-d data in indoor environments

Angel X. Chang, Angela Dai, T. Funkhouser, Maciej Halber, M. Nießner, M. Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision, pages 667–676, 2017

2017

[10]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj P. Gandhi, Abhinav Gupta, and Russ Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems, 2020

2020

[11]

Neural topological SLAM for visual navigation

Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. In Computer Vision and Pattern Recognition, pages 12872–12881. IEEE, 2020

2020

[12]

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, and Kwan-Yee K. Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In AAAI Conference on Artificial Intelligence, 2024

2024

[13]

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K. Wong. MapGPT: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Annual Meeting of the Association for Computational Linguistics, pages 9796–9810. Association for Computational Linguistics, 2024

2024

[14]

Topological planning with transformers for vision-and-language navigation

Kevin Chen, Junshen Chen, Jo Chuang, M. Vázquez, and S. Savarese. Topological planning with transformers for vision-and-language navigation. In Computer Vision and Pattern Recognition, pages 11276–11286, 2020

2020

[15]

Weakly-supervised multi-granularity map learning for vision-and-language navigation

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. In Neural Information Processing Systems, volume 35, pages 38149–38161. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022

2022

[16] [16]

History aware multimodal transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. InNeural Information Processing Systems, 2021

2021

[17] [17]

Think global, act local: Dual-scale graph transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. InComputer Vision and Pattern Recognition, pages 16516–16526. IEEE, 2022

2022

[18] [18]

NaVILA: Legged robot vision-language-action model for navigation.Robotics, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged robot vision-language-action model for navigation.Robotics, 2024

2024

[19] [19]

CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. InComputer Vision and Pattern Recognition, 2022

2022

[20] [20]

Octonav: Towards generalist embodied navigation.arXiv.org, 2025

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: Towards generalist embodied navigation.arXiv.org, 2025

2025

[21] [21]

Objectreact: Learning object-relative control for visual navigation

Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, and Ian Reid. Objectreact: Learning object-relative control for visual navigation. arXiv.org, 2025

2025

[22] [22]

Hosseinzadeh, Lachlan Mares, Niko Sunderhauf, Feras Dayoub, and Ian Reid

Sourav Garg, Krishan Rana, M. Hosseinzadeh, Lachlan Mares, Niko Sunderhauf, Feras Dayoub, and Ian Reid. Robohop: Segment-based topological map representation for open-world visual navigation. InIEEE International Conference on Robotics and Automation, pages 4090–4097. IEEE, IEEE, 2024

2024

[23] [23]

Cross-modal map learning for vision and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In Computer Vision and Pattern Recognition, pages 15439–15449. IEEE, 2022

2022

[24] [24]

Revisiting the platonic representation hypothesis: An aristotelian view.arXiv.org, 2026

Fabian Gröger, Shuo Wen, and Maria Brbi’c. Revisiting the platonic representation hypothesis: An aristotelian view.arXiv.org, 2026

2026

[25] [25]

Concept- graphs: Open-vocabulary 3D scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Ramalingam Chellappa, et al. Concept- graphs: Open-vocabulary 3D scene graphs for perception and planning. InIEEE International Conference on Robotics and Automation, 2023

2023

[26] [26]

Girshick

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll’ar, and Ross B. Girshick. Masked autoencoders are scalable vision learners. InComputer Vision and Pattern Recognition, 2021

2021

[27] [27]

Learning navigational visual representations with semantic map supervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. InIEEE International Conference on Computer Vision, pages 3032–3044. IEEE, 2023

2023

[28] [28]

Visual language maps for robot navigation

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. InIEEE International Conference on Robotics and Automation, 2022

2022

[29] [29]

Mobilevla-r1: Reinforcing vision-language-action for mobile robots.arXiv preprint arXiv:2511.17889, 2025

Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, and Hao Tang. Mobilevla-r1: Reinforcing vision-language-action for mobile robots.arXiv preprint arXiv:2511.17889, 2025

work page arXiv 2025

[30] [30]

The platonic representation hypothesis.International Conference on Machine Learning, 2024

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.International Conference on Machine Learning, 2024

2024

[31] [31]

Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Varma Keetha, A

Krishna Murthy Jatavallabhula, Ali Kuwajerwala, Qiao Gu, Mohd. Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Varma Keetha, A. Tewari, et al. Conceptfusion: Open-set multimodal 3D mapping. InRobotics: Science and Systems. Robotics: Science and Systems Foundation, 2023

2023

[32] [32]

Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational Conference on Machine Learning, 2021. 11

2021

[33] [33]

Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, A

A. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, A. Berg, Wan-Yen Lo, et al. Segment anything. InIEEE International Conference on Computer Vision, pages 3992–4003. IEEE, 2023

2023

[34] [34]

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, and Alexei A. Efros. Back into Plato’s cave: Examining cross-modal representational convergence at scale.arXiv preprint arXiv:2604.18572, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Sim-2-sim transfer for vision-and-language navigation in contin- uous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in contin- uous environments. InEuropean Conference on Computer Vision, pages 588–603. Springer Nature Switzerland, 2022

2022

[36] [36]

Instance- specific image goal navigation: Training embodied agents to find object instances, 2022

Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, and Devendra Singh Chaplot. Instance- specific image goal navigation: Training embodied agents to find object instances, 2022

2022

[37] [37]

Beyond the nav- graph: Vision and language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav- graph: Vision and language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120. Springer International Publishing, 2020

2020

[38] [38]

Beyond the nav- graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav- graph: Vision-and-language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

2020

[39] [39]

Room-Across- Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-Across- Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. InConference on Empirical Methods in Natural Language Processing, pages 4392–4412. Association for Computational Linguistics, 2020

2020

[40] [40]

Nav-r1: Reasoning and navigation in embodied scenes.arXiv preprint arXiv:2509.10884, 2025

Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes.arXiv preprint arXiv:2509.10884, 2025

work page arXiv 2025

[41] [41]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment, 2024

2024

[42] [42]

ZSON: Zero-shot object-goal navigation using multimodal goal embeddings

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. InNeural Information Processing Systems, pages 32340–32352. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022

2022

[43] [43]

Hall, Ming-Wei Chang, et al

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Conference on Empirical Methods in Natural Language Processing, 2021

2021

[44] [44]

Oquab, Timothée Darcet, Théo Moutakanni, Huy V

M. Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision.Trans. Mach. Learn. Res., 2023

2023

[45] [45]

Pollefeys, and T

Songyou Peng, Kyle Genova, ChiyuMaxJiang, Andrea Tagliasacchi, M. Pollefeys, and T. Funkhouser. Openscene: 3D scene understanding with open vocabularies. InComputer Vision and Pattern Recognition, 2022

2022

[46] [46]

VLN-R1: Vision-language navigation via reinforcement fine-tuning, 2025

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. VLN-R1: Vision-language navigation via reinforcement fine-tuning, 2025

2025

[47] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, S. Agarwal, G. Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[48] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2019.

[49] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In NeurIPS Datasets and Benchmarks, 2021.

[50] Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav. In Computer Vision and Pattern Recognition, pages 17896–17906. IEEE, 2023.

[51] Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In Computer Vision and Pattern Recognition, pages 5163–5173. IEEE, 2022.

[52] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (LAW) supervision for vision-and-language navigation in continuous environments. In Conference on Empirical Methods in Natural Language Processing, pages 4018–4028. Association for Computational Linguistics, 2021.

[53] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing, pages 3980–3990. Association for Computational Linguistics, 2019.

[54] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2010.

[55] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018.

[56] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In IEEE International Conference on Computer Vision, pages 9338–9346. IEEE, 2019.

[57] Dominik Schnaus, Nikita Araslanov, and Daniel Cremers. It’s a (blind) match! towards vision-language correspondence without parallel data. In Computer Vision and Pattern Recognition, pages 24983–24992. IEEE, 2025.

[58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv.org, 2017.

[59] Dhruv Shah, Michael Equi, Blazej Osinski, Fei Xia, Brian Ichter, and Sergey Levine. Navigation with large language models: Semantic guesswork as a heuristic for planning. In Conference on Robot Learning, 2023.

[60] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[61] Hanqing Wang, Wei Liang, L. Gool, and Wenguan Wang. DREAMWALKER: Mental planning for continuous vision-language navigation. In IEEE International Conference on Computer Vision, pages 10839–10849. IEEE, 2023.

[62] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision-language navigation. In Computer Vision and Pattern Recognition, pages 13753–13762. IEEE, 2024.

[63] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In IEEE International Conference on Computer Vision, pages 15579–15590. IEEE, 2023.

[64] Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv.org, 2025.

[65] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In International Conference on Learning Representations, 2019.

[66] Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation, 2023.

[67] Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5543–5550. IEEE, 2024.

[68] Lauren Hyoseo Yoon, Yisong Yue, and Been Kim. Escaping Plato’s cave: JAM for aligning independently trained vision and language models. arXiv.org, 2025.

[69] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3MVN: Leveraging large language models for visual target navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3554–3560. IEEE, 2023.

[70] Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, and Hao Dong. Correctnav: Self-correction flywheel empowers vision-language-action navigation model, 2025.

[71] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE International Conference on Computer Vision, pages 11941–11952. IEEE, 2023.

[72] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv.org, 2024.

[73] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. In arXiv.org, 2024.

[74] Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wanggui He. NaVid: Video-based VLM plans the next step for vision-and-language navigation. In Robotics: Science and Systems. Robotics: Science and Systems Foundation, 2024.

[75] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv.org, 2023.

[76] Gengze Zhou, Yicong Hong, and Qi Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In AAAI Conference on Artificial Intelligence, 2023.

[77] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. ESC: Exploration with soft commonsense constraints for zero-shot object navigation. In International Conference on Machine Learning, 2023.

[78] Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3D scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In IEEE International Conference on Computer Vision, pages 8120–8132. IEEE, 2025.

A Preliminaries

A.1 Pla...