POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

Kangning Niu; Meisheng Zhang; Mingchao Sun; Mu Xu; Qiming Li; Ruiyan Gong; Tianlun Li; Wei Guo; Xiaolong Cheng; Xiaolong Wu

arxiv: 2605.28237 · v1 · pith:USKRT5B2new · submitted 2026-05-27 · 💻 cs.RO · cs.CV

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

Pith reviewed 2026-06-29 11:46 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: vision language navigation, point of interest, robot navigation, 3D Gaussian splatting, benchmark, final meters, real world navigation, waypoints

The pith

POINav-Bench and the Brain-Action Framework enable precise POI-goal navigation in reconstructed real-world environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POINav-Bench, a benchmark with 11 commercial areas reconstructed via 3D Gaussian Splatting spanning over 126,000 square meters and 163 points of interest. It includes traversability annotations and reference paths for closed-loop testing of navigation agents. The authors also introduce the POINav Brain-Action Framework that uses a reasoning module to guide waypoint prediction based on POIs. This matters because current vision-language navigation often fails to reach exact destinations in the final meters despite succeeding at higher-level tasks. The work aims to close the gap between simulation and real-world execution for practical robot navigation in stores and similar spaces.
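To make "closed-loop testing" with traversability annotations concrete, here is a minimal sketch of an evaluation loop. It is not the benchmark's code: the boolean grid, the `run_episode` and `greedy_policy` names, the 1 m success radius, and the step size are all illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in meters; hypothetical 2D simplification


def run_episode(
    policy: Callable[[Point, Point], Point],
    start: Point,
    goal: Point,
    traversable: List[List[bool]],  # assumed grid: traversable[row][col]
    cell_size: float = 1.0,         # assumed cell edge length in meters
    success_radius: float = 1.0,    # assumed arrival threshold
    max_steps: int = 200,
) -> Dict[str, object]:
    """Roll the policy forward one waypoint at a time and report the outcome."""
    pos = start
    path_length = 0.0
    for step in range(max_steps):
        if math.dist(pos, goal) <= success_radius:
            return {"status": "success", "steps": step, "path_length": path_length}
        nxt = policy(pos, goal)
        row = int(nxt[1] // cell_size)
        col = int(nxt[0] // cell_size)
        inside = 0 <= row < len(traversable) and 0 <= col < len(traversable[0])
        # Closed loop: the environment, not the agent, decides if the move is legal.
        if not inside or not traversable[row][col]:
            return {"status": "collision", "steps": step, "path_length": path_length}
        path_length += math.dist(pos, nxt)
        pos = nxt
    return {"status": "timeout", "steps": max_steps, "path_length": path_length}


def greedy_policy(pos: Point, goal: Point, step: float = 0.5) -> Point:
    """Toy stand-in for a navigation agent: move straight toward the goal."""
    d = math.dist(pos, goal)
    if d <= step:
        return goal
    return (pos[0] + step * (goal[0] - pos[0]) / d,
            pos[1] + step * (goal[1] - pos[1]) / d)


if __name__ == "__main__":
    grid = [[True] * 5 for _ in range(5)]
    print(run_episode(greedy_policy, (0.5, 0.5), (4.5, 0.5), grid))
```

The point of the sketch is that each action is checked against the scene before the next observation is produced, which is what separates closed-loop evaluation from scoring predictions against a fixed log.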

Core claim

POINav-Bench provides the first closed-loop evaluation platform for real-world POI-goal navigation using high-fidelity 3DGS reconstructions of 11 commercial areas with traversability-aware annotations and reference trajectories. The POINav Brain-Action Framework employs a Brain module for POI-grounded reasoning to direct an Action module in generating continuous waypoints. Together with the POINav-Dataset of 70K real-world signage-entrance pairs, these tools demonstrate a viable approach to refining final-meters arrival in POI-rich environments.

What carries the argument

The POINav Brain-Action Framework, with its Brain module performing POI-grounded reasoning and Action module predicting continuous waypoints.
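The decoupling described here can be illustrated with a toy interface. This is a sketch under strong assumptions, not the paper's models: `VisualReference`, `brain_ground`, and `action_waypoints` are hypothetical names, the Brain is replaced by a dictionary lookup, and the Action step is a pinhole-camera bearing computation rather than a learned predictor.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class VisualReference:
    """Hypothetical Brain output: a pixel box around the target entrance."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center_x(self) -> float:
        return 0.5 * (self.x_min + self.x_max)


def brain_ground(poi_name: str,
                 detections: Dict[str, Tuple[float, float, float, float]]) -> VisualReference:
    """Stand-in for POI-grounded reasoning: pick the box for the named POI."""
    return VisualReference(*detections[poi_name])


def action_waypoints(ref: VisualReference, image_width: int,
                     hfov_deg: float = 90.0, step: float = 0.5,
                     n: int = 4) -> List[Tuple[float, float]]:
    """Stand-in for waypoint prediction: walk along the bearing of the box center.

    Waypoints are (forward, right) offsets in meters in an assumed robot frame.
    """
    focal = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    bearing = math.atan2(ref.center_x() - image_width / 2, focal)
    return [(k * step * math.cos(bearing), k * step * math.sin(bearing))
            for k in range(1, n + 1)]


if __name__ == "__main__":
    ref = brain_ground("cafe", {"cafe": (300, 100, 340, 300)})
    print(action_waypoints(ref, image_width=640))
```

The design choice worth noticing is the narrow interface: the Action stage only sees the visual reference, so grounding quality can be measured separately from waypoint quality.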

Load-bearing premise

The 3D Gaussian Splatting models of the commercial areas accurately reflect real-world traversability, lighting, and dynamic conditions so that results transfer to physical robots.

What would settle it

A physical robot executing the framework in one of the 11 commercial areas showing substantially different success rates or paths compared to its performance on the corresponding POINav-Bench reconstruction.
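One way to quantify "substantially different success rates" is a two-proportion comparison between paired benchmark and physical runs. The sketch below assumes per-episode boolean outcomes are available; the function names and the example counts are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(outcomes) / len(outcomes)


def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for the difference in success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        return 0.0  # both groups all-success or all-failure: no measurable gap
    return (p_a - p_b) / se


if __name__ == "__main__":
    # Hypothetical outcomes: 8/10 successes in the reconstruction, 5/10 on the robot.
    sim = [True] * 8 + [False] * 2
    real = [True] * 5 + [False] * 5
    gap = success_rate(sim) - success_rate(real)
    z = two_proportion_z(sum(sim), len(sim), sum(real), len(real))
    print(f"gap={gap:.2f}, z={z:.2f}")
```

With only a handful of physical episodes the statistic has little power, so a large gap would be suggestive rather than conclusive.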

Figures

Figures reproduced from arXiv: 2605.28237 by Kangning Niu, Meisheng Zhang, Mingchao Sun, Mu Xu, Qiming Li, Ruiyan Gong, Tianlun Li, Wei Guo, Xiaolong Cheng, Xiaolong Wu, Yanfen Shen, Yanqing Zhu, Yuxiang Zhao, Zedong Chu, Zhining Gu.

**Figure 1.** Overview of the POINav Ecosystem. To overcome the limitations of semantic misalignments and coarse evaluation scales in final-meters scenarios, POINav-Bench establishes a high-fidelity, interactive 3DGS platform reconstructed from physical commercial streets. Furthermore, by coupling a newly curated grounding dataset with a decoupled brain-action architecture, this ecosystem empowers embodied agents to a…

**Figure 2.** POINav-Bench construction pipeline. High-fidelity scenes are first reconstructed from 3DGS captures with a privacy-masking mechanism, with sites selected for perceptual consistency (top left). Diverse POIs are then identified across multiple categories (bottom left) and annotated with rich semantic metadata and spatial anchors (right). Specifically, entrances are annotated by horizontal bounding boxes exten…

**Figure 3.** Overview of the POINav Framework. (Top) The Semantic Brain Module grounds POI targets into explicit visual references. Conditioned on these visual cues, the Geometric Action Module predicts continuous trajectory waypoints. (Bottom) The automated Dataset Pipeline curates high-quality signage-entrance pairs for grounding supervision.

… grounding level: by improving the spatial reasoning of the Brain Module, …

**Figure 4.** Subsets of scenes in POINav-Bench. Each 3DGS scene faithfully reproduces the storefront textures, legible signage, and spatial layouts of real-world commercial streets. The high density of visually similar, co-located POIs poses substantial challenges for precise entrance-level localization during navigation.

5.3 POI-Grounded Reasoning Analysis. Since POINav follows the brain-action paradigm, a VLM-based …

**Figure 5.** Visualization of egocentric images during embodied navigation in …

**Figure 6.** POINav's failure and success patterns. POINav fails in large-heading-offset scenarios but succeeds on forward-view POIs, whereas vanilla OmniNav exhibits the opposite trend.

| Metric / Category | Backbone | POINav (Ours) |
| --- | --- | --- |
| POI-Grounded Reasoning Metrics (↑) | | |
| Referential Correctness (RC) | 91.6% | 98.8% |
| Grounding Quality (GQ) | 88.0% | 94.8% |
| Failure Analysis (↓) | | |
| Referential Error (RC & GQ) | 8.4% | 1.2% |
| Ambiguous Predictions (G… | | |

**Figure 8.** Extracted mesh geometry from 3DGS reconstructions of representative scenes in POINav-Bench. The meshes exhibit high geometric fidelity, capturing fine architectural details essential for physically grounded navigation evaluation.

B Additional Qualitative Analysis

Algorithm 1: Reconstruction Pipeline. Input: LiDAR P, RGB I, Aux. A, IMU U. Output: 3DGS Scene S, Collision Mesh M. // Phase 1: LiDAR-Inertial Mapp…

**Figure 9.** Representative navigation episodes illustrating photorealistic benchmark fidelity. Each row shows four sequential egocentric frames (left to right). (a) The agent correctly approaches the Starbucks storefront but terminates slightly offset from the entrance due to glass-facade visual ambiguity. (b) The agent reaches the correct bank building but converges on the 24h self-service entrance rather than the a…

Original abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps caused by generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m² in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New 3DGS-based benchmark and 70k dataset for real POI navigation in commercial spaces, but experiments stay inside simulation with no physical robot tests or sim-to-real checks.

The paper's main contribution is POINav-Bench: 11 real commercial areas reconstructed via 3D Gaussian Splatting, totaling over 126k square meters and 163 POIs, with traversability annotations and reference trajectories for closed-loop evaluation. They also release a 70k real-world signage-entrance dataset and a Brain-Action framework that splits POI reasoning from waypoint prediction.

This is genuinely new. Most existing VLN work either uses generated scenes or stops at coarse goals, so a high-fidelity real-capture benchmark focused on the final meters fills a clear gap. The scale and the explicit POI emphasis are useful additions that prior datasets do not directly provide.

The framework itself is straightforward but reasonable: the Brain module grounds the language-specified POI in the scene, and the Action module outputs continuous waypoints. The dataset curation from real signage is a practical step.

The soft spot is the evaluation. All reported results run inside the benchmark; the abstract gives no physical robot deployments, no quantitative sim-to-real transfer numbers on success rate or path error, and no tests on lighting changes, moving obstacles, or surface variations that 3DGS often struggles with. The claim that the framework offers a viable path to real-world use therefore rests on an untested transfer assumption. If the reconstructions miss curbs, wet floors, or dynamic elements, the internal gains may not carry over.

This paper is for groups working on real-world VLN or sim-to-real transfer who need better testbeds. The benchmark and dataset artifacts are the parts most likely to see reuse. It deserves a serious referee because new, grounded evaluation resources can organize a subfield even when the accompanying method is still early. I would send it to review with the expectation that the authors add at least one physical deployment or a clear sim-to-real ablation.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces POINav-Bench, the first closed-loop benchmark for real-world POI-goal navigation, consisting of 11 commercial areas (126,398 m² total) reconstructed via 3D Gaussian Splatting with traversability-aware annotations and reference trajectories spanning 163 POIs. It proposes the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module that outputs continuous waypoints, and curates POINav-Dataset (70K real-world signage-entrance pairs). The abstract states that experiments demonstrate the framework provides a viable path toward refining real-world POI-goal navigation.
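Reference trajectories make path-efficiency metrics computable alongside raw success. A common choice in embodied navigation is Success weighted by Path Length (SPL); whether the paper reports SPL is not stated in the text here, so the sketch below is an assumption about how such a metric could be computed, with the reference-trajectory length standing in for the shortest-path length.

```python
from typing import Iterable, Tuple

# Each episode: (success, reference_length_m, agent_path_length_m).
Episode = Tuple[bool, float, float]


def spl(episodes: Iterable[Episode]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    count = 0
    for success, ref_len, path_len in episodes:
        count += 1
        if success:
            # Taking the max keeps the ratio at or below 1 even if the agent
            # finds a path shorter than the reference.
            total += ref_len / max(path_len, ref_len)
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical episodes: one optimal, one twice as long, one failure.
    print(spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]))
```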

Significance. If the sim-to-real transfer holds, the benchmark and framework could meaningfully advance VLN research by supplying high-fidelity, POI-rich environments that address coarse granularity and sim-to-real gaps in existing benchmarks, with potential downstream impact on practical robotic navigation in commercial spaces. The scale of the 3DGS reconstructions and the introduction of traversability annotations represent concrete contributions to evaluation infrastructure.

major comments (1)

[Abstract] The central claim that the framework 'provides a viable path toward refining real-world POI-goal navigation' is load-bearing yet unsupported by any described physical robot deployment, quantitative sim-to-real comparison (success rate, trajectory error, etc.), or ablation on unmodeled factors such as dynamic obstacles, specular reflections, or fine-grained traversability (curbs, wet floors). All experiments appear confined to the benchmark, leaving the transfer assumption untested.
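The "trajectory error" requested in this comment could be measured in several ways. A minimal sketch, assuming paired 2D paths from the reconstruction and from the physical run in a shared coordinate frame: resample both by arc length to the same number of points and average the pointwise distance. The function names and the choice of 50 samples are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def resample(path: Sequence[Point], n: int) -> List[Point]:
    """Return n points (n >= 2) evenly spaced by arc length along the path."""
    cum = [0.0]
    for a, b in zip(path, path[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]
    if total == 0.0:
        return [path[0]] * n  # degenerate path: the agent never moved
    out: List[Point] = []
    seg = 0
    for k in range(n):
        target = total * k / (n - 1)
        while seg < len(path) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else min((target - cum[seg]) / span, 1.0)
        a, b = path[seg], path[seg + 1]
        out.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    return out


def mean_path_error(sim_path: Sequence[Point], real_path: Sequence[Point],
                    n: int = 50) -> float:
    """Mean distance between arc-length-matched points on the two paths."""
    sim_pts = resample(sim_path, n)
    real_pts = resample(real_path, n)
    return sum(math.dist(p, q) for p, q in zip(sim_pts, real_pts)) / n


if __name__ == "__main__":
    # Hypothetical paths: the physical run is offset sideways by one meter.
    print(mean_path_error([(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]))
```

Arc-length matching ignores timing, so it compares where the agent went rather than how fast; a time-aligned error would need timestamps on both paths.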

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comment below and will make corresponding revisions to strengthen the manuscript.

Referee: [Abstract] The central claim that the framework 'provides a viable path toward refining real-world POI-goal navigation' is load-bearing yet unsupported by any described physical robot deployment, quantitative sim-to-real comparison (success rate, trajectory error, etc.), or ablation on unmodeled factors such as dynamic obstacles, specular reflections, or fine-grained traversability (curbs, wet floors). All experiments appear confined to the benchmark, leaving the transfer assumption untested.

Authors: We agree that the current experiments are performed within POINav-Bench rather than on physical robots. The benchmark itself is constructed from real-world captures of 11 commercial areas (126,398 m²) using 3D Gaussian Splatting, with traversability annotations and reference trajectories derived directly from those captures; the 70K signage-entrance pairs are likewise real-world data. The Brain-Action framework is evaluated in closed loop on this high-fidelity benchmark to demonstrate improved POI-grounded reasoning and waypoint prediction. We acknowledge that no physical robot deployment, explicit sim-to-real quantitative metrics, or ablations on dynamic obstacles, specular reflections, or fine-grained surface conditions are reported. To address this, we will revise the abstract (and relevant sections) to state that the framework shows promise on the high-fidelity real-world-derived benchmark as a concrete step toward real-world POI-goal navigation, rather than claiming direct refinement of physical systems without further validation.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and framework are constructive contributions

The paper introduces POINav-Bench (11 commercial areas via 3DGS, traversability annotations, reference trajectories) and the POINav Brain-Action Framework (Brain module for POI-grounded reasoning, Action module for waypoints) plus POINav-Dataset (70K signage-entrance pairs). No equations, fitted parameters, or derivation chains are present in the abstract or described structure. Claims rest on benchmark construction and internal experiments rather than any self-referential reduction of predictions to inputs. This is self-contained against external benchmarks with no load-bearing self-citations or ansatzes invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.1-grok · 5794 in / 1200 out tokens · 30165 ms · 2026-06-29T11:46:03.299268+00:00 · methodology


[1] [1]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., Van Den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3674–3683 (2018)

2018

[2] [2]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

ObjectNav revisited: On evaluation of embodied agents navigating to objects,

Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., Wijmans, E.: Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171 (2020)

work page arXiv 2006

[5] [5]

arXiv preprint arXiv:2309.16634 (2023)

Bono, G., Antsfeld, L., Chidlovskii, B., Weinzaepfel, P., Wolf, C.: End-to-end (instance)-image goal navigation through correspondence as an emergent phe- nomenon. arXiv preprint arXiv:2309.16634 (2023)

work page arXiv 2023

[6] [6]

Advances in Neural Information Processing Systems33, 4247–4258 (2020)

Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navi- gation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems33, 4247–4258 (2020)

2020

[7] [7]

arXiv preprint arXiv:2511.21135 (2025)

Chen, Z., Guo, Y., Chu, Z., Luo, M., Shen, Y., Sun, M., Hu, J., Xie, S., Yang, K., Shi, P., et al.: Socialnav: Training human-inspired foundation model for socially- aware embodied navigation. arXiv preprint arXiv:2511.21135 (2025)

work page arXiv 2025

[8] [8]

NaVILA: Legged robot vision-language-action model for navigation,

Cheng,A.C.,Ji,Y.,Yang,Z.,Gongye,Z.,Zou,X.,Kautz,J.,Bıyık,E.,Yin,H.,Liu, S., Wang, X.: Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453 (2024)

work page arXiv 2024

[9] [9]

ABot-N0: Technical report on the VLA foundation model for versatile embodied navigation,

Chu, Z., Xie, S., Wu, X., Shen, Y., Luo, M., Wang, Z., Liu, F., Leng, X., Hu, J., Yin, M., et al.: Abot-n0: Technical report on the vla foundation model for versatile embodied navigation. arXiv preprint arXiv:2602.11598 (2026)

work page arXiv 2026

[10] [10]

arXiv preprint arXiv:2211.16649 (2022)

Dorbala, V.S., Sigurdsson, G., Piramuthu, R., Thomason, J., Sukhatme, G.S.: Clip-nav: Using clip for zero-shot vision-and-language navigation. arXiv preprint arXiv:2211.16649 (2022)

work page arXiv 2022

[11] [11]

Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., Wang, J., Chen, J., Huang, J., Lei, K., Yuan, L., Luo, L., Liu, P., Ye, Q., Qian, R., Yan, S., Zhao, S., Peng, S., Li, S., Yuan, S., Wu, S., Cheng, T., Liu, W., Wang, W., Zeng, X., Liu, X., Qin, X., Ding, X., Xiao, X., Zhang, X., Zhang, X., Xiong, X., Peng, Y., Chen, Y., ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Huang,Z.,Zhang,Y.,Liu,J.,Song,R.,Tang,C.,Ma,J.:Tic-vla:Athink-in-control vision-language-actionmodelforrobotnavigationindynamicenvironments(2026), https://arxiv.org/abs/2602.02459

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

arXiv preprint arXiv:2504.15643 (2025)

Ieong,I.T.,Tang,H.:Multimodalperceptionforgoal-orientednavigation:Asurvey. arXiv preprint arXiv:2504.15643 (2025)

work page arXiv 2025

[14] [14]

ACM Trans

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023)

2023

[15] [15]

arXiv preprint arXiv:2211.15876 (2022)

Krantz, J., Lee, S., Malik, J., Batra, D., Chaplot, D.S.: Instance-specific image goal navigation: Training embodied agents to find object instances. arXiv preprint arXiv:2211.15876 (2022)

work page arXiv 2022

[16] [16]

In: European Confer- ence on Computer Vision

Krantz, J., Wijmans, E., Majumdar, A., Batra, D., Lee, S.: Beyond the nav-graph: Vision-and-language navigation in continuous environments. In: European Confer- ence on Computer Vision. pp. 104–120. Springer (2020)

2020

[17]

Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 4392–4412 (2020)

[18]

Lin, S., Li, Z., Zhao, X., Zhou, G., Wang, L., Wei, R., Tang, R., Li, J., Wang, H., Pang, J., van den Hengel, A., Liu, J., Wu, Q.: Vlnverse: A benchmark for vision-language navigation with versatile, embodied, realistic simulation and evaluation (2025), https://arxiv.org/abs/2512.19021

[19]

Liu, X., Li, J., Jiang, Y., Sujay, N., Yang, Z., Zhang, J., Abanes, J., Zhang, J., Feng, C.: Citywalker: Learning embodied urban navigation from web-scale videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6875–6885 (2025)

[20]

Long, Y., Cai, W., Wang, H., Zhan, G., Dong, H.: Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882 (2024)

[21]

Miao, B., Wei, R., Ge, Z., Sun, X., Gao, S., Zhu, J., Wang, R., Tang, S., Xiao, J., Tang, R., Li, J.: Towards physically executable 3d gaussian for embodied navigation (2025), https://arxiv.org/abs/2510.21307

[22]

NVIDIA: Isaac Sim, https://github.com/isaac-sim/IsaacSim

[23]

Shah, D., Sridhar, A., Dashora, N., Stachowicz, K., Black, K., Hirose, N., Levine, S.: Vint: A foundation model for visual navigation. arXiv preprint arXiv:2306.14846 (2023)

[24]

Team, V., Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., Duan, S., Wang, W., Wang, Y., Cheng, Y., He, Z., Su, Z., Yang, Z., Pan, Z., Zeng, A., Wang, B., Chen, B., Shi, B., Pang, C., Zhang, C., Yin, D., Yang, F., Chen, G., Li, H., Zhu, J., Chen, J., Xu, J., Xu, J., Chen, J., Lin, J., Chen, J., Wang, J., Chen, J.,...

[25]

Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., Yang, J.: Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5261–5271 (2025)

[26]

Wang, S., Liang, C., Gao, Y., Yu, E., Li, S., Li, Y., Li, J., Wang, H.: Cityseeker: How do vlms explore embodied urban navigation with implicit human needs? arXiv preprint arXiv:2512.16755 (2025)

[27]

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

[28]

Wang, X., Liu, Y., Song, X., Liu, Y., Zhang, S., Jiang, S.: An interactive navigation method with effect-oriented affordance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16446–16456 (2024)

[29]

Wei, M., Wan, C., Peng, J., Yu, X., Yang, Y., Feng, D., Cai, W., Zhu, C., Wang, T., Pang, J., et al.: Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186 (2025)

[30]

Wei, M., Wan, C., Yu, X., Wang, T., Yang, Y., Mao, X., Zhu, C., Cai, W., Wang, H., Chen, Y., et al.: Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240 (2025)

[31]

Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357 (2019)

[32]

Xue, X., Hu, J., Luo, M., Xie, S., Chen, J., Xie, Z., Quan, K., Guo, W., Xu, M., Chu, Z.: Omninav: A unified framework for prospective exploration and visual-language navigation. arXiv preprint arXiv:2509.25687 (2025)

[33]

Yokoyama, N., Ha, S., Batra, D., Wang, J., Bucher, B.: Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 42–48. IEEE (2024)

[34]

Yokoyama, N., Ramrakhya, R., Das, A., Batra, D., Ha, S.: Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5543–

[35]

Zhang, J., Li, A., Qi, Y., Li, M., Liu, J., Wang, S., Liu, H., Zhou, G., Wu, Y., Li, X., et al.: Embodied navigation foundation model. arXiv preprint arXiv:2509.12129 (2025)

[36]

Zhang, J., Wang, K., Wang, S., Li, M., Liu, H., Wei, S., Wang, Z., Zhang, Z., Wang, H.: Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224 (2024)

[37]

Zhao, X., Agrawal, H., Batra, D., Schwing, A.G.: The surprising effectiveness of visual odometry techniques for embodied pointgoal navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16127–16136 (2021)

[38]

Zhao, Y., Yang, Y., Zhu, Y., Shen, Y., Wang, C., Gu, Z., Shi, P., Guo, W., Xu, M.: Bridging the indoor-outdoor gap: Vision-centric instruction-guided embodied navigation for the last meters (2026), https://arxiv.org/abs/2602.06427

[39]

Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13624–13634 (2024)