
arxiv: 2606.10927 · v1 · pith:GEAH7EMYnew · submitted 2026-06-09 · 💻 cs.RO

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

Pith reviewed 2026-06-27 13:24 UTC · model grok-4.3

classification 💻 cs.RO
keywords lifelong navigation · reinforcement learning · multimodal memory · embodied robotics · implicit mapping · real-world navigation · map-free navigation

The pith

AllDayNav encodes lifelong scene dynamics into a large model's parameters via reinforcement learning driven by a self-evolving multimodal memory, reaching near-100% success without explicit maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AllDayNav as a framework for lifelong robot navigation in changing environments. It trains a large model end-to-end with reinforcement learning so that scene understanding forms inside the model's weights rather than in a separate map. A memory system automatically stores visual keyframes, semantic labels, and time context, then creates its own training instructions and rewards from those stored elements. Tests across rooms, repeated episodes, and different tasks in both simulation and real settings show the method matches or exceeds map-based and vision-language baselines in success rate and path quality. The central demonstration is that implicit storage inside model parameters can replace explicit mapping for persistent navigation.

Core claim

AllDayNav shows that lifelong embodied navigation succeeds when scene dynamics are implicitly stored inside the billion-scale parameters of a large model through reinforcement learning; the storage is maintained by a self-evolving multimodal memory that keeps visual keyframes, semantic descriptions, and temporal context while generating open-vocabulary instructions, image goals, and structured rewards, producing success rates near 100 percent and better path efficiency than map-based, VLM, or standard RL baselines in cross-room, cross-episode, and cross-task conditions.

What carries the argument

self-evolving multimodal memory that maintains visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards for reinforcement learning
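
To make the memory component concrete, here is a minimal Python sketch of a keyframe store that holds a visual feature, a semantic description, and a timestamp per entry, and that turns stored entries into self-generated tasks. It is an illustration under stated assumptions, not the authors' implementation: the names `MemoryEntry`, `SceneMemory`, `observe`, and `propose_task`, the cosine-similarity novelty test, and the 0.9 threshold are all hypothetical, and the overwrite-on-redundancy rule is a simplification of the update check the paper describes.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    embedding: List[float]  # stand-in for a visual keyframe feature
    description: str        # semantic description (e.g. a caption)
    timestamp: float        # temporal context


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SceneMemory:
    """Hypothetical keyframe memory; not the paper's actual data structure."""

    def __init__(self, novelty_threshold=0.9):
        # Assumed value: frames less similar than this count as novel.
        self.novelty_threshold = novelty_threshold
        self.entries = []

    def observe(self, embedding, description, timestamp):
        """Insert a novel keyframe, or refresh the most similar stored one."""
        best, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine(entry.embedding, embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is None or best_sim < self.novelty_threshold:
            self.entries.append(
                MemoryEntry(list(embedding), description, timestamp)
            )
            return "inserted"
        # Simplification: overwrite the redundant entry with the newer view.
        best.embedding = list(embedding)
        best.description = description
        best.timestamp = timestamp
        return "updated"

    def propose_task(self, index):
        """Turn a stored entry into a self-generated instruction and goal."""
        entry = self.entries[index]
        return {
            "instruction": f"Go to: {entry.description}",
            "goal": entry.embedding,
        }
```

In this sketch a revisited location replaces its older record, so a task proposed later reflects the most recent description of the scene.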

If this is right

  • Success rates approach 100 percent across cross-room, cross-episode, and cross-task scenarios in both simulation and reality.
  • Path efficiency and robustness exceed those of map-based, vision-language, and standard reinforcement-learning methods.
  • Scene understanding forms inside model parameters instead of separate maps or graphs.
  • The same memory-driven reinforcement-learning loop supports open-vocabulary instructions without hand-crafted rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar implicit encoding could apply to other long-horizon embodied tasks such as sequential manipulation if the memory component is reused.
  • Further scaling of the underlying model size might increase the duration over which memory remains reliable without added structures.
  • The approach invites direct comparison of memory-update frequency against navigation failure rate in environments with higher rates of object movement.

Load-bearing premise

The multimodal memory can keep updating its visual, semantic, and temporal records from partial observations without external help or explicit maps.

What would settle it

Repeated trials in rapidly changing real spaces where the memory fails to retain consistent scene records across episodes and navigation success drops below map-based baselines.

Figures

Figures reproduced from arXiv: 2606.10927 by Hang Yin, He Wang, Jiahang Liu, Jiazhao Zhang, Minghan Li, Yinan Liang, Zhizheng Zhang.

Figure 1: Overview of AllDayNav, a lifelong self-learning navigation framework. (Left) Deployment in unseen dynamic environments (simulated and real-world). (Middle) The core Memory–Policy Co-evolution: AllDayNav autonomously builds a self-evolving multimodal memory database to generate self-instructions and retrieve visual goals, training a Vision-Language-Action (VLA) policy without supervision. (Right) As experie…
Figure 2: AllDayNav system architecture. (A) The VLA Navigation Backbone processes dual-encoded observations (SigLIP + DINOv2) and compressed history to predict waypoints. (B) The Self-Evolving Memory Database continuously accumulates keyframes with semantic descriptions generated by VLM. (C) The Self-Instruction & Retrieval Module generates diverse tasks from memory and retrieves visual goals (with Internet fallbac…
Figure 3: Memory update and retrieval mechanism. (Left) The Self-Evolving Construction Process: Novel frames are captioned and inserted; redundant entries trigger an alignment-based update check. (Right) VLM-Based Semantic Retrieval: Given a complex instruction (possibly with temporal constraints), the VLM selects the most relevant goal from a context of candidate descriptions and timestamps.
Figure 4: VLA backbone architecture and Sim-to-Real strategy.
Figure 5: Lifelong learning performance in simulation across five HM3D/MP3D test scenes. Each subplot reports the fixed-test performance of the pretrained…
Figure 6: Online exploration performance during autonomous learning. This figure shows the dynamic performance on self-instruction tasks during exploration…
Figure 7: Ablation study results comparing the full AllDayNav model with ablation variants across three key metrics: Success Rate (SR), Success weighted by…
Figure 8: Online learning performance on challenging scenarios. The figure shows the Exponential Moving Average (EMA) of Success Rate, SPL, and Episode…
Figure 11: Real-world experimental setup. The Unitree Go2 platform is…
Figure 10: Retrieval accuracy analysis during online exploration in Scene S1.
Figure 12: Real-world lifelong learning performance across three real-world environments. Rows correspond to Success Rate and Episode Length; columns…
Figure 13: Real-world online exploration performance using EMA of Success Rate and Episode Length during autonomous deployment. These metrics capture…
Figure 14: Visualization of navigation episodes. The figure displays a navigation trajectory in the simulation environment (left) and a real-world deployment…
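
Several captions above report Success Rate, SPL, and exponential moving averages (EMA) of per-episode metrics. For readers unfamiliar with these quantities, the following Python sketch computes SPL in its standard form (each episode's success indicator weighted by the ratio of shortest-path length to the longer of the actual and shortest path lengths, averaged over episodes) and a simple EMA. The function names and the smoothing factor `alpha` are illustrative assumptions; the paper's own evaluation code and EMA coefficient are not given in this page.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 success indicators per episode
    shortest_lengths: shortest-path distance from start to goal (assumed > 0)
    path_lengths: length of the path the agent actually took
    """
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        total += s * shortest / max(taken, shortest)
    return total / len(successes)


def ema(values, alpha=0.1):
    """Exponential moving average; alpha is an assumed smoothing factor."""
    smoothed = []
    avg = None
    for v in values:
        # The first value seeds the average; later values are blended in.
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed
```

A successful episode that takes twice the shortest path contributes 0.5 to the SPL sum, which is why SPL can lag well behind success rate when paths are inefficient.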
Original abstract

Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching $100\%$ and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes AllDayNav, a lifelong self-learning navigation framework for dynamic real-world environments. It implicitly encodes scene dynamics into the parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains visual keyframes, semantic descriptions, and temporal context while generating open-vocabulary instructions, image goals, and structured rewards. Experiments across synthetic and real-world settings in cross-room, cross-episode, and cross-task scenarios are reported to yield success rates approaching 100% while outperforming map-based, VLM, and RL baselines in path efficiency and robustness.

Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for embodied AI and robotics by demonstrating that implicit memory-driven RL can serve as a scalable alternative to explicit mapping or scene graphs for persistent scene understanding and reliable lifelong navigation.

major comments (1)
  1. [Abstract] The central claims of near-100% success rates and consistent outperformance over baselines are load-bearing for the contribution, yet the text provides no information on experimental setup, environments, number of trials, specific baselines, metrics, or statistical significance testing. This prevents assessment of whether the data support the reported superiority.
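
The referee's point about trial counts can be illustrated with a confidence interval on a success rate. The Python sketch below computes a Wilson score interval for a binomial proportion; it is a generic statistical illustration, not an analysis from the paper, and the trial counts in the usage note are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

With a hypothetical 20 successes in 20 trials, the lower bound is about 0.84; with 200 of 200 it is about 0.98. The same observed 100% rate therefore supports very different claims depending on how many trials were run, which is why the number of trials matters for assessing "near-100%" results.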

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that additional context on the experimental setup would strengthen the presentation of our central claims and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claims of near-100% success rates and consistent outperformance over baselines are load-bearing for the contribution, yet the text provides no information on experimental setup, environments, number of trials, specific baselines, metrics, or statistical significance testing. This prevents assessment of whether the data support the reported superiority.

    Authors: We agree with the observation. While the abstract is kept concise by design, the lack of even high-level experimental descriptors makes the claims harder to evaluate at a glance. In the revised manuscript we will expand the abstract to include brief references to the environments (synthetic and real-world), scenario types (cross-room, cross-episode, cross-task), the three classes of baselines (map-based, VLM, RL), the primary metrics (success rate and path efficiency), and the number of trials. Full details on statistical testing and exact trial counts remain in the Experiments section, but the abstract will now be more self-contained.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and available text present an empirical framework description and performance claims without any equations, algorithms, derivations, or first-principles predictions. No load-bearing steps exist that could reduce by construction to inputs, self-citations, or fitted parameters renamed as predictions. The work reports experimental outcomes across environments and, as an applied RL system, is evaluated against external baselines rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract was available, so the ledger is incomplete. The approach relies on the assumption that the large model can implicitly encode scene dynamics via RL.

invented entities (1)
  • self-evolving multimodal memory · no independent evidence
    purpose: To maintain and update visual keyframes, semantic descriptions, and temporal context for generating instructions and rewards
    Introduced in the abstract as a core component without external validation mentioned.

pith-pipeline@v0.9.1-grok · 5696 in / 1223 out tokens · 24067 ms · 2026-06-27T13:24:34.506858+00:00 · methodology

