arxiv: 2509.16445 · v2 · submitted 2025-09-19 · 💻 cs.RO

FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning

Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords navigation · vision-language models · fine-tuning · object navigation · embodied AI · generalization · HM3D · exploration frontier

The pith

Directly fine-tuning a pre-trained vision-language model on simulated navigation data produces a policy that sets new performance records on HM3D object navigation benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FiLM-Nav, which adapts a vision-language model into an end-to-end navigation policy by fine-tuning it to select the next exploration frontier from visual trajectory history and a language-specified goal. Training occurs on a mixture of ObjectNav, OVON, ImageNav, and auxiliary spatial reasoning tasks in simulation, grounding the model's broad knowledge in the specific patterns of goal-driven movement. This yields higher success rates and SPL scores than prior open-vocabulary approaches, along with improved ability to handle object categories not encountered during training. The work shows that targeted embodied fine-tuning can turn web-scale pretraining into practical robotic navigation without relying on separate mapping or zero-shot prompting modules.

Core claim

FiLM-Nav shows that conditioning a fine-tuned VLM directly on raw visual history and the navigation goal, using a diverse mixture of simulated ObjectNav, OVON, ImageNav, and spatial reasoning data, produces a policy that reaches new state-of-the-art SPL and success rates on HM3D ObjectNav while also leading in SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.

What carries the argument

The fine-tuned VLM that outputs the next exploration frontier by processing visual trajectory history together with the language goal.
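
As a rough illustration of this selection step, the sketch below assumes the model has already produced one score (logit) per candidate choice token and that the highest-scoring token is taken greedily. The names `FrontierChoice` and `select_frontier`, the token labels, and the scores are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class FrontierChoice:
    """A candidate frontier, tagged with the choice token the model can emit."""
    token: str     # hypothetical label such as "c0", "c1", ...
    image_id: str  # hypothetical handle for the RGB view representing the frontier


def select_frontier(choices: Sequence[FrontierChoice],
                    token_logits: Mapping[str, float]) -> FrontierChoice:
    """Return the frontier whose choice token has the highest logit.

    Greedy argmax decoding is an assumption made for this sketch.
    """
    if not choices:
        raise ValueError("no frontier candidates to choose from")
    missing = [c.token for c in choices if c.token not in token_logits]
    if missing:
        raise KeyError(f"no logit for choice tokens: {missing}")
    return max(choices, key=lambda c: token_logits[c.token])


# Made-up candidates and scores, for illustration only.
choices = [
    FrontierChoice("c0", "frame_012"),
    FrontierChoice("c1", "frame_047"),
    FrontierChoice("c2", "frame_063"),
]
logits = {"c0": -1.2, "c1": 0.8, "c2": 0.1}
best = select_frontier(choices, logits)
print(best.token, best.image_id)
```

Restricting the decision to a small set of choice tokens keeps the policy's output space equal to the number of candidate frontiers, whatever the size of the model's vocabulary.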

If this is right

  • The policy achieves new state-of-the-art SPL and success rate among open-vocabulary methods on HM3D ObjectNav.
  • It also records the highest SPL on the HM3D-OVON benchmark while generalizing to object categories never seen in training.
  • Diverse task mixture during fine-tuning proves necessary for robustness across different navigation settings.
  • Direct fine-tuning on embodied simulation data offers an effective route to semantic navigation without intermediate map construction.
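
For readers unfamiliar with the two metrics in these bullets: success rate is the fraction of episodes that end in success, and SPL (Success weighted by Path Length, as defined in "On Evaluation of Embodied Navigation Agents") averages, over all episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The sketch below computes both; the `Episode` type and the numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    success: bool
    shortest_path: float  # shortest-path distance from start to goal, in metres
    agent_path: float     # length of the path the agent actually travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(ep.success for ep in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.shortest_path <= 0:
            raise ValueError("shortest_path must be positive")
        if ep.success:
            # Failed episodes contribute 0; successful ones are discounted
            # by how much longer the agent's path was than the shortest path.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)


# Illustrative episodes only.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),   # contributes 0.5
    Episode(success=True, shortest_path=4.0, agent_path=4.0),    # contributes 1.0
    Episode(success=False, shortest_path=6.0, agent_path=12.0),  # contributes 0.0
]
print(spl(episodes))  # 0.5
```

Because SPL penalizes detours, two policies with the same success rate can differ in SPL, which is why the two numbers are reported separately.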

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Navigation stacks could simplify by replacing separate perception and planning modules with a single fine-tuned VLM.
  • The same fine-tuning recipe might extend to other sequential robotics tasks such as manipulation or multi-robot coordination.
  • Real-world data collection on physical platforms could further close the remaining sim-to-real gap in visual grounding.

Load-bearing premise

That exposure to targeted simulated navigation trajectories is sufficient to ground the VLM's pre-trained representations in the dynamics and visual patterns needed for reliable goal-directed movement.

What would settle it

A drop in success rate or SPL when the same model is tested on physical robots in rooms containing object instances and spatial layouts absent from the simulation training set.

Figures

Figures reproduced from arXiv: 2509.16445 by Naoki Yokoyama, Sehoon Ha.

Figure 1: FiLM-Nav takes the agent’s egocentric view history, a language goal ( …
Figure 2: FiLM-Nav architecture. Left: Input images are processed into vision tokens using a frozen SigLIP ViT and a trainable projector. Right: The LLM processes a sequence containing trajectory video tokens, language instructions, and image choices, each with vision tokens and a unique language token ci. The LLM predicts the language token corresponding to the selected choice. SPL serves as the primary ranking met…
Figure 3: Training data is generated via greedy frontier-based exploration. Each frontier is represented by a past RGB observation, selected …

Abstract

Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FiLM-Nav, which directly fine-tunes a pre-trained VLM as the navigation policy for open-vocabulary object navigation. The method conditions on raw visual trajectory history and the goal to select exploration frontiers, using targeted simulated embodied experience on a diverse mixture of ObjectNav, OVON, ImageNav, and auxiliary spatial reasoning tasks. The authors claim this mixture is essential for robustness and generalization, reporting new state-of-the-art SPL and success rate on HM3D ObjectNav among open-vocabulary methods as well as SOTA SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.

Significance. If the reported performance gains hold under rigorous controls, the work provides evidence that direct VLM fine-tuning on diverse embodied simulation data can ground pre-trained representations for efficient, generalizable semantic navigation, offering a scalable alternative to zero-shot or map-annotation pipelines. This has potential implications for real-world robotic deployment where language-specified goals must be handled without task-specific engineering.

major comments (1)
  1. [§4] §4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.
minor comments (1)
  1. [Table 1] Table 1 and §4.2: Ensure all baselines include the exact same VLM backbone and training compute budget for fair comparison; current presentation leaves open whether differences in pre-training data affect the SOTA attribution.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for highlighting an important point about the strength of evidence supporting our claims. We address the major comment in detail below and outline our planned revisions.

point-by-point responses
  1. Referee: [§4] §4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.

    Authors: We agree that the manuscript would be strengthened by explicit ablations that isolate the contribution of each data component while holding architecture, optimizer, and evaluation protocol fixed. Our current results demonstrate strong performance with the full mixture and weaker results in preliminary single-task pilots, but we did not report the full controlled ablations requested. In the revision we will add these experiments: we will train identical VLM instances on (i) ObjectNav only, (ii) ObjectNav+OVON, (iii) ObjectNav+ImageNav, and (iv) the complete mixture, reporting SPL, success rate, and generalization to unseen categories on HM3D-OVON under the same training budget and hyperparameters. These results will be presented in a new table in §4 and will directly support (or qualify) the claim that the diverse mixture is essential. revision: yes
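
The four training conditions named in this response can be written down as a small configuration grid. The sketch below is only an illustration: the task names follow the rebuttal and abstract, while `AblationRun`, `removed_tasks`, and the `shared_setup` placeholder (standing in for the fixed architecture, optimizer, training budget, and evaluation protocol) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# The complete mixture as described in the abstract.
FULL_MIXTURE: Tuple[str, ...] = ("ObjectNav", "OVON", "ImageNav", "SpatialReasoning")


@dataclass(frozen=True)
class AblationRun:
    name: str
    tasks: Tuple[str, ...]
    # Placeholder label: every run is meant to share the same architecture,
    # optimizer, training budget, and evaluation protocol.
    shared_setup: str = "fixed"


RUNS = [
    AblationRun("i", ("ObjectNav",)),
    AblationRun("ii", ("ObjectNav", "OVON")),
    AblationRun("iii", ("ObjectNav", "ImageNav")),
    AblationRun("iv", FULL_MIXTURE),
]


def removed_tasks(run: AblationRun) -> Tuple[str, ...]:
    """Tasks from the full mixture that this run leaves out."""
    return tuple(t for t in FULL_MIXTURE if t not in run.tasks)


for run in RUNS:
    print(run.name, "trains on", run.tasks, "and omits", removed_tasks(run))
```

Keeping `shared_setup` identical across runs mirrors the referee's requirement that only the data mixture varies between conditions.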

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark evaluation, not derivations or self-referential fits

The paper describes a fine-tuning procedure for VLMs on simulated navigation data and reports SOTA SPL/success metrics on HM3D ObjectNav and HM3D-OVON. No equations, parameters fitted to subsets then re-predicted, or self-citation chains appear in the provided text. The central claim attributes performance to the data mixture and direct conditioning on visual history, but this is justified by external benchmark outcomes rather than any quantity defined in terms of itself. The work is self-contained against standard embodied AI evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning paper for robotics. The abstract contains no explicit free parameters, mathematical axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5783 in / 1286 out tokens · 46316 ms · 2026-05-18T15:06:47.081946+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

    cs.CV · 2026-02 · unverdicted · novelty 6.0

    MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.
