FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning
Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3
The pith
Directly fine-tuning a pre-trained vision-language model on simulated navigation data produces a policy that sets new performance records on HM3D object navigation benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FiLM-Nav fine-tunes a VLM on a diverse mixture of simulated ObjectNav, OVON, ImageNav, and spatial reasoning data, and conditions it directly on raw visual history and the navigation goal. The resulting policy reaches new state-of-the-art SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and the highest SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.
What carries the argument
The fine-tuned VLM that outputs the next exploration frontier by processing visual trajectory history together with the language goal.
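The interface this implies can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Frontier`, `select_frontier`, and the `score_fn` callback that stands in for the fine-tuned VLM are hypothetical names, and `toy_score` is a placeholder heuristic included only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Frontier:
    """A candidate exploration frontier (hypothetical structure)."""
    frontier_id: int
    position: tuple[float, float]  # map coordinates, assumed to be in meters


# Stand-in for the fine-tuned VLM: scores one candidate frontier given the
# visual history (opaque frame objects) and the language goal.
ScoreFn = Callable[[Sequence[object], str, Frontier], float]


def select_frontier(
    history: Sequence[object],
    goal: str,
    frontiers: Sequence[Frontier],
    score_fn: ScoreFn,
) -> Frontier:
    """Return the frontier with the highest score under score_fn."""
    if not frontiers:
        raise ValueError("no frontiers to choose from")
    return max(frontiers, key=lambda f: score_fn(history, goal, f))


def toy_score(history: Sequence[object], goal: str, frontier: Frontier) -> float:
    """Placeholder scorer (NOT a VLM): prefers frontiers nearer the origin."""
    x, y = frontier.position
    return -((x * x + y * y) ** 0.5)


if __name__ == "__main__":
    candidates = [Frontier(0, (3.0, 4.0)), Frontier(1, (1.0, 0.0))]
    chosen = select_frontier(["frame_000.png"], "find a chair", candidates, toy_score)
    print(chosen.frontier_id)
```

In a real system the scorer would be replaced by a call into the fine-tuned model; the point of the sketch is only that the policy's decision reduces to ranking a finite set of frontiers given history and goal.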
If this is right
- The policy achieves new state-of-the-art SPL and success rate among open-vocabulary methods on HM3D ObjectNav.
- It also records the highest SPL on the HM3D-OVON benchmark while generalizing to object categories never seen in training.
- A diverse task mixture during fine-tuning is necessary for robustness across different navigation settings.
- Direct fine-tuning on embodied simulation data offers an effective route to semantic navigation without intermediate map construction.
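For readers unfamiliar with the two metrics named above, here is a small sketch of how success rate and SPL (Success weighted by Path Length) are commonly computed: each successful episode contributes the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, and failed episodes contribute zero. The function names and the example numbers are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(agent_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if ok:
            # Assumes shortest > 0; the ratio is capped at 1 for optimal paths.
            total += shortest / max(taken, shortest)
    return total / n


if __name__ == "__main__":
    # Made-up episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [5.0, 10.0, 4.0], [5.0, 20.0, 8.0]))
```

Because SPL penalizes detours as well as failures, a method can lead on SPL without leading on raw success rate, which is why the two are reported separately.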
Where Pith is reading between the lines
- Navigation stacks could simplify by replacing separate perception and planning modules with a single fine-tuned VLM.
- The same fine-tuning recipe might extend to other sequential robotics tasks such as manipulation or multi-robot coordination.
- Real-world data collection on physical platforms could further close the remaining sim-to-real gap in visual grounding.
Load-bearing premise
That exposure to targeted simulated navigation trajectories is sufficient to ground the VLM's pre-trained representations in the dynamics and visual patterns needed for reliable goal-directed movement.
What would settle it
A drop in success rate or SPL when the same model is tested on physical robots in rooms containing object instances and spatial layouts absent from the simulation training set.
Original abstract
Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FiLM-Nav, which directly fine-tunes a pre-trained VLM as the navigation policy for open-vocabulary object navigation. The method conditions on raw visual trajectory history and the goal to select exploration frontiers, using targeted simulated embodied experience on a diverse mixture of ObjectNav, OVON, ImageNav, and auxiliary spatial reasoning tasks. The authors claim this mixture is essential for robustness and generalization, reporting new state-of-the-art SPL and success rate on HM3D ObjectNav among open-vocabulary methods as well as SOTA SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.
Significance. If the reported performance gains hold under rigorous controls, the work provides evidence that direct VLM fine-tuning on diverse embodied simulation data can ground pre-trained representations for efficient, generalizable semantic navigation, offering a scalable alternative to zero-shot or map-annotation pipelines. This has potential implications for real-world robotic deployment where language-specified goals must be handled without task-specific engineering.
major comments (1)
- [§4] §4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.
minor comments (1)
- [Table 1] Table 1 and §4.2: Ensure all baselines include the exact same VLM backbone and training compute budget for fair comparison; current presentation leaves open whether differences in pre-training data affect the SOTA attribution.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for highlighting an important point about the strength of evidence supporting our claims. We address the major comment in detail below and outline our planned revisions.
Referee: [§4] §4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.
Authors: We agree that the manuscript would be strengthened by explicit ablations that isolate the contribution of each data component while holding architecture, optimizer, and evaluation protocol fixed. Our current results demonstrate strong performance with the full mixture and weaker results in preliminary single-task pilots, but we did not report the full controlled ablations requested. In the revision we will add these experiments: we will train identical VLM instances on (i) ObjectNav only, (ii) ObjectNav+OVON, (iii) ObjectNav+ImageNav, and (iv) the complete mixture, reporting SPL, success rate, and generalization to unseen categories on HM3D-OVON under the same training budget and hyperparameters. These results will be presented in a new table in §4 and will directly support (or qualify) the claim that the diverse mixture is essential.
Revision: yes
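The four-arm ablation described in this response could be laid out as a set of run configurations that differ only in the training tasks. The sketch below is illustrative: the names `AblationArm`, `build_arms`, and `make_run_configs`, the task label `SpatialReasoning` for the auxiliary task, and every hyperparameter value in `SHARED` are assumptions chosen for the example, not settings from the paper.

```python
from dataclasses import dataclass

# Task labels for the full mixture; "SpatialReasoning" is a hypothetical
# label for the auxiliary spatial reasoning task.
ALL_TASKS = ("ObjectNav", "OVON", "ImageNav", "SpatialReasoning")


@dataclass(frozen=True)
class AblationArm:
    """One ablation arm: a name plus the tasks it trains on."""
    name: str
    tasks: tuple[str, ...]


def build_arms() -> list[AblationArm]:
    """The four arms listed in the rebuttal, (i) through (iv)."""
    return [
        AblationArm("objectnav_only", ("ObjectNav",)),
        AblationArm("objectnav_ovon", ("ObjectNav", "OVON")),
        AblationArm("objectnav_imagenav", ("ObjectNav", "ImageNav")),
        AblationArm("full_mixture", ALL_TASKS),
    ]


# Placeholder values only; what matters is that every arm shares them.
SHARED = {"training_steps": 10_000, "learning_rate": 1e-5, "seed": 0}


def make_run_configs(arms: list[AblationArm], shared: dict) -> list[dict]:
    """Attach identical hyperparameters to every arm so only the data differs."""
    return [{"arm": arm.name, "tasks": list(arm.tasks), **shared} for arm in arms]


if __name__ == "__main__":
    for cfg in make_run_configs(build_arms(), SHARED):
        print(cfg)
```

Keeping the shared settings in one dictionary makes it easy to verify that the training budget and hyperparameters really are held fixed across arms, which is the control the referee asked for.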
Circularity Check
No circularity: empirical claims rest on benchmark evaluation, not derivations or self-referential fits
The paper describes a fine-tuning procedure for VLMs on simulated navigation data and reports SOTA SPL/success metrics on HM3D ObjectNav and HM3D-OVON. No equations, parameters fitted to subsets then re-predicted, or self-citation chains appear in the provided text. The central claim attributes performance to the data mixture and direct conditioning on visual history, but this is justified by external benchmark outcomes rather than any quantity defined in terms of itself. The work is self-contained against standard embodied AI evaluation protocols.
Forward citations
Cited by 1 Pith paper
- MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
  MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.