FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning

Naoki Yokoyama; Sehoon Ha

arxiv: 2509.16445 · v2 · submitted 2025-09-19 · 💻 cs.RO

Pith reviewed 2026-05-18 15:06 UTC · model grok-4.3

keywords: navigation, vision-language models, fine-tuning, object navigation, embodied AI, generalization, HM3D, exploration frontier

The pith

Directly fine-tuning a pre-trained vision-language model on simulated navigation data produces a policy that sets new performance records on HM3D object navigation benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FiLM-Nav, which adapts a vision-language model into an end-to-end navigation policy by fine-tuning it to select the next exploration frontier from visual trajectory history and a language-specified goal. Training occurs on a mixture of ObjectNav, OVON, ImageNav, and auxiliary spatial reasoning tasks in simulation, grounding the model's broad knowledge in the specific patterns of goal-driven movement. This yields higher success rates and SPL scores than prior open-vocabulary approaches, along with improved ability to handle object categories not encountered during training. The work shows that targeted embodied fine-tuning can turn web-scale pretraining into practical robotic navigation without relying on separate mapping or zero-shot prompting modules.

Core claim

FiLM-Nav shows that conditioning a fine-tuned VLM directly on raw visual history and the navigation goal, using a diverse mixture of simulated ObjectNav, OVON, ImageNav, and spatial reasoning data, produces a policy that reaches new state-of-the-art SPL and success rates on HM3D ObjectNav while also leading in SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.

What carries the argument

The fine-tuned VLM that outputs the next exploration frontier by processing visual trajectory history together with the language goal.

If this is right

The policy achieves new state-of-the-art SPL and success rate among open-vocabulary methods on HM3D ObjectNav.
It also records the highest SPL on the HM3D-OVON benchmark while generalizing to object categories never seen in training.
Diverse task mixture during fine-tuning proves necessary for robustness across different navigation settings.
Direct fine-tuning on embodied simulation data offers an effective route to semantic navigation without intermediate map construction.
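
Since the claims above are stated in terms of success rate and SPL, a minimal sketch of how the two metrics are usually computed may help. It follows the standard definition of SPL (Success weighted by Path Length) from Anderson et al., 2018; the function names and the episode tuple layout are assumptions made for illustration, not the paper's evaluation code.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_len, agent_path_len).
    A successful episode contributes shortest / max(agent, shortest);
    a failed episode contributes 0. Assumes shortest_path_len > 0.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes, for illustration only.
episodes = [
    (True, 5.0, 5.0),    # optimal path
    (True, 4.0, 8.0),    # success, but twice the shortest path
    (False, 3.0, 10.0),  # failure
    (True, 6.0, 6.0),    # optimal path
]
print(success_rate(episodes), spl(episodes))
```

SPL is never higher than success rate, so a method that leads in SPL is both finding the goal and taking short paths to it.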

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Navigation stacks could simplify by replacing separate perception and planning modules with a single fine-tuned VLM.
The same fine-tuning recipe might extend to other sequential robotics tasks such as manipulation or multi-robot coordination.
Real-world data collection on physical platforms could further close the remaining sim-to-real gap in visual grounding.

Load-bearing premise

That exposure to targeted simulated navigation trajectories is sufficient to ground the VLM's pre-trained representations in the dynamics and visual patterns needed for reliable goal-directed movement.

What would settle it

A drop in success rate or SPL when the same model is tested on physical robots in rooms containing object instances and spatial layouts absent from the simulation training set.

Figures

Figures reproduced from arXiv: 2509.16445 by Naoki Yokoyama, Sehoon Ha.

**Figure 2.** FiLM-Nav architecture. Left: Input images are processed into vision tokens using a frozen SigLIP ViT and a trainable projector. Right: The LLM processes a sequence containing trajectory video tokens, language instructions, and image choices, each with vision tokens and a unique language token c_i. The LLM predicts the language token corresponding to the selected choice. SPL serves as the primary ranking met…
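
The selection step in the Figure 2 caption, where the model predicts the language token of the chosen image, can be sketched as follows. This is a minimal illustration, not the authors' decoding code: restricting the comparison to the choice tokens and applying a softmax over them are assumptions, and the token ids and logits are made up.

```python
import math


def select_choice(logits, choice_token_ids):
    """Return (index of best choice, probabilities over the choices).

    logits: mapping from token id to the model's next-token logit.
    choice_token_ids: ids of the tokens c_1..c_k, one per candidate image.
    Only the choice tokens are compared; other vocabulary entries are ignored.
    """
    scores = [logits[t] for t in choice_token_ids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, probs


# Hypothetical logits: token 7 scores highest overall but is not a choice token.
logits = {101: 0.2, 102: 2.5, 103: -1.0, 7: 9.0}
best, probs = select_choice(logits, [101, 102, 103])
print(best, probs)
```

Casting the decision as a single-token prediction keeps the policy output within the language model's existing vocabulary head, which is what the caption describes.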

**Figure 3.** Training data is generated via greedy frontier-based exploration. Each frontier is represented by a past RGB observation, selected…
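
For readers unfamiliar with frontier-based exploration, here is a minimal sketch of the classical idea: a frontier cell is a known-free cell next to unexplored space. The grid encoding, the function names, and the nearest-frontier rule are assumptions for illustration. The caption does not say which greedy criterion the paper uses, so the Manhattan-distance pick below is only a stand-in.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontier_cells(grid):
    """Return free cells that have at least one unknown 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


def nearest_frontier(frontiers, agent):
    """Greedy stand-in: pick the frontier closest in Manhattan distance."""
    return min(frontiers, key=lambda f: abs(f[0] - agent[0]) + abs(f[1] - agent[1]))


# Toy occupancy grid: the right-hand column is partly unexplored.
grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
frontiers = find_frontier_cells(grid)
print(frontiers, nearest_frontier(frontiers, agent=(2, 0)))
```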

Abstract

Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FiLM-Nav shows direct VLM fine-tuning on a navigation data mix can hit SOTA SPL on HM3D ObjectNav and OVON, but the evidence does not yet isolate why that specific mix is required.

The main point to take away is that this work gets solid SOTA numbers on HM3D by fine-tuning a VLM to select the next exploration frontier directly from image history and a language goal, using a particular blend of training tasks. The new angle is treating the VLM itself as the policy head. Instead of zero-shot inference or using it to build semantic maps, they condition it on the trajectory of past observations and the target to pick the best frontier for exploration. Training happens in simulation on a mix that includes standard ObjectNav, open-vocabulary ObjectNav, ImageNav, and an auxiliary spatial task. They report that this combination drives better generalization to objects not seen during training.

What they do well is show measurable gains on the challenging HM3D-OVON setup, where the model has to handle novel categories. The success in transferring web-scale knowledge to embodied control through targeted fine-tuning is a practical step forward for semantic navigation.

The main soft spot is around the data mixture. The paper states that the diverse mix is essential for robustness, yet the reported results do not include ablations that drop individual components like the auxiliary task or ImageNav while keeping the model and optimizer the same. Without those, it's difficult to confirm that the full mixture is what produces the generalization, rather than just fine-tuning on the core navigation data. The experimental details in the abstract are also light on baselines and variance, which makes it harder to judge how much the numbers move the needle.

This kind of paper is for the embodied robotics community, particularly folks working on integrating large models into robot control loops. A reader focused on benchmark results for navigation would get concrete numbers to compare against, though they might need to dig into the full methods for implementation details. I would recommend sending it for peer review. The empirical claims are specific enough to benefit from referee scrutiny on the training setup and ablations.

Referee Report

1 major / 1 minor

Summary. The paper introduces FiLM-Nav, which directly fine-tunes a pre-trained VLM as the navigation policy for open-vocabulary object navigation. The method conditions on raw visual trajectory history and the goal to select exploration frontiers, using targeted simulated embodied experience on a diverse mixture of ObjectNav, OVON, ImageNav, and auxiliary spatial reasoning tasks. The authors claim this mixture is essential for robustness and generalization, reporting new state-of-the-art SPL and success rate on HM3D ObjectNav among open-vocabulary methods as well as SOTA SPL on the HM3D-OVON benchmark with strong generalization to unseen categories.

Significance. If the reported performance gains hold under rigorous controls, the work provides evidence that direct VLM fine-tuning on diverse embodied simulation data can ground pre-trained representations for efficient, generalizable semantic navigation, offering a scalable alternative to zero-shot or map-annotation pipelines. This has potential implications for real-world robotic deployment where language-specified goals must be handled without task-specific engineering.

major comments (1)

§4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.

minor comments (1)

Table 1 and §4.2: Ensure all baselines include the exact same VLM backbone and training compute budget for fair comparison; current presentation leaves open whether differences in pre-training data affect the SOTA attribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for highlighting an important point about the strength of evidence supporting our claims. We address the major comment in detail below and outline our planned revisions.

Referee: §4 (Experiments) and abstract: The central claim that fine-tuning on the specific diverse data mixture 'proves essential' for robustness and broad generalization to unseen categories on HM3D-OVON is not supported by ablation evidence. No experiments are reported that hold model architecture, optimizer, and evaluation fixed while removing individual components (e.g., training on ObjectNav alone versus the full mixture) to isolate their contribution to the reported SPL gains.

Authors: We agree that the manuscript would be strengthened by explicit ablations that isolate the contribution of each data component while holding architecture, optimizer, and evaluation protocol fixed. Our current results demonstrate strong performance with the full mixture and weaker results in preliminary single-task pilots, but we did not report the full controlled ablations requested. In the revision we will add these experiments: we will train identical VLM instances on (i) ObjectNav only, (ii) ObjectNav+OVON, (iii) ObjectNav+ImageNav, and (iv) the complete mixture, reporting SPL, success rate, and generalization to unseen categories on HM3D-OVON under the same training budget and hyperparameters. These results will be presented in a new table in §4 and will directly support (or qualify) the claim that the diverse mixture is essential.

Revision: yes
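
The four-way ablation described in this response could be written down as a small run grid. The sketch below is only an illustration of what "same training budget and hyperparameters" means in practice; the class, the field names, and the numeric values are hypothetical placeholders, not settings from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationRun:
    name: str
    tasks: tuple
    train_steps: int      # hypothetical placeholder budget
    learning_rate: float  # hypothetical placeholder value


# Shared settings, held fixed across all runs (values are made up).
SHARED = {"train_steps": 10_000, "learning_rate": 1e-5}

RUNS = [
    AblationRun("objectnav_only", ("ObjectNav",), **SHARED),
    AblationRun("objectnav_ovon", ("ObjectNav", "OVON"), **SHARED),
    AblationRun("objectnav_imagenav", ("ObjectNav", "ImageNav"), **SHARED),
    AblationRun(
        "full_mixture",
        ("ObjectNav", "OVON", "ImageNav", "SpatialReasoning"),
        **SHARED,
    ),
]


def is_controlled(runs):
    """True when every run shares the same budget and hyperparameters."""
    return len({(r.train_steps, r.learning_rate) for r in runs}) == 1


print(is_controlled(RUNS), [r.name for r in RUNS])
```

Only the task list varies between runs, so any difference in SPL or success rate can be attributed to the data mixture rather than to the optimizer or the compute budget.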

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark evaluation, not derivations or self-referential fits

The paper describes a fine-tuning procedure for VLMs on simulated navigation data and reports SOTA SPL/success metrics on HM3D ObjectNav and HM3D-OVON. No equations, parameters fitted to subsets then re-predicted, or self-citation chains appear in the provided text. The central claim attributes performance to the data mixture and direct conditioning on visual history, but this is justified by external benchmark outcomes rather than any quantity defined in terms of itself. The work is self-contained against standard embodied AI evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning paper for robotics. The abstract contains no explicit free parameters, mathematical axioms, or newly postulated entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
cs.CV 2026-02 unverdicted novelty 6.0

MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.

