From seeing to experiencing: Scaling navigation foundation models with reinforcement learning

Honglin He, Yukai Ma, Wayne Wu, Bolei Zhou · 2025 · cs.CV · arXiv 2507.22028

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

representative citing papers

Foresight: Iterative Reasoning About Clues that Matter for Navigation

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

Foresight uses iterative VLM plan proposal and critique with RL from human feedback to raise navigation success by 37% and cut interventions by 52% in real-world tests.

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

cs.RO · 2025-05-27 · conditional · novelty 7.0

EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

FlowPilot combines anchored flow matching for multimodal action pre-training with human-in-the-loop preference learning to improve long-horizon monocular sidewalk navigation, reporting 42% success in simulation and reduced interruptions in real-world tests.

citing papers explorer

Showing 1 of 1 citing paper after filters.

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

cs.RO · 2025-05-27 · conditional · none · ref 17 · internal anchor
