The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
arXiv preprint arXiv:2512.10310 , year=
8 Pith papers cite this work. Polarity classification is still indexing.
abstract
While Multimodal Large Language Models (MLLMs) have demonstrated significant promise in Vision-Language Navigation (VLN), existing agents remain heavily constrained by systemic bottlenecks across inference, training, and data collection. Specifically, they suffer from prohibitive latency due to visual history reprocessing, action leakage during sequence-packed training, and suboptimal exploration in self-correction data collection. To overcome these intertwined challenges, we present Efficient-VLN, a highly efficient and robust baseline that systematically resolves these issues through three simple-yet-effective mechanisms. (1) Inference: We introduce KV-cache reuse with contiguous RoPE, enabling the model to process only the newly observed frame at each step for real-time inference. (2) Training: We propose packed training with an action-isolating mask to accelerate throughput while effectively bridging the training-inference gap by preventing action leakage. (3) Data Collection: We employ an Adaptive DAgger to dynamically balance autonomous exploration and oracle guidance, enhancing error-recovery capability without escalating computational costs. Extensive evaluations show that Efficient-VLN significantly advances the state-of-the-art across the R2R-CE (73.2% SR) and RxR-CE (75.6% SR) benchmarks. Meanwhile, it yields a 28% latency reduction compared to the previous state-of-the-art StreamVLN, establishing a new paradigm for streaming MLLM-based navigation.
citation-role summary
citation-polarity summary
years
2026 8roles
baseline 1polarities
baseline 1representative citing papers
VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
GroundControl estimates navigation uncertainty via statistical deviation from nominal goal-directed distance dynamics, achieving low E-AURC (0.0024 weighted for GPT-4o) on EB-Navigation splits and outperforming entropy and conformal baselines under SRCN evaluation.
Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.
IDEA is a TTA framework for VLN that builds a dynamic asset library from Fisher-weighted soft prompts and domain coordinates, then uses convex-hull projection for cross-domain bridging and training-free adaptation.
PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
citing papers explorer
-
Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
-
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
-
GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates
GroundControl estimates navigation uncertainty via statistical deviation from nominal goal-directed distance dynamics, achieving low E-AURC (0.0024 weighted for GPT-4o) on EB-Navigation splits and outperforming entropy and conformal baselines under SRCN evaluation.
-
Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation
Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.
-
Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation
IDEA is a TTA framework for VLN that builds a dynamic asset library from Fisher-weighted soft prompts and domain coordinates, then uses convex-hull projection for cross-domain bridging and training-free adaptation.
-
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.
-
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
-
What Limits Vision-and-Language Navigation ?
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.