arXiv preprint arXiv:2512.10310

Efficient-VLN: A Training-Efficient Vision-Language Navigation Model · 2025 · cs.CV · arXiv 2512.10310

8 Pith papers cite this work. Polarity classification is still indexing.



abstract

While Multimodal Large Language Models (MLLMs) have demonstrated significant promise in Vision-Language Navigation (VLN), existing agents remain heavily constrained by systemic bottlenecks across inference, training, and data collection. Specifically, they suffer from prohibitive latency due to visual history reprocessing, action leakage during sequence-packed training, and suboptimal exploration in self-correction data collection. To overcome these intertwined challenges, we present Efficient-VLN, a highly efficient and robust baseline that systematically resolves these issues through three simple-yet-effective mechanisms. (1) Inference: We introduce KV-cache reuse with contiguous RoPE, enabling the model to process only the newly observed frame at each step for real-time inference. (2) Training: We propose packed training with an action-isolating mask to accelerate throughput while effectively bridging the training-inference gap by preventing action leakage. (3) Data Collection: We employ an Adaptive DAgger to dynamically balance autonomous exploration and oracle guidance, enhancing error-recovery capability without escalating computational costs. Extensive evaluations show that Efficient-VLN significantly advances the state-of-the-art across the R2R-CE (73.2% SR) and RxR-CE (75.6% SR) benchmarks. Meanwhile, it yields a 28% latency reduction compared to the previous state-of-the-art StreamVLN, establishing a new paradigm for streaming MLLM-based navigation.
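The abstract names an action-isolating mask for packed training but does not define it. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that a packed sequence interleaves observation and action tokens for successive navigation steps, and that "action leakage" means later steps attending to action tokens from earlier steps, which would not be in the cache when only new frames are processed at inference. The function name `action_isolating_mask`, the `"obs"`/`"act"` labels, and the per-token step indices are all hypothetical.

```python
import numpy as np


def action_isolating_mask(kinds, steps):
    """Return a boolean attention mask; True means query i may attend to key j.

    kinds[i] is "obs" or "act" (hypothetical labels for this sketch).
    steps[i] is the navigation step that token i belongs to.
    """
    kinds = np.asarray(kinds)
    steps = np.asarray(steps)
    n = len(kinds)

    # Standard causal mask: each token sees itself and earlier tokens.
    causal = np.tril(np.ones((n, n), dtype=bool))

    # Assumption: action tokens are visible only to queries from the same step,
    # so later steps never condition on earlier actions.
    is_action_key = (kinds == "act")[None, :]
    same_step = steps[:, None] == steps[None, :]
    hidden = is_action_key & ~same_step

    return causal & ~hidden


# Two packed steps, each with two observation tokens and one action token.
kinds = ["obs", "obs", "act", "obs", "obs", "act"]
steps = [0, 0, 0, 1, 1, 1]
mask = action_isolating_mask(kinds, steps)
print(mask.astype(int))
```

Under this assumption, the step-1 tokens attend to the step-0 observations but not to the step-0 action, while each action token still sees the observations of its own step and of all earlier steps.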

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

cs.RO · 2026-03-07 · conditional · novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

GroundControl estimates navigation uncertainty via statistical deviation from nominal goal-directed distance dynamics, achieving low E-AURC (0.0024 weighted for GPT-4o) on EB-Navigation splits and outperforming entropy and conformal baselines under SRCN evaluation.

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

cs.RO · 2026-05-22 · unverdicted · novelty 6.0

IDEA is a TTA framework for VLN that builds a dynamic asset library from Fisher-weighted soft prompts and domain coordinates, then uses convex-hull projection for cross-domain bridging and training-free adaptation.

PanoWorld: Towards Spatial Supersensing in 360° Panorama World

cs.CV · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

cs.RO · 2026-04-27 · unverdicted · novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

What Limits Vision-and-Language Navigation?

cs.RO · 2026-05-13 · unverdicted · novelty 5.0

StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

