hub Canonical reference

NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

· 2025 · arXiv 2412.04453

Canonical reference. 78% of citing Pith papers cite this work as background.

37 Pith papers citing it

Background 78% of classified citations

read on arXiv browse 37 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 1 method 1

citation-polarity summary

background 7 baseline 1 use method 1

representative citing papers

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

cs.RO · 2026-05-27 · unverdicted · novelty 7.0

POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

cs.RO · 2026-05-10 · unverdicted · novelty 7.0

OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

cs.RO · 2026-03-07 · conditional · novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

Phi-Nav generates path-level hindsight instructions from on-policy exploration trajectories to supply additional semantic supervision for vision-language navigation agents.

SpikeVLA: Vision-Language-Action Models with Spiking Neural Networks

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

SpikeVLA replaces transformer components in VLA models with spiking vision encoder, multi-modal LLM, and action policy network to reduce energy consumption while maintaining competitive performance on navigation tasks.

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

VEGA reconstructs local geometry from monocular egocentric video to create supervised trajectories that train a flow-matching VLA policy, yielding lower collision rates on a new benchmark and in real-world tests.

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

FlowPilot combines anchored flow matching for multimodal action pre-training with human-in-the-loop preference learning to improve long-horizon monocular sidewalk navigation, reporting 42% success in simulation and reduced interruptions in real-world tests.

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

A zero-shot unified agent for VLN-CE, ObjectNav, EQA and Aerial-VLN on wheeled, quadruped, humanoid and UAV platforms that translates language and vision inputs into actions via MLLMs plus TDM and SCB mechanisms, matching trained foundation models on multiple benchmarks.

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.

SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

Visually-grounded Humanoid Agents

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots

cs.RO · 2026-04-03 · unverdicted · novelty 6.0

DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb success versus 10% for baselines.

AstraNav-World: World Model for Foresight Control and Consistency

cs.CV · 2025-12-25 · unverdicted · novelty 6.0

AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.

Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

cs.RO · 2025-11-21 · unverdicted · novelty 6.0

Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.

R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

cs.RO · 2025-10-09 · unverdicted · novelty 6.0

R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.

Real-Time Execution of Action Chunking Flow Policies

cs.RO · 2025-06-09 · unverdicted · novelty 6.0

Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

cs.LG · 2025-04-22 · unverdicted · novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

FAST: Efficient Action Tokenization for Vision-Language-Action Models

cs.RO · 2025-01-16 · unverdicted · novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 5.0

FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

SALSA aligns social features and adds future-risk signals in VLA models to cut near-collisions by 86.4% and raise social accuracy from 53% to 93% on SCAND and real robots.

citing papers explorer

Showing 34 of 34 citing papers after filters.

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation cs.RO · 2026-06-05 · unverdicted · none · ref 12
The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation cs.RO · 2026-05-27 · unverdicted · none · ref 8
POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation cs.RO · 2026-05-21 · unverdicted · none · ref 13
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation cs.RO · 2026-05-10 · unverdicted · none · ref 7
OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.
Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation cs.AI · 2026-07-02 · unverdicted · none · ref 13
Phi-Nav generates path-level hindsight instructions from on-policy exploration trajectories to supply additional semantic supervision for vision-language navigation agents.
SpikeVLA: Vision-Language-Action Models with Spiking Neural Networks cs.RO · 2026-06-26 · unverdicted · none · ref 2
SpikeVLA replaces transformer components in VLA models with spiking vision encoder, multi-modal LLM, and action policy network to reduce energy consumption while maintaining competitive performance on navigation tasks.
VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision cs.RO · 2026-06-16 · unverdicted · none · ref 23
VEGA reconstructs local geometry from monocular egocentric video to create supervised trajectories that train a flow-matching VLA policy, yielding lower collision rates on a new benchmark and in real-world tests.
From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation cs.RO · 2026-06-10 · unverdicted · none · ref 29
FlowPilot combines anchored flow matching for multimodal action pre-training with human-in-the-loop preference learning to improve long-horizon monocular sidewalk navigation, reporting 42% success in simulation and reduced interruptions in real-world tests.
Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation cs.CV · 2026-06-01 · unverdicted · none · ref 19
Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.
Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation cs.RO · 2026-05-26 · unverdicted · none · ref 12
A zero-shot unified agent for VLN-CE, ObjectNav, EQA and Aerial-VLN on wheeled, quadruped, humanoid and UAV platforms that translates language and vision inputs into actions via MLLMs plus TDM and SCB mechanisms, matching trained foundation models on multiple benchmarks.
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation cs.CV · 2026-05-21 · unverdicted · none · ref 10
GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation cs.CV · 2026-04-30 · unverdicted · none · ref 15
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
Visually-grounded Humanoid Agents cs.CV · 2026-04-09 · unverdicted · none · ref 11
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation cs.AI · 2026-04-09 · unverdicted · none · ref 7
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots cs.RO · 2026-04-03 · unverdicted · none · ref 28
DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb success versus 10% for baselines.
AstraNav-World: World Model for Foresight Control and Consistency cs.CV · 2025-12-25 · unverdicted · none · ref 7
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation cs.RO · 2025-11-21 · unverdicted · none · ref 8
Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation cs.RO · 2025-10-09 · unverdicted · none · ref 4
R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.
Real-Time Execution of Action Chunking Flow Policies cs.RO · 2025-06-09 · unverdicted · none · ref 10
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization cs.LG · 2025-04-22 · unverdicted · none · ref 13
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 13
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 25
FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.
Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models cs.RO · 2026-06-09 · unverdicted · none · ref 7
SALSA aligns social features and adds future-risk signals in VLA models to cut near-collisions by 86.4% and raise social accuracy from 53% to 93% on SCAND and real robots.
G-DRAGON: Geospatial Reasoning and Dynamic Planning for Retrieval-Augmented Outdoor Navigation cs.RO · 2026-05-25 · unverdicted · none · ref 4
G-DRAGON framework maps language commands to OSM coordinates via lightweight LLM for global planning and uses frontier exploration for local targets, outperforming baselines in simulation and completing real UGV person-search missions up to 500m.
SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation cs.RO · 2026-05-17 · unverdicted · none · ref 7 · 2 links
SEDualVLN introduces a spatially-enhanced dual-system VLN architecture that achieves state-of-the-art results on VLN-CE benchmarks through coordinated VLM action generation and MLLM waypoint planning.
Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy cs.RO · 2026-05-15 · unverdicted · none · ref 19
Terrain-consistent reference modulation during RL training yields SE(2)-controllable humanoid locomotion policies that improve tracking in simulation and enable over 70 m closed-loop autonomous navigation on rough terrain and stairs on the Unitree G1 with onboard computation.
Think before Go: Hierarchical Reasoning for Image-goal Navigation cs.RO · 2026-04-19 · unverdicted · none · ref 102
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 153
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 8
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
Vision-Language Models for Deployable Social Robot Navigation: Bridging Semantic Reasoning and Low-Level Control cs.RO · 2026-06-27 · unverdicted · none · ref 29
Survey organizing VLM-based social robot navigation into reasoning, planning, and bridging components with a proposed roadmap for hybrid deployable systems.
Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation cs.CV · 2026-06-02 · unverdicted · none · ref 14
Proposes cost-aware question selection for ambiguous object navigation via information-gain analysis on corpora, a cost-penalizing benchmark, and a zero-shot MLLM agent.
PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs cs.RO · 2026-05-26 · unverdicted · none · ref 6
PEACE decouples single-pass LLM planning from PX4 execution via ROS 2 and a constraint layer, with modular 3D perception, and shows feasibility in Gazebo SITL with improved explainability and fewer LLM calls.
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap cs.RO · 2026-04-15 · unverdicted · none · ref 87
A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation cs.RO · 2026-06-01 · unverdicted · none · ref 9
HSAN integrates hierarchical semantic graphs, optimal transport-based goal selection, and graph-aware RL to claim SOTA results on VLN-CE tasks.

NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer