pith. machine review for the scientific record.

arxiv: 2412.06224 · v2 · submitted 2024-12-09 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links


Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 19:48 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords embodied navigation · vision-language-action · unification · video-based model · robot navigation · long-horizon tasks · real-world generalization · multi-task learning

The pith

A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Uni-NaVid as a video-based vision-language-action model that combines several common embodied navigation tasks into one system. It achieves this unification by aligning the input and output data setups across tasks such as instruction following, object searching, question answering, and people tracking. Training on 3.6 million samples collected from four sub-tasks creates shared learning benefits. The approach supports mixed long-horizon navigation in new real-world settings without relying on pre-built maps or separate models for each task. Experiments show improved benchmark results and practical effectiveness in physical environments.

Core claim

Uni-NaVid is the first video-based vision-language-action model that unifies diverse embodied navigation tasks by harmonizing input and output data configurations for all commonly used tasks, thereby integrating four essential sub-tasks into a single model trained on 3.6 million navigation samples to foster learning synergy and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.

What carries the argument

Harmonization of input and output data configurations across tasks, which integrates them into one model without separate handling for each.
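To make the harmonization concrete, here is a minimal sketch of what a shared sample format could look like, assuming a small discrete action vocabulary and free-form language instructions. The field names and action set are illustrative guesses, not the paper's actual schema.

```python
# Hypothetical unified sample format; names and the action vocabulary are
# assumptions for illustration, not taken from the paper.
from dataclasses import dataclass, field
from typing import List

# Assumed shared low-level action vocabulary across all four sub-tasks.
ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

@dataclass
class NavSample:
    frames: List[bytes] = field(default_factory=list)           # egocentric RGB video history
    instruction: str = ""                                       # the task, phrased as language
    target_action_ids: List[int] = field(default_factory=list)  # indices into ACTIONS
    answer: str = ""  # optional language output, e.g. for question answering

# Two very different tasks expressed in the same format:
following = NavSample(instruction="Walk past the sofa and stop at the door.",
                      target_action_ids=[0, 0, 2, 3])
question = NavSample(instruction="How many chairs are in the kitchen?",
                     target_action_ids=[0, 1, 3], answer="two")
```

Under a format like this, every sub-task reduces to the same mapping from (video history, instruction) to output tokens, so a single model can consume all four without per-task heads.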

If this is right

  • Supports seamless transitions between navigation tasks in long sequences without swapping models.
  • Achieves state-of-the-art results on standard navigation benchmarks through shared training.
  • Demonstrates strong generalization to real-world settings with unseen environments.
  • Reduces reliance on pre-defined maps or discretized waypoints for practical use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unification method may extend to combining navigation with manipulation tasks in broader robotics systems.
  • Larger-scale data collection under the same harmonized format could further boost performance on rare task combinations.
  • Deployment might simplify robot software stacks by replacing multiple specialized navigation modules with one model.

Load-bearing premise

Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or negative interference.

What would settle it

Experiments showing clear performance drops on any single sub-task when trained jointly compared to isolated training, or inability to complete mixed long-horizon sequences in real-world tests without additional tuning.

read the original abstract

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Uni-NaVid, the first video-based vision-language-action model for unifying multiple embodied navigation tasks (instruction following, object search, question answering, person tracking) by harmonizing input/output data configurations across them. It trains a single model on 3.6 million joint samples from four sub-tasks to foster cross-task synergy, reports state-of-the-art results on standard benchmarks, and validates real-world performance on mixed long-horizon tasks in unseen environments.

Significance. If the unification produces genuine positive transfer without negative interference or performance loss on individual tasks, the result would be a meaningful step toward practical generalist navigation agents that handle diverse, long-horizon demands without task-specific retraining or maps.

major comments (2)
  1. [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.
  2. [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.
minor comments (2)
  1. [Abstract] Abstract and §1: the repeated claim of being the 'first' video-based VLA for unification should be supported by a concise related-work comparison table or explicit citation of the closest prior VLA navigation models.
  2. [Method] Notation and data description: the harmonized input/output configurations are described at a high level; a single table enumerating the exact tokenization, action space, and observation format for each of the four sub-tasks would improve reproducibility.
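As context for the table requested in minor comment 2, here is an illustrative guess at the four harmonized configurations, written as a Python mapping; none of these entries are confirmed by the paper, which describes the configurations only at a high level.

```python
# Illustrative only: hypothetical harmonized configurations for the four
# sub-tasks named in the abstract. Not the paper's actual table.
TASK_CONFIGS = {
    "instruction_following": dict(observation="egocentric RGB video",
                                  language="step-by-step route instruction",
                                  output="low-level action tokens"),
    "object_search":         dict(observation="egocentric RGB video",
                                  language="'find a <category>'-style prompt",
                                  output="low-level action tokens"),
    "question_answering":    dict(observation="egocentric RGB video",
                                  language="natural-language question",
                                  output="action tokens, then answer text"),
    "person_tracking":       dict(observation="egocentric RGB video",
                                  language="'follow the person'-style prompt",
                                  output="low-level action tokens"),
}
```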

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation for the claimed benefits of unification. We address each major comment below and will revise the manuscript to incorporate the suggested comparisons and metrics.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.

    Authors: We agree that a controlled ablation isolating unification effects from data scaling would strengthen the evidence for positive cross-task synergy. The current experiments focus on comparisons to existing task-specific SOTA methods, which use varying data volumes and architectures. To directly address this concern, we will add an ablation study in the revised manuscript that trains single-task models using the same architecture and total sample count of 3.6 million (by replicating or appropriately allocating the combined data), allowing clear distinction between scaling and unification benefits; a sketch of this compute-matched protocol appears after the responses. revision: yes

  2. Referee: [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.

    Authors: The results section reports performance on each individual benchmark, demonstrating that the unified model maintains or exceeds the performance of prior specialized approaches without explicit degradation. However, we acknowledge that explicit per-task breakdowns and quantitative interference metrics versus single-task baselines are not tabulated. We will revise the manuscript to include these, adding tables with per-sub-task metrics for the unified model alongside single-task equivalents and any measured interference or synergy indicators; the sketch after the responses includes this computation. revision: yes
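A minimal sketch of the two promised analyses, assuming caller-supplied train and evaluate callables and a datasets mapping from sub-task name to its sample list; all names and the budget handling are hypothetical and illustrate only the shape of the protocol.

```python
# Sketch of (1) the compute-matched ablation from response 1 and (2) the
# interference/synergy indicator from response 2. `train` and `evaluate`
# are assumed callables; all names here are hypothetical.
import random
from itertools import chain, cycle, islice

SUB_TASKS = ("instruction_following", "object_search",
             "question_answering", "person_tracking")

def run_ablation(datasets, train, evaluate, budget=3_600_000):
    """Compute-matched ablation: every condition uses the same architecture
    (inside `train`) and the same total sample budget, so a unified-model
    gain cannot be explained by data volume alone."""
    mixed = list(islice(cycle(chain.from_iterable(datasets.values())), budget))
    random.shuffle(mixed)
    unified = train(mixed)
    unified_scores = {t: evaluate(unified, t) for t in SUB_TASKS}
    single_scores = {}
    for t in SUB_TASKS:
        # Oversample the single task's own data up to the full budget.
        single_model = train(list(islice(cycle(datasets[t]), budget)))
        single_scores[t] = evaluate(single_model, t)
    return unified_scores, single_scores

def transfer_deltas(unified_scores, single_scores):
    """Per-task difference between joint and isolated training:
    positive = synergy, negative = interference."""
    return {t: unified_scores[t] - single_scores[t] for t in unified_scores}
```

Reporting transfer_deltas alongside the per-task tables would bear directly on the load-bearing premise flagged above.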

Circularity Check

0 steps flagged

No circularity; unification via explicit data harmonization and empirical benchmarks

full rationale

The paper defines Uni-NaVid through concrete architectural choices: harmonizing input/output data configurations across four navigation sub-tasks, collecting 3.6M joint samples, and training a single video-based VLA model. Performance claims rest on reported benchmark results and real-world experiments rather than any self-referential reduction. No equations or derivations equate a claimed prediction to its own fitted inputs by construction, no load-bearing self-citations appear in the provided text, and no uniqueness theorems or ansatzes are imported from prior author work. The derivation chain is self-contained and externally falsifiable via the stated experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on standard deep learning assumptions for vision-language models plus empirical training choices; no new physical entities or ungrounded axioms are introduced beyond typical ML practice.

free parameters (2)
  • model architecture hyperparameters
    Standard transformer and training hyperparameters selected for training on the 3.6M samples.
  • task data balancing weights
    Choices in how samples from the four sub-tasks are mixed during unified training (see the sampling sketch after this ledger).
axioms (1)
  • domain assumption: Transformer-based VLA architectures can effectively learn joint representations from video, language, and action data across tasks.
    Implicit foundation for the unification approach in the model-design section.
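A sketch of how the second free parameter might be realized in practice: weighted sampling over per-task pools. The mixing weights shown are assumptions for illustration; the provided text does not state the actual ratios.

```python
# Hypothetical task-mixing sampler; the weights are illustrative, not the
# paper's actual ratios.
import random

def sample_batch(pools, weights, batch_size=32, rng=random):
    """pools: sub-task name -> list of samples; weights: same keys -> float."""
    tasks = list(pools)
    probs = [weights[t] for t in tasks]
    batch = []
    for _ in range(batch_size):
        task = rng.choices(tasks, weights=probs, k=1)[0]
        batch.append(rng.choice(pools[task]))
    return batch

# e.g. upweighting scarcer sub-tasks:
# sample_batch(pools, {"instruction_following": 0.4, "object_search": 0.3,
#                      "question_answering": 0.15, "person_tracking": 0.15})
```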

pith-pipeline@v0.9.0 · 5532 in / 1219 out tokens · 34698 ms · 2026-05-16T19:48:37.748335+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced — tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Uni-NaVid achieves this by harmonizing the input and output data configurations... extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...

  2. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  3. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

    cs.RO · 2026-03 · conditional · novelty 7.0

    VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

  4. PanoWorld: Towards Spatial Supersensing in 360° Panorama World

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

  5. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  6. GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...

  7. FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

  8. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  9. FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.

  10. Ψ-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    Ψ-Map combines plane-constrained Gaussian surfels from LiDAR with end-to-end panoptic lifting to deliver high-precision geometric and semantic reconstruction in large-scale environments at real-time speeds.

  11. Visually-grounded Humanoid Agents

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  12. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  13. Memory Over Maps: 3D Object Localization Without Reconstruction

    cs.RO · 2026-03 · unverdicted · novelty 6.0

    A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...

  14. MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

    cs.CV · 2026-02 · unverdicted · novelty 6.0

    MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.

  15. AstraNav-World: World Model for Foresight Control and Consistency

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied n...

  16. Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    A monocular RGB-only aerial VLN framework outperforms baselines via prompt-guided multi-task learning, keyframe selection, and label reweighting on AerialVLN and OpenFly benchmarks.

  17. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV · 2025-07 · unverdicted · novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  18. LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

    cs.CV · 2026-05 · conditional · novelty 5.0

    LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...

  19. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 19 Pith papers · 13 internal anchors
