pith. machine review for the scientific record.

arxiv: 2412.06224 · v2 · submitted 2024-12-09 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links


Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 19:48 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords embodied navigation · vision-language-action · unification · video-based model · robot navigation · long-horizon tasks · real-world generalization · multi-task learning

The pith

A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Uni-NaVid as a video-based vision-language-action model that combines several common embodied navigation tasks into one system. It achieves this unification by aligning the input and output data setups across tasks such as instruction following, object searching, question answering, and people tracking. Training on 3.6 million samples collected from four sub-tasks creates shared learning benefits. The approach supports mixed long-horizon navigation in new real-world settings without relying on pre-built maps or separate models for each task. Experiments show improved benchmark results and practical effectiveness in physical environments.

Core claim

Uni-NaVid is the first video-based vision-language-action model that unifies diverse embodied navigation tasks by harmonizing input and output data configurations for all commonly used tasks, thereby integrating four essential sub-tasks into a single model trained on 3.6 million navigation samples to foster learning synergy and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.

What carries the argument

Harmonization of input and output data configurations across tasks, which integrates them into one model without separate handling for each.
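To make the harmonization concrete, here is a minimal sketch of what a shared sample format could look like, assuming a small discrete action vocabulary and free-form language instructions. The field names and action set are illustrative guesses, not the paper's actual schema.

```python
# Hypothetical unified sample format; names and the action vocabulary are
# assumptions for illustration, not taken from the paper.
from dataclasses import dataclass, field
from typing import List

# Assumed shared low-level action vocabulary across all four sub-tasks.
ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

@dataclass
class NavSample:
    frames: List[bytes] = field(default_factory=list)           # egocentric RGB video history
    instruction: str = ""                                       # the task, phrased as language
    target_action_ids: List[int] = field(default_factory=list)  # indices into ACTIONS
    answer: str = ""  # optional language output, e.g. for question answering

# Two very different tasks expressed in the same format:
following = NavSample(instruction="Walk past the sofa and stop at the door.",
                      target_action_ids=[0, 0, 2, 3])
question = NavSample(instruction="How many chairs are in the kitchen?",
                     target_action_ids=[0, 1, 3], answer="two")
```

Under a format like this, every sub-task reduces to the same mapping from (video history, instruction) to output tokens, so a single model can consume all four without per-task heads.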

If this is right

  • Supports seamless transitions between navigation tasks in long sequences without swapping models.
  • Achieves state-of-the-art results on standard navigation benchmarks through shared training.
  • Demonstrates strong generalization to real-world settings with unseen environments.
  • Reduces reliance on pre-defined maps or discretized waypoints for practical use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unification method may extend to combining navigation with manipulation tasks in broader robotics systems.
  • Larger-scale data collection under the same harmonized format could further boost performance on rare task combinations.
  • Deployment might simplify robot software stacks by replacing multiple specialized navigation modules with one model.

Load-bearing premise

Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or negative interference.

What would settle it

Experiments showing clear performance drops on any single sub-task when trained jointly compared to isolated training, or inability to complete mixed long-horizon sequences in real-world tests without additional tuning.

read the original abstract

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Uni-NaVid, the first video-based vision-language-action model for unifying multiple embodied navigation tasks (instruction following, object search, question answering, person tracking) by harmonizing input/output data configurations across them. It trains a single model on 3.6 million joint samples from four sub-tasks to foster cross-task synergy, reports state-of-the-art results on standard benchmarks, and validates real-world performance on mixed long-horizon tasks in unseen environments.

Significance. If the unification produces genuine positive transfer without negative interference or performance loss on individual tasks, the result would be a meaningful step toward practical generalist navigation agents that handle diverse, long-horizon demands without task-specific retraining or maps.

major comments (2)
  1. [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.
  2. [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.
minor comments (2)
  1. [Abstract] Abstract and §1: the repeated claim of being the 'first' video-based VLA for unification should be supported by a concise related-work comparison table or explicit citation of the closest prior VLA navigation models.
  2. [Method] Notation and data description: the harmonized input/output configurations are described at a high level; a single table enumerating the exact tokenization, action space, and observation format for each of the four sub-tasks would improve reproducibility.
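As context for the table requested in minor comment 2, here is an illustrative guess at the four harmonized configurations, written as a Python mapping; none of these entries are confirmed by the paper, which describes the configurations only at a high level.

```python
# Illustrative only: hypothetical harmonized configurations for the four
# sub-tasks named in the abstract. Not the paper's actual table.
TASK_CONFIGS = {
    "instruction_following": dict(observation="egocentric RGB video",
                                  language="step-by-step route instruction",
                                  output="low-level action tokens"),
    "object_search":         dict(observation="egocentric RGB video",
                                  language="'find a <category>'-style prompt",
                                  output="low-level action tokens"),
    "question_answering":    dict(observation="egocentric RGB video",
                                  language="natural-language question",
                                  output="action tokens, then answer text"),
    "person_tracking":       dict(observation="egocentric RGB video",
                                  language="'follow the person'-style prompt",
                                  output="low-level action tokens"),
}
```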

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation for the claimed benefits of unification. We address each major comment below and will revise the manuscript to incorporate the suggested comparisons and metrics.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.

    Authors: We agree that a controlled ablation isolating unification effects from data scaling would strengthen the evidence for positive cross-task synergy. The current experiments focus on comparisons to existing task-specific SOTA methods, which use varying data volumes and architectures. To directly address this concern, we will add an ablation study in the revised manuscript that trains single-task models using the same architecture and total sample count of 3.6 million (by replicating or appropriately allocating the combined data), allowing clear distinction between scaling and unification benefits; a sketch of this compute-matched protocol appears after the responses. revision: yes

  2. Referee: [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.

    Authors: The results section reports performance on each individual benchmark, demonstrating that the unified model maintains or exceeds the performance of prior specialized approaches without explicit degradation. However, we acknowledge that explicit per-task breakdowns and quantitative interference metrics versus single-task baselines are not tabulated. We will revise the manuscript to include these, adding tables with per-sub-task metrics for the unified model alongside single-task equivalents and any measured interference or synergy indicators; the sketch after the responses includes this computation. revision: yes
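A minimal sketch of the two promised analyses, assuming caller-supplied train and evaluate callables and a datasets mapping from sub-task name to its sample list; all names and the budget handling are hypothetical and illustrate only the shape of the protocol.

```python
# Sketch of (1) the compute-matched ablation from response 1 and (2) the
# interference/synergy indicator from response 2. `train` and `evaluate`
# are assumed callables; all names here are hypothetical.
import random
from itertools import chain, cycle, islice

SUB_TASKS = ("instruction_following", "object_search",
             "question_answering", "person_tracking")

def run_ablation(datasets, train, evaluate, budget=3_600_000):
    """Compute-matched ablation: every condition uses the same architecture
    (inside `train`) and the same total sample budget, so a unified-model
    gain cannot be explained by data volume alone."""
    mixed = list(islice(cycle(chain.from_iterable(datasets.values())), budget))
    random.shuffle(mixed)
    unified = train(mixed)
    unified_scores = {t: evaluate(unified, t) for t in SUB_TASKS}
    single_scores = {}
    for t in SUB_TASKS:
        # Oversample the single task's own data up to the full budget.
        single_model = train(list(islice(cycle(datasets[t]), budget)))
        single_scores[t] = evaluate(single_model, t)
    return unified_scores, single_scores

def transfer_deltas(unified_scores, single_scores):
    """Per-task difference between joint and isolated training:
    positive = synergy, negative = interference."""
    return {t: unified_scores[t] - single_scores[t] for t in unified_scores}
```

Reporting transfer_deltas alongside the per-task tables would bear directly on the load-bearing premise flagged above.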

Circularity Check

0 steps flagged

No circularity; unification via explicit data harmonization and empirical benchmarks

full rationale

The paper defines Uni-NaVid through concrete architectural choices: harmonizing input/output data configurations across four navigation sub-tasks, collecting 3.6M joint samples, and training a single video-based VLA model. Performance claims rest on reported benchmark results and real-world experiments rather than any self-referential reduction. No equations or derivations equate a claimed prediction to its own fitted inputs by construction, no load-bearing self-citations appear in the provided text, and no uniqueness theorems or ansatzes are imported from prior author work. The derivation chain is self-contained and externally falsifiable via the stated experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on standard deep learning assumptions for vision-language models plus empirical training choices; no new physical entities or ungrounded axioms are introduced beyond typical ML practice.

free parameters (2)
  • model architecture hyperparameters
    Standard transformer and training hyperparameters selected for training on the 3.6M samples.
  • task data balancing weights
    Choices in how samples from the four sub-tasks are mixed during unified training (see the sampling sketch after this ledger).
axioms (1)
  • domain assumption: Transformer-based VLA architectures can effectively learn joint representations from video, language, and action data across tasks.
    Implicit foundation for the unification approach in the model-design section.
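A sketch of how the second free parameter might be realized in practice: weighted sampling over per-task pools. The mixing weights shown are assumptions for illustration; the provided text does not state the actual ratios.

```python
# Hypothetical task-mixing sampler; the weights are illustrative, not the
# paper's actual ratios.
import random

def sample_batch(pools, weights, batch_size=32, rng=random):
    """pools: sub-task name -> list of samples; weights: same keys -> float."""
    tasks = list(pools)
    probs = [weights[t] for t in tasks]
    batch = []
    for _ in range(batch_size):
        task = rng.choices(tasks, weights=probs, k=1)[0]
        batch.append(rng.choice(pools[task]))
    return batch

# e.g. upweighting scarcer sub-tasks:
# sample_batch(pools, {"instruction_following": 0.4, "object_search": 0.3,
#                      "question_answering": 0.15, "person_tracking": 0.15})
```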

pith-pipeline@v0.9.0 · 5532 in / 1219 out tokens · 34698 ms · 2026-05-16T19:48:37.748335+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced — tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Uni-NaVid achieves this by harmonizing the input and output data configurations... extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...

  2. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  3. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

    cs.RO · 2026-03 · conditional · novelty 7.0

    VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

  4. PanoWorld: Towards Spatial Supersensing in 360° Panorama World

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

  5. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  6. GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...

  7. FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

  8. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  9. FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.

  10. Ψ-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    Ψ-Map combines plane-constrained Gaussian surfels from LiDAR with end-to-end panoptic lifting to deliver high-precision geometric and semantic reconstruction in large-scale environments at real-time speeds.

  11. Visually-grounded Humanoid Agents

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  12. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  13. Memory Over Maps: 3D Object Localization Without Reconstruction

    cs.RO · 2026-03 · unverdicted · novelty 6.0

    A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...

  14. MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

    cs.CV · 2026-02 · unverdicted · novelty 6.0

    MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.

  15. AstraNav-World: World Model for Foresight Control and Consistency

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied n...

  16. Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    A monocular RGB-only aerial VLN framework outperforms baselines via prompt-guided multi-task learning, keyframe selection, and label reweighting on AerialVLN and OpenFly benchmarks.

  17. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV · 2025-07 · unverdicted · novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  18. LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

    cs.CV · 2026-05 · conditional · novelty 5.0

    LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...

  19. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 19 Pith papers · 13 internal anchors
