Recognition: 2 theorem links
· Lean theorem
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Pith reviewed 2026-05-16 19:48 UTC · model grok-4.3
The pith
A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Uni-NaVid is presented as the first video-based vision-language-action model to unify diverse embodied navigation tasks. By harmonizing the input and output data configurations of all commonly used tasks, it integrates four essential sub-tasks into a single model trained on 3.6 million navigation samples, fostering learning synergy and enabling seamless navigation for mixed long-horizon tasks in unseen real-world environments.
What carries the argument
Harmonization of input and output data configurations across tasks, which integrates them into one model without separate handling for each.
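Concretely, the harmonization idea can be sketched as a shared sample schema into which every sub-task is cast (field names here are hypothetical illustrations; the paper's actual format is not given in this summary):

```python
# Hypothetical sketch of a harmonized navigation sample: every sub-task
# (instruction following, object search, question answering, tracking)
# is cast into one video-in / token-out format. Field names are
# illustrative placeholders, not the paper's actual schema.

def make_sample(task, frames, instruction, output_tokens):
    """Wrap any sub-task in one shared (video, language) -> tokens format."""
    assert task in {"vln", "object_goal", "eqa", "tracking"}
    return {
        "task": task,
        "observation": frames,       # RGB video frames up to time t
        "instruction": instruction,  # natural-language prompt
        "target": output_tokens,     # action tokens, or answer text for EQA
    }

vln = make_sample("vln", ["f0.png", "f1.png"], "Go past the sofa", ["FORWARD", "LEFT"])
eqa = make_sample("eqa", ["f0.png"], "What color is the chair?", ["blue"])
# Both samples now share one schema, so a single model can train on both.
```

Under this kind of scheme, the model never needs task-specific heads; the task identity is carried entirely by the instruction and target format.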
If this is right
- Supports seamless switching between navigation tasks in long sequences without switching models.
- Achieves state-of-the-art results on standard navigation benchmarks through shared training.
- Demonstrates strong generalization to real-world settings with unseen environments.
- Reduces reliance on pre-defined maps or discretized waypoints for practical use.
Where Pith is reading between the lines
- The unification method may extend to combining navigation with manipulation tasks in broader robotics systems.
- Larger-scale data collection under the same harmonized format could further boost performance on rare task combinations.
- Deployment might simplify robot software stacks by replacing multiple specialized navigation modules with one model.
Load-bearing premise
Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or negative interference.
What would settle it
Experiments showing clear performance drops on any single sub-task when trained jointly compared to isolated training, or inability to complete mixed long-horizon sequences in real-world tests without additional tuning.
read the original abstract
A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Uni-NaVid, the first video-based vision-language-action model for unifying multiple embodied navigation tasks (instruction following, object search, question answering, person tracking) by harmonizing input/output data configurations across them. It trains a single model on 3.6 million joint samples from four sub-tasks to foster cross-task synergy, reports state-of-the-art results on standard benchmarks, and validates real-world performance on mixed long-horizon tasks in unseen environments.
Significance. If the unification produces genuine positive transfer without negative interference or performance loss on individual tasks, the result would be a meaningful step toward practical generalist navigation agents that handle diverse, long-horizon demands without task-specific retraining or maps.
major comments (2)
- [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.
- [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.
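The interference analysis requested above could be as simple as per-task deltas between the unified model and matched single-task baselines; a sketch with invented numbers:

```python
# Sketch of the interference/synergy metric the report asks for: per-task
# success-rate delta between the unified model and single-task baselines
# trained under matched conditions. All numbers are invented for
# illustration, not taken from the paper.

def synergy(unified, single):
    """Per-task delta; positive = positive transfer, negative = interference."""
    return {task: round(unified[task] - single[task], 3) for task in unified}

single_task = {"vln": 0.55, "object_goal": 0.60, "eqa": 0.48, "tracking": 0.70}
unified     = {"vln": 0.58, "object_goal": 0.63, "eqa": 0.46, "tracking": 0.72}

deltas = synergy(unified, single_task)
interference = {t: d for t, d in deltas.items() if d < 0}
```

A table of such deltas, reported per sub-task, would directly test the no-negative-interference assumption the stress test flags.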
minor comments (2)
- [Abstract] Abstract and §1: the repeated claim of being the 'first' video-based VLA for unification should be supported by a concise related-work comparison table or explicit citation of the closest prior VLA navigation models.
- [Method] Notation and data description: the harmonized input/output configurations are described at a high level; a single table enumerating the exact tokenization, action space, and observation format for each of the four sub-tasks would improve reproducibility.
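For illustration, the requested per-sub-task configuration table might be encoded as follows (all entries are hypothetical placeholders, not the paper's actual tokenization or action space):

```python
# Hypothetical per-sub-task configuration table of the kind the minor
# comment requests: observation format, instruction style, and output
# space for each of the four sub-tasks. Entries are illustrative only.

CONFIGS = {
    "instruction_following": {"obs": "RGB video", "instr": "step-by-step route", "out": "action tokens"},
    "object_search":         {"obs": "RGB video", "instr": "target object name", "out": "action tokens"},
    "question_answering":    {"obs": "RGB video", "instr": "question",           "out": "text answer"},
    "person_tracking":       {"obs": "RGB video", "instr": "follow directive",   "out": "action tokens"},
}

# Harmonization check: every sub-task shares the same observation modality.
assert len({cfg["obs"] for cfg in CONFIGS.values()}) == 1
```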
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation for the claimed benefits of unification. We address each major comment below and will revise the manuscript to incorporate the suggested comparisons and metrics.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the central claim that harmonizing configurations yields 'advantages of unification modeling' and positive synergy rests on SOTA benchmark numbers after joint training on 3.6M samples, yet no ablation is reported that compares the unified model against single-task models trained on identical data splits, architecture, and total sample count. Without this comparison, reported gains cannot be distinguished from simple data-volume scaling.
Authors: We agree that a controlled ablation isolating unification effects from data scaling would strengthen the evidence for positive cross-task synergy. The current experiments focus on comparisons to existing task-specific SOTA methods, which use varying data volumes and architectures. To directly address this concern, we will add an ablation study in the revised manuscript that trains single-task models using the same architecture and total sample count of 3.6 million (by replicating or appropriately allocating the combined data), allowing a clear distinction between scaling and unification benefits.
revision: yes
-
Referee: [Experiments] §4 (or equivalent results section): the abstract and text assert no loss of performance on individual tasks and seamless handling of mixed tasks, but the provided results do not include per-task breakdowns or interference metrics for the unified model versus the single-task baselines; this directly bears on the weakest assumption identified in the stress test.
Authors: The results section reports performance on each individual benchmark, demonstrating that the unified model maintains or exceeds the performance of prior specialized approaches without explicit degradation. However, we acknowledge that explicit per-task breakdowns and quantitative interference metrics versus single-task baselines are not tabulated. We will revise the manuscript to include these, adding tables with per-sub-task metrics for the unified model alongside single-task equivalents and any measured interference or synergy indicators.
revision: yes
Circularity Check
No circularity; unification via explicit data harmonization and empirical benchmarks
full rationale
The paper defines Uni-NaVid through concrete architectural choices: harmonizing input/output data configurations across four navigation sub-tasks, collecting 3.6M joint samples, and training a single video-based VLA model. Performance claims rest on reported benchmark results and real-world experiments rather than any self-referential reduction. No equations or derivations equate a claimed prediction to its own fitted inputs by construction, no load-bearing self-citations appear in the provided text, and no uniqueness theorems or ansatzes are imported from prior author work. The derivation chain is self-contained and externally falsifiable via the stated experiments.
Axiom & Free-Parameter Ledger
free parameters (2)
- model architecture hyperparameters
- task data balancing weights
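The second free parameter could, for instance, take the form of temperature-controlled sampling probabilities across sub-task datasets; a sketch with invented dataset sizes:

```python
# The "task data balancing weights" free parameter, sketched as
# temperature-based sampling: tau = 1 samples proportionally to dataset
# size, larger tau moves toward uniform sampling. Dataset sizes below
# are invented, not the paper's actual split of the 3.6M samples.

def sampling_weights(sizes, tau=2.0):
    """Return per-task sampling probabilities from raw dataset sizes."""
    scaled = {t: n ** (1.0 / tau) for t, n in sizes.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}

sizes = {"vln": 2_000_000, "object_goal": 1_000_000, "eqa": 300_000, "tracking": 300_000}
w = sampling_weights(sizes)
assert abs(sum(w.values()) - 1.0) < 1e-9
# With tau > 1, smaller tasks are upweighted relative to raw proportions.
```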
axioms (1)
- domain assumption Transformer-based VLA architectures can effectively learn joint representations from video, language, and action data across tasks.
Lean theorems connected to this paper
-
Foundation.DimensionForcing.dimension_forced · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Uni-NaVid achieves this by harmonizing the input and output data configurations... extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
-
PanoWorld: Towards Spatial Supersensing in 360° Panorama World
PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
-
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...
-
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
-
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
-
FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.
-
Ψ-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer
Ψ-Map combines plane-constrained Gaussian surfels from LiDAR with end-to-end panoptic lifting to deliver high-precision geometric and semantic reconstruction in large-scale environments at real-time speeds.
-
Visually-grounded Humanoid Agents
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
-
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
-
Memory Over Maps: 3D Object Localization Without Reconstruction
A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...
-
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.
-
AstraNav-World: World Model for Foresight Control and Consistency
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied n...
-
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
A monocular RGB-only aerial VLN framework outperforms baselines via prompt-guided multi-task learning, keyframe selection, and label reweighting on AerialVLN and OpenFly benchmarks.
-
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
-
LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation
LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benc...
-
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
Reference graph
Works this paper leans on
-
[1]
Etpnav: Evolving topological planning for vision-language navigation in continuous environments
Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. arXiv preprint arXiv:2304.03047, 2023
-
[3]
On Evaluation of Embodied Navigation Agents
Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018
work page · Pith review · arXiv 2018
-
[4]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018
work page 2018
-
[5]
Sim-to-real transfer for vision-and-language navigation
Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Stefan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning, pages 671–681. PMLR, 2021
work page 2021
-
[6]
Human memory: A proposed system and its control processes (vol. 2)
RC Atkinson and RM Shiffrin. Human memory: A proposed system and its control processes (vol. 2). The Psychology of Learning and Motivation: Advances in Research and Theory, pages 89–195, 1968
work page 1968
-
[7]
Scanqa: 3d question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022
work page 2022
-
[8]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. ArXiv, abs/2210.09461, 2022
work page · Pith review · arXiv 2022
-
[9]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page · Pith review · arXiv 2023
-
[10]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970, 2015
work page 2015
-
[11]
Matterport3d: Learning from rgb-d data in indoor environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017
work page 2017
-
[12]
Object goal navigation using goal-oriented semantic exploration
Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33:4247–4258, 2020
work page 2020
-
[13]
Collecting highly parallel data for paraphrase evaluation
David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190–200, 2011
work page 2011
-
[14]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023
work page · Pith review · arXiv 2023
-
[15]
Topological planning with transformers for vision-and-language navigation
Kevin Chen, Junshen K Chen, Jo Chuang, Marynel Vázquez, and Silvio Savarese. Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11276–11286, 2021
work page 2021
-
[16]
Weakly-supervised multi-granularity map learning for vision-and-language navigation
Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. arXiv preprint arXiv:2210.07506, 2022
-
[17]
Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. Action-aware zero-shot robot navigation by exploiting vision-and-language ability of foundation models. arXiv preprint arXiv:2308.07997, 2023
-
[18]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024
work page · Pith review · arXiv 2024
-
[19]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024
work page 2024
-
[20]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023
work page 2023
-
[21]
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–10, 2018
work page 2018
-
[22]
Clip-nav: Using clip for zero-shot vision-and-language navigation
Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, and Gaurav S Sukhatme. Clip-nav: Using clip for zero-shot vision-and-language navigation. arXiv preprint arXiv:2211.16649, 2022
-
[23]
A survey of embodied ai: From simulators to research tasks
Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022
work page 2022
-
[24]
The one ring: a robotic indoor navigation generalist
Ainaz Eftekhar, Luca Weihs, Rose Hendrix, Ege Caglar, Jordi Salvador, Alvaro Herrasti, Winson Han, Eli VanderBilt, Aniruddha Kembhavi, Ali Farhadi, et al. The one ring: a robotic indoor navigation generalist. arXiv preprint arXiv:2412.14401, 2024
-
[25]
Principles and guidelines for evaluating social robot navigation algorithms
Anthony Francis, Claudia Pérez-d'Arpino, Chengshu Li, Fei Xia, Alexandre Alahi, Rachid Alami, Aniket Bera, Abhijat Biswas, Joydeep Biswas, Rohan Chandra, et al. Principles and guidelines for evaluating social robot navigation algorithms. arXiv preprint arXiv:2306.16740, 2023
-
[26]
Cross-modal map learning for vision and language navigation
Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15460–15470, 2022
work page 2022
-
[27]
Navigating to objects in the real world
Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, and Devendra Singh Chaplot. Navigating to objects in the real world. Science Robotics, 8(79):eadf6991, 2023
work page 2023
-
[28]
A novel vision-based tracking algorithm for a human-following mobile robot
Meenakshi Gupta, Swagat Kumar, Laxmidhar Behera, and Venkatesh K Subramanian. A novel vision-based tracking algorithm for a human-following mobile robot. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(7):1415–1427, 2016
work page 2016
-
[29]
Exaug: Robot-conditioned navigation policies via geometric experience augmentation
Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Exaug: Robot-conditioned navigation policies via geometric experience augmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 4077–4084. IEEE, 2023
work page 2023
-
[30]
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, De-Feng Liu, Bin Xu, Juanzi Li, Yu-Chen Dong, and Jie Tang. Cogvlm2: Visual language models for image and video understanding...
work page · Pith review · arXiv 2024
-
[31]
Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022
work page 2022
-
[32]
3d-llm: Injecting the 3d world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023
work page 2023
-
[33]
Toward socially aware person-following robots
Shanee S Honig, Tal Oron-Gilad, Hanan Zaichyk, Vardit Sarne-Fleischmann, Samuel Olatunji, and Yael Edan. Toward socially aware person-following robots. IEEE Transactions on Cognitive and Developmental Systems, 10(4):936–954, 2018
work page 2018
-
[34]
Visual language maps for robot navigation
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022
-
[35]
Visual language maps for robot navigation
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023
work page 2023
-
[36]
Yulong Huang, Yonggang Zhang, Peng Shi, Zhemin Wu, Junhui Qian, and Jonathon A Chambers. Robust kalman filters based on gaussian scale mixture distributions with application to target tracking. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(10):2082–2096, 2017
work page 2017
-
[37]
Person-following by autonomous robots: A categorical overview
Md Jahidul Islam, Jungseok Hong, and Junaed Sattar. Person-following by autonomous robots: A categorical overview. The International Journal of Robotics Research, 38(14):1581–1618, 2019
work page 2019
-
[38]
Eqa-mx: Embodied question answering using multimodal expression
Md Mofijul Islam, Alexi Gladstone, Riashat Islam, and Tariq Iqbal. Eqa-mx: Embodied question answering using multimodal expression. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[39]
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024
work page 2024
-
[40]
Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Sim2real predictivity: Does evaluation in simulation predict real- world performance? IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020
work page 2020
-
[41]
Linh Kästner, Bassel Fatloun, Zhengcheng Shen, Daniel Gawrisch, and Jens Lambrecht. Human-following and-guiding in crowded environments using semantic deep-reinforcement-learning for mobile service robots. In 2022 International Conference on Robotics and Automation (ICRA), pages 833–839, 2022
work page 2022
-
[42]
Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Re...
work page 2024
-
[43]
Sim-2-sim transfer for vision-and-language navigation in continuous environments
Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 588–603. Springer, 2022
work page 2022
-
[44]
Beyond the nav-graph: Vision-and-language navigation in continuous environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, 2020. URL https://api.semanticscholar.org/CorpusID:214802389
work page 2020
-
[45]
Waypoint models for instruction-guided navigation in continuous environments
Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021
work page 2021
-
[47]
Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020
work page 2020
-
[48]
Openfmnav: Towards open-set zero-shot object navigation via vision-language foundation models
Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfmnav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670, 2024
-
[49]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
work page · Pith review · arXiv 2023
-
[50]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024
work page 2024
-
[51]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023
-
[52] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. arXiv preprint arXiv:2411.15139, 2024.
[53] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[54] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[55] Peiqi Liu, Yaswanth Orru, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
[56] Ruyang Liu, Chen Li, Yixiao Ge, Thomas H Li, Ying Shan, and Ge Li. Bt-adapter: Video conversation is feasible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13658–13667, 2024.
[57] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
[58] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[59] Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai. arXiv preprint arXiv:2407.06886, 2024.
[60] Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. arXiv preprint arXiv:2309.11382, 2023.
[61] Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024.
[62] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[63] Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16488–16498, 2024.
[64] Christoforos Mavrogiannis, Francesca Baldini, Allan Wang, Dapeng Zhao, Pete Trautman, Aaron Steinfeld, and Jean Oh. Core challenges of social robot navigation: A survey. ACM Transactions on Human-Robot Interaction, 12(3):1–39, 2023.
[65] Wentao Mo and Yang Liu. Bridging the gap between 2d and 3d visual question answering: A fusion approach for 3d vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4261–4268, 2024.
[66] Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12537, 2019.
[67]
[68] Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.
[69] Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021.
[70] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[71] Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-web: Learning embodied object-search strategies from human demonstrations at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5173–5183, 2022.
[72] Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. Pirlnav: Pretraining with imitation and rl finetuning for objectnav. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17896–17906, 2023.
[73] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel X Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2109.15207, 2021.
[74] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
[75] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. ICCV, 2019.
[76] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.
[77] James A Sethian. Fast marching methods. SIAM Review, 41(2):199–235, 1999.
[78] Dhruv Shah, Błażej Osiński, Sergey Levine, et al. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pages 492–504. PMLR, 2023.
[79] Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.
[80] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.
[81] Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024.
[82] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.