{"paper":{"title":"Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Haoran Liu, He Wang, Jiazhao Zhang, Kunyu Wang, Minghan Li, Shaoan Wang, Songlin Wei, Zhizheng Zhang, Zhongyuan Wang","submitted_at":"2024-12-09T05:55:55Z","abstract_excerpt":"A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Uni-NaVid is the first video-based vision-language-action model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or introduction of negative interference.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Uni-NaVid unifies diverse embodied navigation tasks into one video-based 
vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1c0f1afd72f5e25d072de0165064af22ef543242ee432af178a193b2660585c9"},"source":{"id":"2412.06224","kind":"arxiv","version":2},"verdict":{"id":"5b829172-a9aa-4c98-bfc9-e76f52ad61e4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T19:48:37.748335Z","strongest_claim":"Uni-NaVid is the first video-based vision-language-action model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.","one_line_summary":"Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or introduction of negative interference.","pith_extraction_headline":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats."},"references":{"count":126,"sample":[{"doi":"","year":2023,"title":"Etpnav: Evolving topological planning for vision-language navigation in continuous environments","work_id":"45d6d65d-9132-413b-95e2-b087bcf764c6","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"On 
Evaluation of Embodied Navigation Agents","work_id":"3b074aa9-2ff9-4ad6-8796-6a25689ecfd3","ref_index":3,"cited_arxiv_id":"1807.06757","is_internal_anchor":true},{"doi":"","year":2018,"title":"Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments","work_id":"48497812-da9c-417f-b670-cbfd4bc1db2e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Sim-to-real transfer for vision-and-language navigation","work_id":"05f66141-d322-4d1a-80c0-f642fea22591","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1968,"title":"Human memory: A proposed system and its control processes (vol","work_id":"bff6a436-7e9e-4af5-937a-13d7e66c64c4","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":126,"snapshot_sha256":"3da3c4a90e9cc6b094624f6c9531ce1e40b3bef67cb54c929d899cf34f3334bc","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"445449dff66f29273a40e9ccb44fd28e4b0e4932013babb6babb7565fbf69e29"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}