{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:3NSFLYB5XM5L4DIY4H65GBVNE4","short_pith_number":"pith:3NSFLYB5","schema_version":"1.0","canonical_sha256":"db6455e03dbb3abe0d18e1fdd306ad272fa57104b1f13a1816e9da3eaae1b047","source":{"kind":"arxiv","id":"2412.06224","version":2},"attestation_state":"computed","paper":{"title":"Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Haoran Liu, He Wang, Jiazhao Zhang, Kunyu Wang, Minghan Li, Shaoan Wang, Songlin Wei, Zhizheng Zhang, Zhongyuan Wang","submitted_at":"2024-12-09T05:55:55Z","abstract_excerpt":"A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. 
In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2412.06224","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2024-12-09T05:55:55Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"6de18cdeccb65d161bb2f2cf81f80abd312ccf44a0191d91de55ac49a7636abb","abstract_canon_sha256":"9678bece30c07e9689b0428da3a3ad5864662f0e3e9873458f46b86eb5418661"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.784715Z","signature_b64":"3U2jNtwWmSCiSsBvoDn3O+6qYe+4443B1x6cmLi67fgMR7Z43qB3VGvlQPspcxmA1tLOAzbLlIOhiijQ7pjtAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"db6455e03dbb3abe0d18e1fdd306ad272fa57104b1f13a1816e9da3eaae1b047","last_reissued_at":"2026-05-17T23:38:46.784204Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.784204Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Haoran Liu, He Wang, Jiazhao Zhang, Kunyu Wang, Minghan 
Li, Shaoan Wang, Songlin Wei, Zhizheng Zhang, Zhongyuan Wang","submitted_at":"2024-12-09T05:55:55Z","abstract_excerpt":"A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Uni-NaVid is the first video-based vision-language-action model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or introduction of negative interference.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single video-based model unifies multiple robot navigation tasks by standardizing their data 
formats.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1c0f1afd72f5e25d072de0165064af22ef543242ee432af178a193b2660585c9"},"source":{"id":"2412.06224","kind":"arxiv","version":2},"verdict":{"id":"5b829172-a9aa-4c98-bfc9-e76f52ad61e4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T19:48:37.748335Z","strongest_claim":"Uni-NaVid is the first video-based vision-language-action model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.","one_line_summary":"Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Harmonizing input and output data configurations across tasks allows effective integration and positive synergy in learning without loss of performance on individual tasks or introduction of negative interference.","pith_extraction_headline":"A single video-based model unifies multiple robot navigation tasks by standardizing their data formats."},"references":{"count":126,"sample":[{"doi":"","year":2023,"title":"Etpnav: Evolving topological planning for vision-language navigation in continuous environments","work_id":"45d6d65d-9132-413b-95e2-b087bcf764c6","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"On Evaluation of Embodied Navigation Agents","work_id":"3b074aa9-2ff9-4ad6-8796-6a25689ecfd3","ref_index":3,"cited_arxiv_id":"1807.06757","is_internal_anchor":true},{"doi":"","year":2018,"title":"Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real 
environments","work_id":"48497812-da9c-417f-b670-cbfd4bc1db2e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Sim-to-real transfer for vision-and-language navigation","work_id":"05f66141-d322-4d1a-80c0-f642fea22591","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1968,"title":"Human memory: A proposed system and its control processes (vol","work_id":"bff6a436-7e9e-4af5-937a-13d7e66c64c4","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":126,"snapshot_sha256":"3da3c4a90e9cc6b094624f6c9531ce1e40b3bef67cb54c929d899cf34f3334bc","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"445449dff66f29273a40e9ccb44fd28e4b0e4932013babb6babb7565fbf69e29"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2412.06224","created_at":"2026-05-17T23:38:46.784283+00:00"},{"alias_kind":"arxiv_version","alias_value":"2412.06224v2","created_at":"2026-05-17T23:38:46.784283+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2412.06224","created_at":"2026-05-17T23:38:46.784283+00:00"},{"alias_kind":"pith_short_12","alias_value":"3NSFLYB5XM5L","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"3NSFLYB5XM5L4DIY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"3NSFLYB5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2605.22036","citing_title":"GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05377","citing_title":"OpenFrontier: General Navigation with 
Visual-Language Grounded Frontiers","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17249","citing_title":"SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19506","citing_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16899","citing_title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13169","citing_title":"PanoWorld: Towards Spatial Supersensing in 360$^\\circ$ Panorama World","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2509.10796","citing_title":"Follow-Bench: A Unified Motion Planning Benchmark for Socially-Aware Robot Person Following","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17097","citing_title":"Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2512.08639","citing_title":"Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21714","citing_title":"AstraNav-World: World Model for Foresight Control and Consistency","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04447","citing_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2602.05467","citing_title":"MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2603.07080","citing_title":"VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics 
Awareness","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2603.20530","citing_title":"Memory Over Maps: 3D Object Localization Without Reconstruction","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13169","citing_title":"PanoWorld: Towards Spatial Supersensing in 360$^\\circ$ Panorama World","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27620","citing_title":"SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09441","citing_title":"Beyond Isolation: A Unified Benchmark for General-Purpose Navigation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09053","citing_title":"LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25459","citing_title":"GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24391","citing_title":"FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24086","citing_title":"AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10982","citing_title":"{\\Psi}-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08232","citing_title":"HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07973","citing_title":"How Far Are Large Multimodal Models from Human-Level Spatial Action? 
A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08509","citing_title":"Visually-grounded Humanoid Agents","ref_index":113,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4","json":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4.json","graph_json":"https://pith.science/api/pith-number/3NSFLYB5XM5L4DIY4H65GBVNE4/graph.json","events_json":"https://pith.science/api/pith-number/3NSFLYB5XM5L4DIY4H65GBVNE4/events.json","paper":"https://pith.science/paper/3NSFLYB5"},"agent_actions":{"view_html":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4","download_json":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4.json","view_paper":"https://pith.science/paper/3NSFLYB5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2412.06224&json=true","fetch_graph":"https://pith.science/api/pith-number/3NSFLYB5XM5L4DIY4H65GBVNE4/graph.json","fetch_events":"https://pith.science/api/pith-number/3NSFLYB5XM5L4DIY4H65GBVNE4/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4/action/storage_attestation","attest_author":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4/action/author_attestation","sign_citation":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4/action/citation_signature","submit_replication":"https://pith.science/pith/3NSFLYB5XM5L4DIY4H65GBVNE4/action/replication_record"}},"created_at":"2026-05-17T23:38:46.784283+00:00","updated_at":"2026-05-17T23:38:46.784283+00:00"}