{"work":{"id":"2d60e542-3c1c-4656-becc-ded522856631","openalex_id":null,"doi":null,"arxiv_id":"2508.10934","raw_key":null,"title":"ViPE: Video Pose Engine for 3D Geometric Perception","authors":null,"authors_text":"Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren","year":2025,"venue":"cs.CV","abstract":"Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.","external_url":"https://arxiv.org/abs/2508.10934","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T04:45:20.593796+00:00","pith_arxiv_id":"2508.10934","created_at":"2026-05-09T23:04:17.885525+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"ViPE: Video Pose Engine for 3D Geometric Perception","render_title":"ViPE: Video Pose Engine for 3D Geometric Perception"},"hub":{"state":{"work_id":"2d60e542-3c1c-4656-becc-ded522856631","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":24,"external_cited_by_count":null,"distinct_field_count":1,"first_pith_cited_at":"2025-09-30T17:59:51+00:00","last_pith_cited_at":"2026-05-22T17:59:43+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-06T14:51:01.702914+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"method","n":4}],"polarity_counts":[{"context_polarity":"background","n":4},{"context_polarity":"use_method","n":4}],"runs":{},"summary":{},"graph":{},"authors":[]}}