arxiv: 2504.07093 · v2 · pith:BOCRDFTXnew · submitted 2025-04-09 · 💻 cs.CV

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

classification 💻 cs.CV
keywords: depth, video, estimation, flashdepth, model, streaming, across, high-resolution

A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

    cs.CV 2026-05 unverdicted novelty 6.0

    DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation whi...

  2. Vista4D: Video Reshooting with 4D Point Clouds

    cs.CV 2026-04 unverdicted novelty 6.0

    Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

  3. ViPE: Video Pose Engine for 3D Geometric Perception

    cs.CV 2025-08 unverdicted novelty 5.0

    ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.