FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Bharath Hariharan; Gene Chou; Guandao Yang; Mohamed Abdelfattah; Ning Yu; Noah Snavely; Paul Debevec; Wenqi Xian

arxiv: 2504.07093 · v2 · pith:BOCRDFTXnew · submitted 2025-04-09 · 💻 cs.CV

Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec

keywords: depth, video, estimation, flashdepth, model, streaming, across, high-resolution

A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth
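
To make the real-time claim concrete, the arithmetic behind "2044x1148 at 24 FPS" can be worked out as below. This is only a back-of-the-envelope sketch based on the two figures quoted in the abstract; the function names are hypothetical and are not taken from the FlashDepth code release.

```python
# Figures quoted in the abstract; everything else here is illustrative.
WIDTH, HEIGHT = 2044, 1148
FPS = 24


def frame_budget_ms(fps: float) -> float:
    """Maximum average time per frame (in milliseconds) to sustain `fps`."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of depth values that must be produced each second."""
    return width * height * fps


if __name__ == "__main__":
    print(f"per-frame budget: {frame_budget_ms(FPS):.2f} ms")
    print(f"pixels per frame: {WIDTH * HEIGHT:,}")
    print(f"depth values per second: {pixels_per_second(WIDTH, HEIGHT, FPS):,}")
```

Under these numbers, a streaming model has roughly 41.67 ms on average to process each frame of about 2.35 million pixels, which is the constraint the abstract's speed claim refers to.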

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
cs.CV · 2026-05 · unverdicted · novelty 6.0

DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation whi...

Vista4D: Video Reshooting with 4D Point Clouds
cs.CV · 2026-04 · unverdicted · novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV · 2025-08 · unverdicted · novelty 5.0

ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.