ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning

Jiaxin Xie; Mingzhe Zheng; Qifeng Chen; Xuanhua He

arxiv: 2607.01677 · v1 · pith:HB4SIMBOnew · submitted 2026-07-02 · 💻 cs.CV

ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning

Xuanhua He , Jiaxin Xie , Mingzhe Zheng , Qifeng Chen This is my paper

Pith reviewed 2026-07-03 16:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords video depth estimationdiffusion modelsin-context conditioningdata efficiencyzero-shot generalizationtemporal consistencymonocular depth

0 comments

The pith

Pre-trained video diffusion transformers can be adapted for monocular video depth estimation via in-context conditioning to reach state-of-the-art accuracy with far less training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the spatial-temporal priors inside text-to-video diffusion models transfer to dense geometric prediction when the models are steered by in-context conditioning. If true, this would let generative approaches deliver both long-range temporal consistency and geometric precision without the 10M+ frame datasets that earlier generative depth methods required. The authors introduce SAND-Attention to keep spatial-temporal features aligned and SRFM to add semantic priors from DINOv2, then show the resulting system beats prior methods on standard benchmarks while using only 0.8M frames and generalizing zero-shot to new domains. A reader would care because the result suggests a route to more data-efficient geometric vision by reusing existing generative training rather than starting from scratch.

Core claim

ICDepth adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning. Key adaptations are SAND-Attention, which enforces precise spatial-temporal alignment through shared RoPE and unidirectional attention to block noise, and SRFM, which injects DINOv2 semantic and resolution priors to improve geometric fidelity. The resulting model reaches state-of-the-art results on multiple benchmarks after training on only 0.8M frames, six to thirteen times less data than competing generative methods, while showing strong zero-shot generalization across domains.

What carries the argument

In-Context Conditioning (ICC) on diffusion transformers, realized through SAND-Attention for alignment and SRFM for semantic prior injection.

If this is right

Generative depth methods no longer require 10M+ training frames to achieve temporal consistency.
Depth estimates can maintain consistency across long video sequences without per-frame drift.
Models trained this way generalize to new visual domains without additional labeled data.
Geometric precision on standard benchmarks matches or exceeds that of purely discriminative video depth estimators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning approach could be tested on other dense video tasks such as optical flow or surface normal estimation.
Lower data requirements might allow depth models to be retrained quickly when new camera hardware appears.
If the priors prove broadly transferable, similar adaptations might succeed on non-diffusion backbones for geometric prediction.

Load-bearing premise

The spatial-temporal priors learned during text-to-video pre-training remain rich enough to support precise geometric prediction once steered by in-context conditioning.

What would settle it

A controlled experiment in which ICDepth, after the same 0.8M-frame budget, produces depth maps whose per-frame accuracy or temporal consistency falls below that of strong discriminative baselines on a held-out benchmark would falsify the central claim.

Figures

Figures reproduced from arXiv: 2607.01677 by Jiaxin Xie, Mingzhe Zheng, Qifeng Chen, Xuanhua He.

**Figure 1.** Figure 1: Robust depth estimation across diverse scenarios. Our method performs highresolution depth estimation (1080p) on videos with varying aspect ratios and demonstrates strong generalization across challenging conditions including foggy weather, nighttime scenes, underwater footage, and both 2D and 3D animated content. prediction for the current frame. The third is generalization ability in diverse scenes. Ov… view at source ↗

**Figure 2.** Figure 2: The framework of our proposed ICDepth, which contains two components: SRFM and SAND-Attention. The RoPE alignment ensures that corresponding spatialtemporal positions in the video-depth paired data share identical RoPE positional encodings. 3.1 Preliminary video diffusion transformer (VDiT) Our approach builds upon the VDiT, which leverages a pure transformer architecture for video generation. The model … view at source ↗

**Figure 3.** Figure 3: We compare against representative video depth estimation baselines. For temporal consistency visualization, we show depth profiles (green boxes) extracted along the temporal dimension at the green line locations, illustrating the temporal stability of each method. 4.4 Ablation Studies To validate the effectiveness of our design, we conduct ablation studies on the Sintel dataset to answer three key questio… view at source ↗

**Figure 4.** Figure 4: Qualitative comparison under challenging nighttime conditions. We compare our method with state-of-the-art approaches on nighttime scenes. The second row displays spatio-temporal plots along the green scanlines. Our method demonstrates superior depth accuracy and temporal stability, producing flicker-free results even under adverse illumination and visibility. dard channel concatenation? Q2: How to adapt… view at source ↗

read the original abstract

Monocular video depth estimation requires temporal consistency, geometric accuracy, and generalization across diverse scenarios, yet existing methods struggle to achieve all three simultaneously. Discriminative models excel at per-frame accuracy but suffer from temporal drift due to limited context windows, while generative methods improve consistency and generalization at the cost of extensive training data (10M+ samples) and lack of geometric precision. In response to these issues, we introduce \textbf{ICDepth}, a framework that adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning (ICC), leveraging their rich spatial-temporal priors. To address key challenges in transferring ICC from generation to dense prediction, we propose: (1)~\textbf{SAND-Attention}, which ensures precise spatial-temporal alignment via shared RoPE and enforces unidirectional attention to prevent noise contamination; (2)~\textbf{SRFM}, which injects DINOv2 semantic and resolution priors to enhance geometric precision. ICDepth achieves state-of-the-art results on multiple benchmarks with remarkable data efficiency, trained on only 0.8M frames ($6$--$13\times$ less than competing generative methods), while demonstrating strong zero-shot generalization to diverse domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ICDepth adapts video diffusion transformers to depth via in-context conditioning and two targeted fixes, claiming strong data efficiency on the abstract's terms.

read the letter

The core contribution is a transfer method that takes pre-trained text-to-video diffusion transformers and turns them into video depth estimators through in-context conditioning. They introduce SAND-Attention to enforce spatial-temporal alignment with shared RoPE and unidirectional attention so noise does not leak back, plus SRFM to inject DINOv2 semantic and resolution signals for sharper geometry. This setup lets them train on 0.8M frames and report SOTA plus zero-shot behavior across domains.

The approach makes sense on paper. Generative models already carry rich motion and appearance priors; the question is whether those priors can be steered toward dense geometric output without retraining from scratch or losing precision. The two proposed components directly target the alignment and precision gaps that usually appear when you repurpose a generative backbone for prediction. If the full experiments show the claimed 6-13x data reduction while beating recent generative depth baselines on standard benchmarks, that is a concrete practical gain.

The main soft spot is that the abstract states SOTA and efficiency numbers without the supporting tables, ablations, or error breakdowns visible here. It is hard to tell how much the new attention and prior modules actually move the needle versus the choice of base model or training schedule. The assumption that diffusion priors transfer cleanly to geometry also needs the per-scene failure cases to be shown, not just aggregate wins. Minor implementation details like exact conditioning format and how SRFM is fused would help reproducibility.

This is for groups working on video depth, 3D reconstruction from video, or anyone trying to reuse large generative backbones for dense tasks. A reader already following diffusion adaptation work would get the most out of the specific fixes. The paper is coherent enough on its own terms to deserve a serious referee, even if the experiments ultimately need tightening.

Referee Report

0 major / 3 minor

Summary. The paper introduces ICDepth, a framework adapting pre-trained text-to-video diffusion transformers for monocular video depth estimation via In-Context Conditioning (ICC). It proposes SAND-Attention to ensure spatial-temporal alignment through shared RoPE and unidirectional attention, and SRFM to inject DINOv2 semantic and resolution priors for geometric precision. The central empirical claim is state-of-the-art performance on multiple benchmarks, trained on only 0.8M frames (6-13× less data than competing generative methods), with strong zero-shot generalization across domains.

Significance. If the reported results hold, the work would be significant for demonstrating efficient transfer of rich spatial-temporal priors from generative video diffusion models to dense geometric prediction tasks. The data efficiency (0.8M frames) and ability to maintain both temporal consistency and geometric accuracy without massive additional training data represent a meaningful advance over prior discriminative methods (limited context) and generative approaches (high data cost, lower precision).

minor comments (3)

[Abstract] Abstract: strong SOTA and data-efficiency claims are made without any quantitative metrics, specific benchmark names, or error values; adding 1-2 key numbers (e.g., AbsRel or δ1 on a standard dataset) would improve the abstract's informativeness.
[Abstract] The acronyms SAND-Attention and SRFM are introduced without parenthetical expansions or one-sentence functional descriptions in the abstract; this reduces immediate readability for readers outside the immediate sub-area.
The manuscript should clarify whether the reported 0.8M-frame count includes only the fine-tuning data or also any auxiliary pre-training stages, and provide a direct comparison table against the exact data volumes used by the cited generative baselines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of ICDepth, the recognition of its data efficiency (0.8M frames) and zero-shot generalization, and the recommendation for minor revision. We will incorporate any minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical adaptation of pre-trained video diffusion transformers to monocular video depth estimation via in-context conditioning, introducing SAND-Attention for alignment and SRFM for semantic priors. All claims rest on benchmark comparisons and data-efficiency measurements (0.8M frames) rather than any mathematical derivation chain, first-principles predictions, or quantities that reduce to fitted inputs by construction. No self-citations function as load-bearing uniqueness theorems, and the method is presented as an engineering transfer with external validation on standard datasets. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review limited to abstract; no equations, training details, or full methods available to enumerate parameters or axioms exhaustively.

axioms (1)

domain assumption Pre-trained text-to-video diffusion transformers contain rich spatial-temporal priors that transfer effectively to dense prediction tasks.
Invoked in abstract as the basis for leveraging these models via ICC.

invented entities (2)

SAND-Attention no independent evidence
purpose: Ensures precise spatial-temporal alignment via shared RoPE and unidirectional attention to prevent noise contamination.
New mechanism proposed to address transfer challenges from generation to depth prediction.
SRFM no independent evidence
purpose: Injects DINOv2 semantic and resolution priors to enhance geometric precision.
New component introduced to improve accuracy in the adapted model.

pith-pipeline@v0.9.1-grok · 5750 in / 1178 out tokens · 23649 ms · 2026-07-03T16:57:34.107010+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 19 canonical work pages · 4 internal anchors

[1]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 16 X. He et al

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Virtual KITTI 2

Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[3]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video depth anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22831–22840 (2025)

2025
[4]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner,M.:Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017)

2017
[5]

The international journal of robotics research32(11), 1231–1237 (2013)

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The international journal of robotics research32(11), 1231–1237 (2013)

2013
[6]

Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025

Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025)

work page arXiv 2025
[7]

arXiv preprint arXiv:2506.04213 (2025)

He, X., Liu, Q., Ye, Z., Ye, W., Wang, Q., Wang, X., Chen, Q., Wan, P., Zhang, D., Gai, K.: Fulldit2: Efficient in-context conditioning for video diffusion transformers. arXiv preprint arXiv:2506.04213 (2025)

work page arXiv 2025
[8]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., Shan, Y.: Depthcrafter: Generating consistent long depth sequences for open-world videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2005–2015 (2025)

2005
[9]

In-context LoRA for diffusion transformers

Huang,L.,Wang,W.,Wu,Z.F.,Shi,Y.,Dou,H.,Liang,C.,Feng,Y.,Liu,Y.,Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

work page arXiv 2024
[10]

arXiv preprint arXiv:2503.19907 (2025)

Ju, X., Ye, W., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Xu, Q.: Fulldit: Multi-task video generative foundation model with full attention. arXiv preprint arXiv:2503.19907 (2025)

work page arXiv 2025
[11]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Re- purposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 9492–9502 (2024)

2024
[12]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Kopf, J., Rong, X., Huang, J.B.: Robust consistent video depth estimation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 1611–1621 (2021)

2021
[13]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Kuang, Z., Zhang, T., Zhang, K., Tan, H., Bi, S., Hu, Y., Xu, Z., Hasan, M., Wetzstein, G., Luan, F.: Buffer anytime: Zero-shot video depth and normal from image priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17660–17670 (2025)

2025
[14]

arXiv preprint arXiv:2504.07960 (2025)

Li, Z.Y., Du, R., Yan, J., Zhuo, L., Li, Z., Gao, P., Ma, Z., Cheng, M.M.: Vi- sualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960 (2025)

work page arXiv 2025
[15]

arXiv preprint arXiv:2502.01061 (2025)

Lin, G., Jiang, J., Yang, J., Zheng, Z., Liang, C.: Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061 (2025)

work page arXiv 2025
[16]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

ACM Transactions on Graphics (ToG)39(4), 71–1 (2020)

Luo, X., Huang, J.B., Szeliski, R., Matzen, K., Kopf, J.: Consistent video depth estimation. ACM Transactions on Graphics (ToG)39(4), 71–1 (2020)

2020
[18]

arXiv preprint arXiv:2507.16869 (2025) ICDepth 17

Ma, Y., Feng, K., Hu, Z., Wang, X., Wang, Y., Zheng, M., He, X., Zhu, C., Liu, H., He, Y., et al.: Controllable video generation: A survey. arXiv preprint arXiv:2507.16869 (2025) ICDepth 17

work page arXiv 2025
[19]

arXiv preprint arXiv:2506.04590 (2025)

Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D.J., Xing, J., Zhang, Y., Yang, A., Wang, Z., Chen, Q.: Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590 (2025)

work page arXiv 2025
[20]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)

2024
[21]

arXiv preprint arXiv:2602.05551 (2026)

Ma, Y., Wang, Z., Ren, T., Zheng, M., Liu, H., Guo, J., Fong, M., Xue, Y., Zhao, Z., Schindler, K., et al.: Fastvmt: Eliminating redundancy in video motion transfer. arXiv preprint arXiv:2602.05551 (2026)

work page arXiv 2026
[22]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040–4048 (2016)

2016
[23]

arXiv preprint arXiv:2504.16915 (2025)

Mou, C., Wu, Y., Wu, W., Guo, Z., Zhang, P., Cheng, Y., Luo, Y., Ding, F., Zhang, S., Li, X., et al.: Dreamo: A unified framework for image customization. arXiv preprint arXiv:2504.16915 (2025)

work page arXiv 2025
[24]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

2023
[25]

1-Fun-1.3B-Control(2024)

PAI, A.: Wan2.1-fun-1.3b-control.https://huggingface.co/alibaba-pai/Wan2. 1-Fun-1.3B-Control(2024)

2024
[26]

In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Palazzolo, E., Behley, J., Lottes, P., Giguere, P., Stachniss, C.: Refusion: 3d recon- struction in dynamic environments for rgb-d cameras exploiting residuals. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7855–7862. IEEE (2019)

2019
[27]

arXiv preprint arXiv:2505.10696 (2025)

Patel, M., Yang, F., Qiu, Y., Cadena, C., Scherer, S., Hutter, M., Wang, W.: Tartanground: A large-scale dataset for ground robot perception and navigation. arXiv preprint arXiv:2505.10696 (2025)

work page arXiv 2025
[28]

Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

2023
[29]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Shao, J., Yang, Y., Zhou, H., Zhang, Y., Shen, Y., Guizilini, V., Wang, Y., Poggi, M., Liao, Y.: Learning temporally consistent video depth from video diffusion pri- ors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22841–22852 (2025)

2025
[30]

arXiv preprint arXiv:2504.15009 (2025)

Song, W., Jiang, H., Yang, Z., Quan, R., Yang, Y.: Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009 (2025)

work page arXiv 2025
[31]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025)

2025
[32]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan Team, Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4909–4916. IEEE (2020)

2020
[34]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wang, Y., Shi, M., Li, J., Huang, Z., Cao, Z., Zhang, J., Xian, K., Lin, G.: Neural video depth stabilizer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9466–9476 (2023) 18 X. He et al

2023
[35]

arXiv preprint arXiv:2504.02160 (2025)

Wu, S., Huang, M., Wu, W., Cheng, Y., Ding, F., He, Q.: Less-to-more gener- alization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160 (2025)

work page arXiv 2025
[36]

In: International Conference on Learning Representations

Yang, H., Huang, D., Yin, W., Shen, C., Liu, H., He, X., Lin, B., Ouyang, W., He, T.: Depth any video with scalable synthetic data. In: International Conference on Learning Representations. vol. 2025, pp. 97335–97349 (2025)

2025
[37]

Advances in Neural Information Processing Systems37, 21875–21911 (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024)

2024
[38]

arXiv preprint arXiv:2506.04216 (2025)

Ye, Z., He, X., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Chen, Q., Luo, W.: Unic: Unified in-context video editing. arXiv preprint arXiv:2506.04216 (2025)

work page arXiv 2025
[39]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12–22 (2023)

2023
[40]

In: Proceedings of the IEEE/CVF international conference on computer vision

Zhang, H., Shen, C., Li, Y., Cao, Y., Liu, Y., Yan, Y.: Exploiting temporal con- sistency for real-time video depth estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1725–1734 (2019)

2019
[41]

ACM Transactions on Graphics (ToG)40(4), 1–12 (2021)

Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. ACM Transactions on Graphics (ToG)40(4), 1–12 (2021)

2021
[42]

arXiv preprint arXiv:2509.12201 (2025)

Zhou, Y., Wang, Y., Zhou, J., Chang, W., Guo, H., Li, Z., Ma, K., Li, X., Wang, Y., Zhu, H., et al.: Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201 (2025)

work page arXiv 2025

[1] [1]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 16 X. He et al

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Virtual KITTI 2

Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[3] [3]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video depth anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22831–22840 (2025)

2025

[4] [4]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner,M.:Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017)

2017

[5] [5]

The international journal of robotics research32(11), 1231–1237 (2013)

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The international journal of robotics research32(11), 1231–1237 (2013)

2013

[6] [6]

Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025

Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025)

work page arXiv 2025

[7] [7]

arXiv preprint arXiv:2506.04213 (2025)

He, X., Liu, Q., Ye, Z., Ye, W., Wang, Q., Wang, X., Chen, Q., Wan, P., Zhang, D., Gai, K.: Fulldit2: Efficient in-context conditioning for video diffusion transformers. arXiv preprint arXiv:2506.04213 (2025)

work page arXiv 2025

[8] [8]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., Shan, Y.: Depthcrafter: Generating consistent long depth sequences for open-world videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2005–2015 (2025)

2005

[9] [9]

In-context LoRA for diffusion transformers

Huang,L.,Wang,W.,Wu,Z.F.,Shi,Y.,Dou,H.,Liang,C.,Feng,Y.,Liu,Y.,Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

work page arXiv 2024

[10] [10]

arXiv preprint arXiv:2503.19907 (2025)

Ju, X., Ye, W., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Xu, Q.: Fulldit: Multi-task video generative foundation model with full attention. arXiv preprint arXiv:2503.19907 (2025)

work page arXiv 2025

[11] [11]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Re- purposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 9492–9502 (2024)

2024

[12] [12]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Kopf, J., Rong, X., Huang, J.B.: Robust consistent video depth estimation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 1611–1621 (2021)

2021

[13] [13]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Kuang, Z., Zhang, T., Zhang, K., Tan, H., Bi, S., Hu, Y., Xu, Z., Hasan, M., Wetzstein, G., Luan, F.: Buffer anytime: Zero-shot video depth and normal from image priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17660–17670 (2025)

2025

[14] [14]

arXiv preprint arXiv:2504.07960 (2025)

Li, Z.Y., Du, R., Yan, J., Zhuo, L., Li, Z., Gao, P., Ma, Z., Cheng, M.M.: Vi- sualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960 (2025)

work page arXiv 2025

[15] [15]

arXiv preprint arXiv:2502.01061 (2025)

Lin, G., Jiang, J., Yang, J., Zheng, Z., Liang, C.: Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061 (2025)

work page arXiv 2025

[16] [16]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

ACM Transactions on Graphics (ToG)39(4), 71–1 (2020)

Luo, X., Huang, J.B., Szeliski, R., Matzen, K., Kopf, J.: Consistent video depth estimation. ACM Transactions on Graphics (ToG)39(4), 71–1 (2020)

2020

[18] [18]

arXiv preprint arXiv:2507.16869 (2025) ICDepth 17

Ma, Y., Feng, K., Hu, Z., Wang, X., Wang, Y., Zheng, M., He, X., Zhu, C., Liu, H., He, Y., et al.: Controllable video generation: A survey. arXiv preprint arXiv:2507.16869 (2025) ICDepth 17

work page arXiv 2025

[19] [19]

arXiv preprint arXiv:2506.04590 (2025)

Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D.J., Xing, J., Zhang, Y., Yang, A., Wang, Z., Chen, Q.: Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590 (2025)

work page arXiv 2025

[20] [20]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024)

2024

[21] [21]

arXiv preprint arXiv:2602.05551 (2026)

Ma, Y., Wang, Z., Ren, T., Zheng, M., Liu, H., Guo, J., Fong, M., Xue, Y., Zhao, Z., Schindler, K., et al.: Fastvmt: Eliminating redundancy in video motion transfer. arXiv preprint arXiv:2602.05551 (2026)

work page arXiv 2026

[22] [22]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040–4048 (2016)

2016

[23] [23]

arXiv preprint arXiv:2504.16915 (2025)

Mou, C., Wu, Y., Wu, W., Guo, Z., Zhang, P., Cheng, Y., Luo, Y., Ding, F., Zhang, S., Li, X., et al.: Dreamo: A unified framework for image customization. arXiv preprint arXiv:2504.16915 (2025)

work page arXiv 2025

[24] [24]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

2023

[25] [25]

1-Fun-1.3B-Control(2024)

PAI, A.: Wan2.1-fun-1.3b-control.https://huggingface.co/alibaba-pai/Wan2. 1-Fun-1.3B-Control(2024)

2024

[26] [26]

In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Palazzolo, E., Behley, J., Lottes, P., Giguere, P., Stachniss, C.: Refusion: 3d recon- struction in dynamic environments for rgb-d cameras exploiting residuals. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7855–7862. IEEE (2019)

2019

[27] [27]

arXiv preprint arXiv:2505.10696 (2025)

Patel, M., Yang, F., Qiu, Y., Cadena, C., Scherer, S., Hutter, M., Wang, W.: Tartanground: A large-scale dataset for ground robot perception and navigation. arXiv preprint arXiv:2505.10696 (2025)

work page arXiv 2025

[28] [28]

Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

2023

[29] [29]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Shao, J., Yang, Y., Zhou, H., Zhang, Y., Shen, Y., Guizilini, V., Wang, Y., Poggi, M., Liao, Y.: Learning temporally consistent video depth from video diffusion pri- ors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22841–22852 (2025)

2025

[30] [30]

arXiv preprint arXiv:2504.15009 (2025)

Song, W., Jiang, H., Yang, Z., Quan, R., Yang, Y.: Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009 (2025)

work page arXiv 2025

[31] [31]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025)

2025

[32] [32]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan Team, Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4909–4916. IEEE (2020)

2020

[34] [34]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wang, Y., Shi, M., Li, J., Huang, Z., Cao, Z., Zhang, J., Xian, K., Lin, G.: Neural video depth stabilizer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9466–9476 (2023) 18 X. He et al

2023

[35] [35]

arXiv preprint arXiv:2504.02160 (2025)

Wu, S., Huang, M., Wu, W., Cheng, Y., Ding, F., He, Q.: Less-to-more gener- alization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160 (2025)

work page arXiv 2025

[36] [36]

In: International Conference on Learning Representations

Yang, H., Huang, D., Yin, W., Shen, C., Liu, H., He, X., Lin, B., Ouyang, W., He, T.: Depth any video with scalable synthetic data. In: International Conference on Learning Representations. vol. 2025, pp. 97335–97349 (2025)

2025

[37] [37]

Advances in Neural Information Processing Systems37, 21875–21911 (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024)

2024

[38] [38]

arXiv preprint arXiv:2506.04216 (2025)

Ye, Z., He, X., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Chen, Q., Luo, W.: Unic: Unified in-context video editing. arXiv preprint arXiv:2506.04216 (2025)

work page arXiv 2025

[39] [39]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12–22 (2023)

2023

[40] [40]

In: Proceedings of the IEEE/CVF international conference on computer vision

Zhang, H., Shen, C., Li, Y., Cao, Y., Liu, Y., Yan, Y.: Exploiting temporal con- sistency for real-time video depth estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1725–1734 (2019)

2019

[41] [41]

ACM Transactions on Graphics (ToG)40(4), 1–12 (2021)

Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. ACM Transactions on Graphics (ToG)40(4), 1–12 (2021)

2021

[42] [42]

arXiv preprint arXiv:2509.12201 (2025)

Zhou, Y., Wang, Y., Zhou, J., Chang, W., Guo, H., Li, Z., Ma, K., Li, X., Wang, Y., Zhu, H., et al.: Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201 (2025)

work page arXiv 2025