pith. sign in

arxiv: 2606.13676 · v1 · pith:LNNP63KWnew · submitted 2026-06-11 · 💻 cs.CV

Modality Forcing for Scalable Spatial Generation

Pith reviewed 2026-06-27 06:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords modality forcingdiffusion transformerdepth estimationjoint image depth generationsparse depth dataspatial perceptionimage generation pretrainingconditional generation
0
0 comments X

The pith

Modality Forcing assigns separate noise levels per modality so a single diffusion transformer generates images and depth jointly or conditionally from sparse data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that assigning separate noise levels to image and depth during training, along with per-modality decoders, lets one DiT model handle both modalities in any permutation. This setup trains effectively on sparse real-world depth measurements instead of dense labels or elaborate procedures used before. Experiments scaling models from 370M to 3.3B parameters reveal that larger models pretrained on more image data deliver steadily better depth predictions. The best model matches specialized monocular depth estimators while cutting absolute relative error by 57 percent against earlier joint generative baselines. These outcomes indicate that standard image generation can act as scalable pre-training for tasks that require understanding geometry and space.

Core claim

Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evid

What carries the argument

Modality Forcing, which assigns separate noise levels to each modality during diffusion training and pairs them with per-modality decoders to support mixed sparse data.

If this is right

  • Joint and conditional image-depth generation works in every ordering or subset without retraining.
  • Training succeeds on sparse real-world depth instead of requiring dense ground truth.
  • Depth accuracy rises as model capacity and image pretraining data increase.
  • The resulting depth estimates reach error levels comparable to dedicated monocular estimators.
  • Image generation pretraining supplies a route to generalizable spatial perception without modality-specific engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separate-noise mechanism could extend to other geometric outputs such as surface normals or semantic labels using the same sparse supervision pattern.
  • If the scaling trend holds, depth estimation could follow the same data and compute curves already observed in image and language models.
  • Practitioners might reduce reliance on expensive dense depth capture by fine-tuning large image generators instead.
  • The approach raises the question of whether other perception tasks benefit when image generation remains the dominant pretraining signal.

Load-bearing premise

Separate noise levels per modality plus dedicated decoders will let the shared model extract accurate depth from sparse measurements without introducing biases that favor one modality over the other.

What would settle it

Measure depth prediction error on held-out real scenes while scaling the same training recipe from 370M to 3.3B parameters on fixed image data; if accuracy stops improving once model size grows, the scalability claim would fail.

Figures

Figures reproduced from arXiv: 2606.13676 by Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park.

Figure 1
Figure 1. Figure 1: We present Modality Forcing, a post-training recipe to extract spatial priors from text-to [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Modality Forcing generates rich RGB-Depth from text prompts. Unprojecting the points to [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Modality Forcing is a recipe to post-train image-generation models for depth prediction. We [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Scaling experiments. Depth accuracy (δ1, ↑, bottom) and AbsRel (↓, top) by T2I model size. Each line represents a T2I pre-training dataset size (none, 128M, 640M, 1.92B). Train￾ing larger T2I models on more image data yields better depth performance. 4 Results We evaluate Modality Forcing across joint and conditional RGB-Depth tasks. First, we train a suite of T2I models from scratch to study how depth gen… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative image-to-depth generation results. Modality Forcing generates robust and sharp [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative joint image-depth generation results. Modality Forcing samples RGB and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Modality Forcing inference-time analysis. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The denoising trajectory across depth and rgb dictates the strength of modality conditioning. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
read the original abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Modality Forcing, a post-training recipe for a single DiT that enables joint/conditional image-depth generation in any permutation via per-modality noise levels and per-modality decoders. This allows training on sparse real-world depth data. Scaling experiments train models from 370M to 3.3B parameters from scratch on image data, showing larger models yield better depth; the strongest model is competitive with monocular SOTA depth estimators and reduces AbsRel by 57% versus prior joint generative models, supporting image generation as scalable pre-training for spatial perception.

Significance. If reproducible, the approach offers a simpler alternative to prior T2I adaptations for depth that avoids dense supervision and complex recipes, while demonstrating clear scaling benefits. The reported gains over joint baselines and competitiveness with specialized monocular estimators would strengthen the case for generative pre-training in perception tasks.

major comments (2)
  1. [Method (implied in abstract description of noise levels and decoders)] The central claim that Modality Forcing enables accurate depth from sparse data rests on the per-modality noise schedules and decoders; without explicit equations or pseudocode showing how noise levels are sampled independently per modality during the forward process and how the decoders are conditioned, it is difficult to verify that modality-specific biases are avoided.
  2. [Abstract (quantitative claims)] The 57% AbsRel reduction and competitiveness with SOTA monocular estimators are load-bearing for the scalability conclusion, yet the abstract does not specify the exact test sets, number of runs, or whether the comparison models were re-trained under identical data regimes; this leaves open whether the gains are due to Modality Forcing or differences in training data scale.
minor comments (1)
  1. [Abstract] The project page link is useful, but all quantitative tables and scaling plots should appear in the main paper with clear captions indicating training data sources and evaluation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [Method (implied in abstract description of noise levels and decoders)] The central claim that Modality Forcing enables accurate depth from sparse data rests on the per-modality noise schedules and decoders; without explicit equations or pseudocode showing how noise levels are sampled independently per modality during the forward process and how the decoders are conditioned, it is difficult to verify that modality-specific biases are avoided.

    Authors: We agree that explicit mathematical details are required for verification. The revised manuscript will add the forward-process equations showing independent per-modality noise sampling (i.e., separate t_image and t_depth drawn from the diffusion schedule) together with pseudocode for the training procedure and the conditioning of the per-modality decoders. These additions will make clear how separate noise levels and dedicated decoders avoid cross-modality bias while enabling training on sparse depth. revision: yes

  2. Referee: [Abstract (quantitative claims)] The 57% AbsRel reduction and competitiveness with SOTA monocular estimators are load-bearing for the scalability conclusion, yet the abstract does not specify the exact test sets, number of runs, or whether the comparison models were re-trained under identical data regimes; this leaves open whether the gains are due to Modality Forcing or differences in training data scale.

    Authors: The full paper reports results on NYUv2 and KITTI (standard monocular depth benchmarks) and states that the 57% AbsRel reduction is measured against published joint generative baselines on the same splits. Our scaling experiments train all DiT variants from scratch on identical image data, isolating model size as the variable. To address the abstract concern we will expand it to name the test sets and note that joint baselines follow their original published protocols. We cannot re-train every prior model under our exact regime, but the controlled scaling study within our framework supports that larger image-pretrained models improve depth accuracy. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central contribution is the empirical demonstration that Modality Forcing (separate per-modality noise schedules plus per-modality decoders) permits joint/conditional image-depth generation from sparse real-world depth data while inheriting T2I scaling behavior. All reported results are measured against external monocular depth SOTA baselines and prior joint generative models; no equations, fitted parameters, or self-citations are shown to define the target quantities by construction. The derivation chain therefore remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; the approach rests on the domain assumption that T2I models already encode spatial priors.

axioms (1)
  • domain assumption Text-to-image models contain rich spatial priors including geometry, perspective, and relative scale.
    Stated explicitly in the first sentence of the abstract as the foundation for adapting T2I models to depth.

pith-pipeline@v0.9.1-grok · 5765 in / 1347 out tokens · 26174 ms · 2026-06-27T06:43:16.592712+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 3 canonical work pages

  1. [1]

    Latent forcing: Reordering the diffusion trajectory for pixel-space image generation,

    Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation,

  2. [2]

    URLhttps://arxiv.org/abs/2602.11401

  3. [3]

    Barron, Ben Mildenhall, Dor Verbin, Pratul P

    Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields.CVPR, 2022

  4. [4]

    Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023. URLhttps://arxiv.org/ abs/2302.12288

  5. [5]

    Richter, and Vladlen Koltun

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second,

  6. [6]

    URLhttps://arxiv.org/abs/2410.02073

  7. [8]

    URLhttps://arxiv.org/abs/2005.14165

  8. [9]

    Jointdit: Enhancing rgb- depth joint modeling with diffusion transformers

    Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, and Tae-Hyun Oh. Jointdit: Enhancing rgb- depth joint modeling with diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 25261–25271, October 2025

  9. [10]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

  10. [11]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining, 2023

    Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining, 2023. URLhttps://arxiv.org/abs/2304.09151

  11. [12]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

  12. [13]

    Scalingrectifiedflowtransformers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, KyleLacey, AlexGoodwin, YannikMarek, andRobinRombach. Scalingrectifiedflowtransformers for high-resolution image synthesis, 2024. URLhttps://arxiv.org/abs/2403.03206

  13. [14]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image, 2024

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image, 2024. URLhttps://arxiv.org/abs/2403.12013

  14. [15]

    Image generators are generalist vision learners.arXiv preprint arXiv:2604.20329, 2026

    Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, and Radu S...

  15. [16]

    Vision meets robotics: The kitti dataset.Int

    A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The kitti dataset.Int. J. Rob. Res., 32(11):1231–1237, September 2013. ISSN 0278-3649. doi: 10.1177/0278364913491297. URLhttps://doi.org/10.1177/0278364913491297

  16. [17]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction, 2025

    Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction, 2025. URLhttps://arxiv.org/abs/2409.18124

  17. [18]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  18. [19]

    Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything.arXiv:2304.02643, 2023

  19. [20]

    Orchid: Image latent diffusion for joint appearance and geometry generation, 2025

    Akshay Krishnan, Xinchen Yan, Vincent Casser, and Abhijit Kundu. Orchid: Image latent diffusion for joint appearance and geometry generation, 2025. URLhttps://arxiv.org/abs/ 2501.13087

  20. [21]

    FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

  21. [22]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URLhttps://arxiv.org/abs/2406.09756. 12

  22. [23]

    Back to basics: Let denoising generative models denoise, 2026

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise, 2026. URLhttps://arxiv.org/abs/2511.13720

  23. [24]

    A simple approach to unifying diffusion-based conditional generation

    Xirui Li, Charles Herrmann, Kelvin CK Chan, Yinxiao Li, Deqing Sun, and Ming-Hsuan Yang. A simple approach to unifying diffusion-based conditional generation. InThe Thirteenth Interna- tional Conference on Learning Representations, 2025

  24. [25]

    Learning without forgetting, 2017

    Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URLhttps://arxiv.org/ abs/1606.09282

  25. [26]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

  26. [27]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khali- dov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick L...

  27. [28]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  28. [29]

    UniDepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  29. [30]

    High- resolution image synthesis with latent diffusion models, 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv.org/abs/ 2112.10752

  30. [31]

    Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2538–2547, 2017. doi: 10.1109/CVPR.2017.272

  31. [32]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors,Computer Vision – ECCV 2012, pages 746–760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33715-4

  32. [33]

    Ldm3d: Latent diffusion model for 3d, 2023

    Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. Ldm3d: Latent diffusion model for 3d, 2023. URLhttps://arxiv.org/abs/2305.10853

  33. [34]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URLhttps://arxiv.org/abs/ 2104.09864

  34. [35]

    The bitter lesson, 2019

    Richard Sutton. The bitter lesson, 2019. URLhttp://www.incompleteideas.net/IncIdeas/ BitterLesson.html

  35. [36]

    Sam 3d: 3dfy anything in images

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images. 2025. URL h...

  36. [37]

    Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

    Z-Image Team. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

  37. [39]

    URLhttp://arxiv.org/abs/1908.00463

  38. [40]

    Wan: Open and advanced large-scale video generative models,

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  39. [41]

    URLhttps://arxiv.org/abs/2503.20314

  40. [42]

    Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht. Vggt-ω,

  41. [43]

    URLhttps://arxiv.org/abs/2605.15195

  42. [44]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025

  43. [45]

    Moge-2: Accurate monocular geometry with metric scale and sharp details, 2025

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details, 2025. URLhttps://arxiv.org/abs/2507.02546

  44. [46]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URLhttps://arxiv.org/abs/2312.14132

  45. [47]

    Williams and David Zipser

    Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks.Neural Computation, 1(2):270–280, 1989. doi: 10.1162/neco.1989.1.2.270

  46. [48]

    Pixel-perfect depth with semantics-prompted diffusion transformers.arXiv preprint arXiv:2510.07316, 2025

    Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers.arXiv preprint arXiv:2510.07316, 2025

  47. [49]

    Context unrolling in omni models,

    Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, and Haoqi Fan. Context unrolling in omni models,

  48. [50]

    URLhttps://arxiv.org/abs/2604.21921

  49. [51]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InCVPR, 2024

  50. [52]

    Depth anything v2.arXiv:2406.09414, 2024

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.arXiv:2406.09414, 2024

  51. [53]

    Jointnet: Extending text-to-image diffusion for dense distribution modeling, 2023

    Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, and Yao Yao. Jointnet: Extending text-to-image diffusion for dense distribution modeling, 2023. URLhttps://arxiv.org/abs/2310.06347. 14