
arxiv: 2512.01030 · v3 · pith:7V23G2U3new · submitted 2025-11-30 · 💻 cs.CV

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Pith reviewed 2026-05-21 18:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · surface normal prediction · diffusion models · geometric dense prediction · deterministic adaptation · rectified flow refinement · local continuity module

The pith

Lotus-2 adapts pre-trained diffusion models into a two-stage deterministic system that achieves state-of-the-art monocular depth estimation with only 59K training samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that the geometric knowledge already present in large-scale pre-trained diffusion models can be extracted and applied to precise pixel-wise tasks without relying on the models' original stochastic sampling process. It introduces a two-stage protocol: a core predictor that makes a single deterministic pass to build coherent global structure, followed by a constrained refinement step that sharpens local details while staying inside the manifold defined by the first stage. This combination produces accurate depth maps and surface normals from a single image. A sympathetic reader would care because the approach reaches top performance on standard benchmarks while using less than one percent of the data volumes typical for supervised geometric models, suggesting a route to high-quality 3D reasoning that bypasses the need for ever-larger labeled datasets.

Core claim

Lotus-2 is a two-stage deterministic framework in which the first stage employs a single-step predictor with a clean-data objective and a lightweight local continuity module to produce globally coherent geometry free of grid artifacts, and the second stage performs constrained multi-step rectified-flow refinement inside the manifold of the core predictor to enhance fine-grained details, thereby turning the pre-trained generative prior into a stable deterministic world prior for geometric dense prediction.

What carries the argument

The two-stage deterministic adaptation consisting of a single-step core predictor with local continuity module plus constrained rectified-flow refinement that keeps all outputs inside the manifold defined by the core predictor.
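
The two-stage flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `core_predictor`, `make_oracle_velocity`, and `detail_sharpener` are hypothetical names, latents are plain lists of floats, and the velocity field is an oracle that already knows the fine target (a trained network would have to estimate it). The sketch only shows that a single deterministic pass followed by noise-free Euler steps from t = 1 to t = 0 gives the same output on every run.

```python
from typing import Callable, List

Vector = List[float]


def core_predictor(z_x: Vector) -> Vector:
    """Stage 1 stand-in: one deterministic pass from image latent to a coarse prediction.

    Hypothetical toy: a 3-tap moving average, so the result is smooth but lacks detail.
    """
    n = len(z_x)
    out = []
    for i in range(n):
        window = z_x[max(0, i - 1): min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


def make_oracle_velocity(z_fine: Vector) -> Callable[[Vector, float], Vector]:
    """Toy velocity field for the straight path z_t = (1 - t) * z_fine + t * z_coarse.

    On that path (z_t - z_fine) / t equals z_coarse - z_fine, the rectified-flow target.
    This oracle knows z_fine in advance; it is an assumption made for illustration.
    """
    def velocity(z_t: Vector, t: float) -> Vector:
        return [(a - b) / t for a, b in zip(z_t, z_fine)]
    return velocity


def detail_sharpener(z_coarse: Vector,
                     velocity: Callable[[Vector, float], Vector],
                     num_steps: int = 4) -> Vector:
    """Stage 2 stand-in: Euler integration from t = 1 down to t = 0, with no noise."""
    z_t = list(z_coarse)          # the trajectory starts at the Stage-1 output
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt          # t takes the values 1, 1 - dt, ..., dt (never 0)
        v = velocity(z_t, t)
        z_t = [z - dt * v_i for z, v_i in zip(z_t, v)]
    return z_t


def two_stage_inference(z_x: Vector,
                        velocity: Callable[[Vector, float], Vector],
                        num_steps: int = 4) -> Vector:
    return detail_sharpener(core_predictor(z_x), velocity, num_steps)


z_x = [0.0, 1.0, 0.0, 1.0, 0.0]       # toy "image latent"
z_fine = [0.1, 0.9, 0.1, 0.9, 0.1]    # toy fine-grained target
velocity = make_oracle_velocity(z_fine)
first = two_stage_inference(z_x, velocity)
second = two_stage_inference(z_x, velocity)
print(first == second)  # repeated runs agree because no random noise is drawn
```

Because the trajectory is initialized from the coarse prediction rather than from sampled noise, there is no seed to vary; that is the property the paper's deterministic framing relies on.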

If this is right

  • Monocular depth estimation can reach new state-of-the-art accuracy while training on far fewer images than current large-scale supervised approaches.
  • Surface normal prediction can remain competitive with existing methods when the same minimal-data regime is used.
  • Diffusion models can function as deterministic world priors rather than purely stochastic generators for tasks that demand stable, accurate output.
  • Geometric dense prediction can shift from data-scale scaling to effective extraction of priors already learned during image-text pre-training.
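
To put the data-efficiency bullet in numbers: the Figure 1 caption states that the 59K samples are 0.66% of what MoGe-2 used. The baseline size implied by those two figures is a back-of-the-envelope inference, not a number given on this page, and the variable names below are illustrative.

```python
lotus2_samples = 59_000      # "59K training samples" from the abstract
fraction_of_moge2 = 0.0066   # "0.66%" from the Figure 1 caption

# Implied size of the comparison training set (an inference, not a reported value).
implied_moge2_samples = lotus2_samples / fraction_of_moge2
print(f"{implied_moge2_samples / 1e6:.1f}M")  # prints 8.9M
```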

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation protocol might be tested on other single-image dense tasks such as semantic segmentation or surface reconstruction to see whether the deterministic prior extraction generalizes.
  • If the method succeeds across diverse scene types, it would imply that the main remaining bottleneck in geometric vision is prior extraction rather than raw data volume.
  • One could check whether the deterministic outputs preserve semantic consistency alongside geometric accuracy on images containing rare objects or unusual lighting conditions.

Load-bearing premise

The pre-trained diffusion model already encodes stable, transferable geometric knowledge that can be extracted via a deterministic single-step predictor plus constrained refinement without introducing new inconsistencies or losing the prior's benefits.

What would settle it

A controlled experiment in which Lotus-2 is compared directly against a standard discriminative regression model trained on exactly the same 59K samples; if the two-stage diffusion adaptation shows no accuracy gain on held-out test sets such as NYUv2 depth or KITTI, the value of the generative prior would be called into question.
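
A matched-data comparison of this kind would need a shared scoring function. The sketch below computes two metrics that are standard in monocular depth evaluation, absolute relative error (AbsRel) and the δ1 threshold accuracy; this page does not say which metrics the paper reports, so the choice is an assumption, and `depth_metrics` is a hypothetical helper. Scale-and-shift alignment, which affine-invariant depth evaluation normally applies first, is omitted.

```python
from typing import Sequence, Tuple


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Tuple[float, float]:
    """Return (AbsRel, delta1) over pixels whose ground-truth depth is positive."""
    eps = 1e-6
    # Keep only valid pixels; clamp predictions so the ratio below is defined.
    pairs = [(max(p, eps), g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / len(pairs)
    return abs_rel, delta1


gt = [1.0, 2.0, 4.0, 0.0]      # the last pixel has no valid depth and is skipped
pred = [1.1, 2.0, 6.0, 3.0]
abs_rel, delta1 = depth_metrics(pred, gt)
print(round(abs_rel, 3), round(delta1, 3))  # 0.2 0.667
```

Running the same function on both models' held-out predictions would give the side-by-side numbers the proposed experiment calls for.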

Figures

Figures reproduced from arXiv: 2512.01030 by Haodong Li, Jing He, Mingzhi Sheng, Ying-Cong Chen.

Figure 1. We present Lotus-2, a two-stage deterministic framework for monocular geometric dense prediction. Our method leverages a pre-trained generative model as a deterministic world prior to achieve new state-of-the-art accuracy while requiring remarkably little data (trained on only 0.66% of the samples used by MoGe-2 [1]). The decoupled, two-stage design ensures both structurally correct inference and high-fidel…
Figure 2. Adaptation protocol of the stochastic formulation (Stochastic-DA). This framework models a conditional generative flow by estimating the velocity field from a random noise latent ϵ to the annotation latent z^y, conditioned on the image latent z^x. The target velocity vector is v = ϵ − z^y. This reliance on noise initialization inherently leads to non-deterministic variance in deterministic geometri…
Figure 3. Adaptation protocol of the deterministic formulation (Deterministic-DA). This architecture shifts the paradigm to a noise-free rectified-flow formulation. It directly estimates the velocity field from the source image latent z^x to the target annotation latent z^y, where the target velocity vector is v = z^x − z^y. This deterministic setup ensures the stability and structural consistency for geometric dens…
Figure 4. Comparison between stochastic and deterministic formulation. The figure visualizes the iterative inference process from t = 1 to t = 0. The stochastic formulation (Stochastic-DA) exhibits significant structural variance: distinct random noise initializations yield inconsistent geometric structures across the entire inference process (highlighted in red circles). While averaging is employed to mitigate the…
Figure 5. Adaptation protocol of the core predictor in Lotus-2. It adopts a single-step formulation (t = 1) with clean-data prediction to efficiently exploit the world priors of the pre-trained FLUX model, where the input latent z_t is equivalent to the image latent z^x, i.e., z_t = z_1 = z^x according to Eq. 11. In addition, there is a pair of Pack-Unpack operations around the diffusion Transformer f_θ inherited from FLUX,…
Figure 6. Comparisons among various training time-steps and data scales evaluated on NYUv2 in depth estimation. During inference, if the number of training time-steps T > 50, the inference time-steps are fixed at T_inf = 50; otherwise, T_inf = T. The results show that, when adapting the pre-trained rectified-flow model to dense prediction, reducing the number of training time-steps leads to improved performance. In pa…
Figure 7. Predictions under Different Model Parameterization Types. Red circles highlight regions with obvious appearance artifacts when residual prediction is used. In contrast, clean-data prediction produces more accurate predictions without interference from image appearance. […] the term z^x in Eq. 13 attempts to remove these appearance components during inference, however, imperfect prediction makes this removal un…
Figure 9. The training pipeline of detail sharpener. Starting from a structurally correct but coarse annotation predicted by the core predictor, the detail sharpener learns the transition from coarse to fine-grained annotation via a constrained multi-step rectified-flow within the manifold defined by the core predictor. [Diagram labels: image x, encoder ℰ, latent z^x, Transformer f, LCM, Transformer g, t ∈ [0, 1], prediction ŷ, decoder 𝒟, Detail …]
Figure 10. The inference pipeline of Lotus-2. It is a decoupled, two-stage deterministic pipeline that bridges the regression and geometric refinement. First, the core predictor produces stable and structurally consistent prediction via single-step regression. The detail sharpener then employs a constrained multi-step rectified-flow formulation for iterative refinement without any stochastic noise. The refinement u…
Figure 12. Spectral analysis of high-fidelity refinement. This plot compares the average log-power (y-axis) across spatial frequencies (x-axis) on NYUv2 dataset to validate the contribution of detail sharpener. The decay of the core predictor (w/o sharpener) curve confirms its coarse nature, while the Lotus-2 (w/ sharpener) curve shows recovery of high-frequency power. To rigorously quantify the contribution of th…
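
The spectral comparison in Figure 12 can be approximated with a radially binned power spectrum. The sketch below is one plausible reading of "average log-power across spatial frequencies", not the paper's exact procedure: `radial_log_power` is a hypothetical helper, it averages the log of the power inside each integer radius bin, and it uses a naive discrete Fourier transform that is only practical for tiny inputs. The toy "coarse" and "sharp" maps are made up for illustration.

```python
import cmath
import math
from typing import Dict, List


def radial_log_power(image: List[List[float]]) -> Dict[int, float]:
    """Average log-power per integer radial-frequency bin of a 2D DFT."""
    h, w = len(image), len(image[0])
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for u in range(h):
        for v in range(w):
            # Naive 2D DFT coefficient at frequency index (u, v).
            coeff = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    coeff += image[y][x] * cmath.exp(1j * angle)
            # Map indices to signed frequencies so the radius is symmetric.
            fu = u if u <= h // 2 else u - h
            fv = v if v <= w // 2 else v - w
            radius = int(round(math.hypot(fu, fv)))
            log_power = math.log(abs(coeff) ** 2 + 1e-12)  # epsilon avoids log(0)
            sums[radius] = sums.get(radius, 0.0) + log_power
            counts[radius] = counts.get(radius, 0) + 1
    return {r: sums[r] / counts[r] for r in sorted(sums)}


size = 8
# Toy "coarse" map: a smooth horizontal ramp.
coarse = [[x / size for x in range(size)] for _ in range(size)]
# Toy "sharp" map: the same ramp plus a fine checkerboard pattern.
sharp = [[coarse[y][x] + 0.1 * (-1) ** (x + y) for x in range(size)]
         for y in range(size)]

coarse_spectrum = radial_log_power(coarse)
sharp_spectrum = radial_log_power(sharp)
highest = max(sharp_spectrum)  # highest radial-frequency bin
print(sharp_spectrum[highest] > coarse_spectrum[highest])  # True
```

For real depth maps a fast Fourier transform from a numerical library would replace the quadruple loop; the binning logic stays the same.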
Original abstract

Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data, as well as by limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
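
The formulations contrasted in the abstract and in the captions of Figures 2, 3, 5 and 7 can be written side by side. The target velocities and the identity z_1 = z^x come from those captions; the linear interpolation path is the standard rectified-flow choice and is assumed here, and the last two lines are a guess at how residual and clean-data prediction differ at t = 1, since the paper's Eq. 11 and Eq. 13 are not reproduced on this page.

```latex
\begin{align*}
% Stochastic-DA (Figure 2): flow from a noise latent to the annotation latent
z_t &= (1 - t)\, z^{y} + t\, \epsilon, & v &= \frac{\mathrm{d} z_t}{\mathrm{d} t} = \epsilon - z^{y} \\
% Deterministic-DA (Figure 3): flow from the image latent to the annotation latent
z_t &= (1 - t)\, z^{y} + t\, z^{x}, & v &= \frac{\mathrm{d} z_t}{\mathrm{d} t} = z^{x} - z^{y}, \qquad z_1 = z^{x} \\
% Single step at t = 1 (assumed forms)
\text{residual prediction:}\quad \hat{z}^{y} &= z^{x} - \hat{v}_\theta(z^{x}), &
\text{clean-data prediction:}\quad \hat{z}^{y} &= f_\theta(z^{x})
\end{align*}
```

Under this reading, the residual form carries z^x explicitly in its output, which is one way appearance could leak into the prediction when the estimated velocity is imperfect; the clean-data form regresses the annotation latent directly.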

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Lotus-2, a two-stage deterministic framework for adapting pre-trained diffusion models to geometric dense prediction tasks including monocular depth estimation and surface normal prediction. Stage 1 employs a single-step deterministic core predictor with a clean-data objective and a lightweight local continuity module (LCM) to produce globally coherent outputs without grid artifacts. Stage 2 applies a detail sharpener that performs constrained multi-step rectified-flow refinement strictly inside the manifold defined by the core predictor output. The central empirical claim is that this protocol yields new state-of-the-art depth estimation results and competitive normal prediction using only 59K training samples, less than 1% of existing large-scale datasets.

Significance. If the reported performance gains are robustly supported by the experiments, the work would be significant for showing that generative diffusion priors can be converted into stable deterministic geometric predictors with minimal task-specific data. This offers a concrete adaptation protocol that could reduce dependence on massive labeled geometric datasets while preserving the world knowledge encoded in large-scale image-text pre-training.

major comments (3)
  1. [§3.2] §3.2 (Detail Sharpener): The claim that refinement occurs 'strictly inside the manifold defined by the core predictor' is load-bearing for attributing gains to the diffusion prior rather than to additional supervised fitting. The manuscript does not specify the concrete enforcement mechanism (e.g., a projection operator, a consistency loss term, or a hard constraint on the flow trajectory), nor does it report a quantitative manifold-consistency metric between stage-1 and stage-2 outputs. Without such evidence, it remains possible that the refinement step simply relaxes the stage-1 prediction, undermining the central narrative.
  2. [Table 2] Table 2 (depth estimation results): The SOTA claim with 59K samples requires explicit side-by-side comparison against the strongest baselines trained on the same 59K subset (or with matched data budgets) rather than only against models trained on full large-scale datasets. If the table only reports numbers against full-data methods, the efficiency argument is not fully substantiated.
  3. [§4.3] §4.3 (Ablations): The incremental contribution of the LCM and the constrained refinement must be quantified against a direct single-stage fine-tuning baseline of the same pre-trained diffusion backbone. Current ablations appear to compare only internal variants; without the external baseline, it is unclear whether the two-stage protocol itself, rather than the choice of backbone, drives the reported gains.
minor comments (3)
  1. [Abstract] The abstract states 'highly competitive surface normal prediction' without naming the primary metrics (e.g., mean angular error) or the evaluation datasets; adding these specifics would improve clarity.
  2. [§3.2] Notation for the rectified-flow velocity field and the constraint operator should be introduced with an equation in §3.2 rather than only in prose.
  3. [Figure 3] Figure 3 (qualitative results) would benefit from zoomed insets highlighting the fine-grained geometry improvements claimed for the detail sharpener.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify key aspects of our two-stage adaptation framework. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements where they strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Detail Sharpener): The claim that refinement occurs 'strictly inside the manifold defined by the core predictor' is load-bearing for attributing gains to the diffusion prior rather than to additional supervised fitting. The manuscript does not specify the concrete enforcement mechanism (e.g., a projection operator, a consistency loss term, or a hard constraint on the flow trajectory), nor does it report a quantitative manifold-consistency metric between stage-1 and stage-2 outputs. Without such evidence, it remains possible that the refinement step simply relaxes the stage-1 prediction, undermining the central narrative.

    Authors: We thank the referee for this important observation. In §3.2 the enforcement is achieved by initializing the rectified-flow trajectory directly from the Stage-1 core predictor output and performing deterministic, noise-free flow-matching steps; because no stochastic noise is injected and the velocity field is conditioned only on the Stage-1 prediction, the trajectory remains inside the manifold by construction. To make this explicit and to provide quantitative support, we will add (i) a precise description of the initialization and conditioning procedure and (ii) a new manifold-consistency metric (mean L2 distance and SSIM between Stage-1 and Stage-2 outputs on the test set) in the revised §3.2 and §4.3. These additions will empirically confirm that Stage-2 refinement does not materially deviate from the core prediction. revision: yes

  2. Referee: [Table 2] Table 2 (depth estimation results): The SOTA claim with 59K samples requires explicit side-by-side comparison against the strongest baselines trained on the same 59K subset (or with matched data budgets) rather than only against models trained on full large-scale datasets. If the table only reports numbers against full-data methods, the efficiency argument is not fully substantiated.

    Authors: We agree that matched-data-budget comparisons would further substantiate the efficiency argument. While the current Table 2 demonstrates competitiveness against published full-data models, we will add a new column (or supplementary table) reporting the performance of representative strong baselines (e.g., the best discriminative depth estimators) when retrained from scratch on the identical 59K-sample subset used for Lotus-2. This revision will allow direct attribution of gains to the diffusion-prior adaptation protocol rather than to data volume. revision: yes

  3. Referee: [§4.3] §4.3 (Ablations): The incremental contribution of the LCM and the constrained refinement must be quantified against a direct single-stage fine-tuning baseline of the same pre-trained diffusion backbone. Current ablations appear to compare only internal variants; without the external baseline, it is unclear whether the two-stage protocol itself, rather than the choice of backbone, drives the reported gains.

    Authors: We appreciate the request for a stronger external baseline. The existing ablations isolate the contributions of LCM and the detail sharpener within our two-stage design. To directly address the referee’s concern, we will include an additional single-stage fine-tuning baseline that applies the identical pre-trained diffusion backbone with standard supervised regression (no LCM, no Stage-2 refinement). Results will be reported in the revised §4.3, enabling readers to isolate the benefit of the two-stage protocol itself. revision: yes
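
The manifold-consistency check proposed in response 1 (mean L2 distance and SSIM between Stage-1 and Stage-2 outputs) could look roughly like the sketch below. It is a simplified stand-in rather than the authors' metric: outputs are flattened lists of floats in [0, 1], `global_ssim` evaluates the SSIM formula once over the whole map instead of over local windows as standard SSIM does, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flattened prediction maps."""
    return math.sqrt(sum((x - y) * (x - y) for x, y in zip(a, b)))


def global_ssim(a: Sequence[float], b: Sequence[float], data_range: float = 1.0) -> float:
    """SSIM formula applied to whole-map statistics (simplified, no sliding window)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) * (x - mu_a) for x in a) / n
    var_b = sum((y - mu_b) * (y - mu_b) for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2   # usual SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


def manifold_consistency(stage1: Sequence[Sequence[float]],
                         stage2: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean L2 distance and mean (global) SSIM over paired Stage-1/Stage-2 outputs."""
    pairs = list(zip(stage1, stage2))
    mean_l2 = sum(l2_distance(a, b) for a, b in pairs) / len(pairs)
    mean_ssim = sum(global_ssim(a, b) for a, b in pairs) / len(pairs)
    return mean_l2, mean_ssim


stage1_outputs = [[0.1, 0.5, 0.9, 0.3]]
unchanged = manifold_consistency(stage1_outputs, stage1_outputs)
refined = manifold_consistency(stage1_outputs, [[0.1, 0.55, 0.85, 0.3]])
print(unchanged[0] == 0.0, refined[0] > 0.0, refined[1] < 1.0)  # True True True
```

A small mean L2 with SSIM near 1 would indicate that refinement stays close to the core prediction; by themselves these numbers would not show that the refinement is an improvement.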

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained priors and empirical adaptation

full rationale

The paper's central claim rests on adapting an external pre-trained diffusion model via a two-stage deterministic framework (single-step clean-data predictor with LCM in stage 1, constrained rectified-flow refinement in stage 2). No equations or steps reduce the final geometric predictions or SOTA results to fitted parameters or self-defined quantities by construction. The performance with 59K samples is presented as an empirical outcome exploiting the external prior, not a mathematical identity or renamed fit. No self-citation load-bearing, uniqueness theorems, or ansatz smuggling appear in the provided abstract or description. The derivation chain is self-contained against the external generative prior and reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that pre-trained generative models contain extractable geometric structure and that the introduced modules (LCM and detail sharpener) can access it without new contradictions. No explicit free parameters or invented physical entities are named in the abstract.

axioms (1)
  • [domain assumption] Pre-trained diffusion models encode transferable geometric and semantic world priors from image-text data.
    Invoked in the opening paragraph to justify reusing generative models for deterministic geometric inference.
invented entities (2)
  • lightweight local continuity module (LCM) [no independent evidence]
    purpose: Generate globally coherent structures without grid artifacts in the single-step core predictor.
    Introduced as part of the first stage to address a specific failure mode of direct diffusion reuse.
  • detail sharpener [no independent evidence]
    purpose: Perform constrained multi-step rectified-flow refinement to enhance fine-grained geometry.
    Introduced as the second stage operating inside the manifold defined by the core predictor.

pith-pipeline@v0.9.0 · 5822 in / 1402 out tokens · 45673 ms · 2026-05-21T18:16:09.080534+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  2. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  3. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 conditional novelty 6.0

    Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

  4. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 3 Pith papers · 10 internal anchors

  1. [1]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,” arXiv preprint arXiv:2507.02546, 2025.

  2. [2]

    Adding conditional control to text-to-image diffusion models

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847.

  3. [3]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163.

  4. [4]

    2d gaussian splatting for geometrically accurate radiance fields

    B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” in ACM SIGGRAPH 2024 conference papers, 2024, pp. 1–11.

  5. [5]

    Wonder3d: Single image to 3d using cross-domain diffusion

    X. Long, Y.-C. Guo, C. Lin, Y. Liu, Z. Dou, L. Liu, Y. Ma, S.-H. Zhang, M. Habermann, C. Theobalt et al., “Wonder3d: Single image to 3d using cross-domain diffusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9970–9980.

  6. [6]

    Dimer: Disentangled mesh reconstruction model

    L. Jiang, J. Lin, K. Chen, W. Ge, X. Yang, Y. Jiang, Y. Lyu, X. Zheng, Y. Li, and Y. Chen, “Dimer: Disentangled mesh reconstruction model,” arXiv preprint arXiv:2504.17670, 2025.

  7. [7]

    Fb-occ: 3d occupancy prediction based on forward-backward view transformation

    Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, and J. M. Alvarez, “Fb-occ: 3d occupancy prediction based on forward-backward view transformation,” arXiv preprint arXiv:2307.01492, 2023.

  8. [8]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  9. [9]

    Dome: Taming diffusion model into high-fidelity controllable occupancy world model

    S. Gu, W. Yin, B. Jin, X. Guo, J. Wang, H. Li, Q. Zhang, and X. Long, “Dome: Taming diffusion model into high-fidelity controllable occupancy world model,” arXiv preprint arXiv:2410.10429, 2024.

  10. [10]

    Depth map prediction from a single image using a multi-scale deep network

    D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems, vol. 27, 2014.

  11. [11]

    Neural window fully-connected crfs for monocular depth estimation

    W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully-connected crfs for monocular depth estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3916–3925.

  12. [12]

    Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans

    A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10786–10796.

  13. [13]

    Depth anything: Unleashing the power of large-scale unlabeled data

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 10371–10381.

  14. [14]

    Depth anything v2

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” Advances in Neural Information Processing Systems, vol. 37, pp. 21875–21911, 2024.

  15. [15]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    R. Wang, S. Xu, C. Dai, J. Xiang, Y. Deng, X. Tong, and J. Yang, “Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5261–

  16. [16]

    Bi-tta: Bidirectional test-time adapter for remote physiological measurement

    H. Li, H. Lu, and Y.-C. Chen, “Bi-tta: Bidirectional test-time adapter for remote physiological measurement,” in European Conference on Computer Vision. Springer, 2024, pp. 356–374.

  17. [17]

    High-resolution image synthesis with latent diffusion models

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.

  18. [18]

    Bfl.ai announces the flux.1 suite of models

    BFL.ai. (2024, Aug.) Bfl.ai announces the flux.1 suite of models. [Online]. Available: https://bfl.ai/announcements/24-08-01-bfl

  19. [19]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25278–25294, 2022.

  20. [20]

    Exploiting diffusion prior for generalizable dense prediction

    H.-Y. Lee, H.-Y. Tseng, and M.-H. Yang, “Exploiting diffusion prior for generalizable dense prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7861–7871.

  21. [21]

    Repurposing diffusion-based image generators for monocular depth estimation

    B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9492–9502.

  22. [22]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    J. He, H. Li, W. Yin, Y. Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y.-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” arXiv preprint arXiv:2409.18124, 2024.

  23. [23]

    Da 2: Depth anything in any direction

    H. Li, W. Zheng, J. He, Y. Liu, X. Lin, X. Yang, Y.-C. Chen, and C. Guo, “Da 2: Depth anything in any direction,” arXiv preprint arXiv:2509.26618, 2025.

  24. [24]

    Jasmine: Harnessing diffusion prior for self-supervised depth estimation

    J. Wang, C. Lin, C. Guan, L. Nie, J. He, H. Li, K. Liao, and Y. Zhao, “Jasmine: Harnessing diffusion prior for self-supervised depth estimation,” arXiv preprint arXiv:2503.15905, 2025.

  25. [25]

    X. Fu, W. Yin, M. Hu, K. Wang, Y. Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” in European Conference on Computer Vision. Springer, 2024, pp. 241–258.

  26. [26]

    C. Zhao, M. Liu, H. Zheng, M. Zhu, Z. Zhao, H. Chen, T. He, and C. Shen, “Diception: A generalist diffusion model for visual perceptual tasks,” arXiv preprint, 2025.

  27. [27]

    Shape and motion from image streams under orthography: a factorization method

    C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,” International journal of computer vision, vol. 9, no. 2, pp. 137–154, 1992.

  28. [28]

    Modeling the world from internet photo collections

    N. Snavely, S. M. Seitz, and R. Szeliski, “Modeling the world from internet photo collections,” International journal of computer vision, vol. 80, no. 2, pp. 189–210, 2008.

  29. [29]

    Photometric method for determining surface orientation from multiple images

    R. J. Woodham, “Photometric method for determining surface orientation from multiple images,” Optical engineering, vol. 19, no. 1, pp. 139–144, 1980.

  30. [30]

    A taxonomy and evaluation of dense two-frame stereo correspondence algorithms

    D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1, pp. 7–42, 2002.

  31. [31]

    Multiple view geometry in computer vision

    R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.

  32. [32]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,

    R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 3, pp. 1623–1637, 2020. 3, 11

  33. [33]

    Vision transformers for dense prediction,

    R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12179–12188. 3

  34. [34]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. 3

  35. [35]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30,

  36. [36]

    Generating diverse high-fidelity images with vq-vae-2,

    A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” Advances in neural information processing systems, vol. 32, 2019. 3

  37. [37]

    Generative adversarial nets,

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014. 3

  38. [38]

    Pixelfolder: An efficient progressive pixel synthesis network for image generation,

    J. He, Y. Zhou, Q. Zhang, J. Peng, Y. Shen, X. Sun, C. Chen, and R. Ji, “Pixelfolder: An efficient progressive pixel synthesis network for image generation,” arXiv preprint arXiv:2204.00833, 2022. 3

  39. [39]

    A style-based generator architecture for generative adversarial networks,

    T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–

  40. [40]

    Analyzing and improving the image quality of stylegan,

    T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119. 3

  41. [41]

    Alias-free generative adversarial networks,

    T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 852–863, 2021. 3

  42. [42]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020. 3, 4

  43. [43]

    Denoising Diffusion Implicit Models

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020. 3

  44. [44]

    Discene: Object decoupling and interaction modeling for complex scene generation,

    X.-L. Li, H. Li, H.-X. Chen, T.-J. Mu, and S.-M. Hu, “Discene: Object decoupling and interaction modeling for complex scene generation,” in SIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–12. 3

  45. [45]

    Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,

    Y. Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y. Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 6517–6526. 3

  46. [46]

    Advancing high-fidelity 3d and texture generation with 2.5d latents,

    X. Yang, J. Lin, Y. Xu, H. Li, and Y. Chen, “Advancing high-fidelity 3d and texture generation with 2.5d latents,” arXiv preprint arXiv:2505.21050, 2025. 3

  47. [47]

    Disenvisioner: Disentangled and enriched visual prompt for customized image generation,

    J. He, H. Li, Y. Hu, G. Shen, Y. Cai, W. Qiu, and Y.-C. Chen, “Disenvisioner: Disentangled and enriched visual prompt for customized image generation,” arXiv preprint arXiv:2410.02067, 2024. 3

  48. [48]

    Tartanair: A dataset to push the limits of visual slam,

    W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916. 4

  49. [49]

    Megadepth: Learning single-view depth prediction from internet photos,

    Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2041–2050. 4

  50. [50]

    Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation,

    Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, and X. Chu, “Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation,” arXiv preprint arXiv:1912.09678,

  51. [51]

    Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes,

    J. Cho, D. Min, Y. Kim, and K. Sohn, “Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes,” arXiv preprint arXiv:2110.11590, 2021. 4

  52. [52]

    Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,

    Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan, “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1790–1799. 4

  53. [53]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952,

  54. [54]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241. 4

  55. [55]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation,” arXiv preprint arXiv:2402.17245, 2024. 4

  56. [56]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu et al., “Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” arXiv preprint arXiv:2310.00426, 2023. 4

  57. [57]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205. 4, 5

  58. [58]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022. 4

  59. [59]

    Flow Matching for Generative Modeling

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747,

  60. [60]

    Scaling rectified flow transformers for high-resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024. 4

  61. [61]

    Introducing AuraFlow v0.1, an open exploration of large rectified flow models,

    cloneofsimo and Team Fal, “Introducing AuraFlow v0.1, an open exploration of large rectified flow models,” July 2024, accessed: 2025-02-25. [Online]. Available: https://blog.fal.ai/auraflow/ 4

  62. [62]

    Depthfm: Fast generative monocular depth estimation with flow matching,

    M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer, “Depthfm: Fast generative monocular depth estimation with flow matching,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3203–3211. 4

  63. [63]

    Fine-tuning image-conditional diffusion models is easier than you think,

    G. M. Garcia, K. Abou Zeid, C. Schmidt, D. De Geus, A. Hermans, and B. Leibe, “Fine-tuning image-conditional diffusion models is easier than you think,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 753–762. 4

  64. [64]

    Stablenormal: Reducing diffusion variance for stable and sharp normal,

    C. Ye, L. Qiu, X. Gu, Q. Zuo, Y. Wu, Z. Dong, L. Bo, Y. Xiu, and X. Han, “Stablenormal: Reducing diffusion variance for stable and sharp normal,” arXiv preprint arXiv:2406.16864, 2024. 4, 11

  65. [65]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016. 8

  66. [66]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022. 10

  67. [67]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,

    M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10912–10922. 10

  68. [68]

    Virtual KITTI 2

    Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020. 10

  69. [69]

    Indoor segmentation and support inference from rgbd images,

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–

  70. [70]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839. 11

  71. [71]

    Vision meets robotics: The kitti dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013. 11

  72. [72]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos,

    T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–

  73. [73]

    Diode: A dense indoor and outdoor depth dataset,

    I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter et al., “Diode: A dense indoor and outdoor depth dataset,” arXiv preprint arXiv:1908.00463,

  74. [74]

    Evaluation of cnn-based single-image depth estimation methods,

    T. Koch, L. Liebel, F. Fraundorfer, and M. Korner, “Evaluation of cnn-based single-image depth estimation methods,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0. 11

  75. [75]

    A naturalistic open source movie for optical flow evaluation,

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12. Springer, 2012, pp. 611–625. 11

  76. [76]

    Rethinking inductive biases for surface normal estimation,

    G. Bae and A. J. Davison, “Rethinking inductive biases for surface normal estimation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 11, 12

VII. BIOGRAPHY SECTION

Jing He is a Doctor of Philosophy student at AI Thrust of Hong Kong University of Science and Technology (Guangzhou). Her research interest lies in visual generative mode...