Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Pith reviewed 2026-05-21 18:16 UTC · model grok-4.3
The pith
Lotus-2 adapts pre-trained diffusion models into a two-stage deterministic system that achieves state-of-the-art monocular depth estimation with only 59K training samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lotus-2 is a two-stage deterministic framework. The first stage employs a single-step predictor with a clean-data objective and a lightweight local continuity module to produce globally coherent geometry free of grid artifacts. The second stage performs constrained multi-step rectified-flow refinement inside the manifold of the core predictor to enhance fine-grained details. Together, the two stages turn the pre-trained generative prior into a stable deterministic world prior for geometric dense prediction.
What carries the argument
The two-stage deterministic adaptation: a single-step core predictor with a local continuity module, followed by constrained rectified-flow refinement that keeps all outputs inside the manifold defined by the core predictor.
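The control flow of such a two-stage scheme can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `core_predictor` and `velocity_field` are hypothetical stand-ins for the trained networks, the step count is arbitrary, and starting the refinement from the Stage-1 output follows the description in the simulated rebuttal further down this page.

```python
import numpy as np

def core_predictor(image):
    """Hypothetical stand-in for Stage 1: one deterministic forward pass.
    A channel mean is used here only so the sketch runs without a model."""
    return image.mean(axis=-1)

def velocity_field(z, coarse, t):
    """Hypothetical stand-in for the Stage-2 velocity network.
    This toy version points from z toward a lightly sharpened copy of `coarse`."""
    target = coarse + 0.1 * (coarse - coarse.mean())
    return target - z

def refine(coarse, num_steps=4):
    """Deterministic Euler integration over t in [0, 1]; no noise is sampled."""
    z = coarse.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_field(z, coarse, k * dt)
    return z

def predict(image, num_steps=4):
    """Stage 1 produces the coarse map, Stage 2 refines it."""
    coarse = core_predictor(image)
    return coarse, refine(coarse, num_steps)

image = np.random.default_rng(0).random((8, 8, 3))
coarse, fine = predict(image)
print(coarse.shape, float(np.abs(fine - coarse).max()))
```

Because nothing in the loop is random, repeated calls on the same image return identical outputs, which is the property the page means by "deterministic".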
If this is right
- Monocular depth estimation can reach new state-of-the-art accuracy while training on far fewer images than current large-scale supervised approaches.
- Surface normal prediction can remain competitive with existing methods when the same minimal-data regime is used.
- Diffusion models can function as deterministic world priors rather than purely stochastic generators for tasks that demand stable, accurate output.
- Geometric dense prediction can shift from scaling up labeled data to extracting the priors already learned during image-text pre-training.
Where Pith is reading between the lines
- The same adaptation protocol might be tested on other single-image dense tasks such as semantic segmentation or surface reconstruction to see whether the deterministic prior extraction generalizes.
- If the method succeeds across diverse scene types, it would imply that the main remaining bottleneck in geometric vision is prior extraction rather than raw data volume.
- One could check whether the deterministic outputs preserve semantic consistency alongside geometric accuracy on images containing rare objects or unusual lighting conditions.
Load-bearing premise
The pre-trained diffusion model already encodes stable, transferable geometric knowledge that can be extracted via a deterministic single-step predictor plus constrained refinement without introducing new inconsistencies or losing the prior's benefits.
What would settle it
A controlled experiment in which Lotus-2 is compared directly against a standard discriminative regression model trained on exactly the same 59K samples; if the two-stage diffusion adaptation shows no accuracy gain on held-out test sets such as NYUv2 depth or KITTI, the value of the generative prior would be called into question.
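Such a comparison needs a shared scoring function. The sketch below computes two metrics commonly reported on NYUv2 and KITTI, absolute relative error (AbsRel) and the δ < 1.25 accuracy. Whether the paper reports exactly these metrics, and how it aligns scale and shift before scoring, is not stated in this excerpt; the function name and the validity mask are assumptions for illustration.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Return (AbsRel, delta1) over pixels with valid ground truth.

    pred, gt: arrays of the same shape holding depth in the same units.
    Pixels where gt <= eps are treated as missing and ignored.
    """
    valid = gt > eps
    p = np.maximum(pred[valid], eps)  # guard against division by zero
    g = gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1

gt = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no ground truth
pred = np.array([1.1, 2.0, 8.0, 3.0])
print(depth_metrics(pred, gt))
```

Running both models through the same function on the same held-out images keeps the comparison limited to the models themselves.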
Original abstract
Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data, as well as by limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Lotus-2, a two-stage deterministic framework for adapting pre-trained diffusion models to geometric dense prediction tasks including monocular depth estimation and surface normal prediction. Stage 1 employs a single-step deterministic core predictor with a clean-data objective and a lightweight local continuity module (LCM) to produce globally coherent outputs without grid artifacts. Stage 2 applies a detail sharpener that performs constrained multi-step rectified-flow refinement strictly inside the manifold defined by the core predictor output. The central empirical claim is that this protocol yields new state-of-the-art depth estimation results and competitive normal prediction using only 59K training samples, less than 1% of existing large-scale datasets.
Significance. If the reported performance gains are robustly supported by the experiments, the work would be significant for showing that generative diffusion priors can be converted into stable deterministic geometric predictors with minimal task-specific data. This offers a concrete adaptation protocol that could reduce dependence on massive labeled geometric datasets while preserving the world knowledge encoded in large-scale image-text pre-training.
major comments (3)
- [§3.2] §3.2 (Detail Sharpener): The claim that refinement occurs 'strictly inside the manifold defined by the core predictor' is load-bearing for attributing gains to the diffusion prior rather than to additional supervised fitting. The manuscript does not specify the concrete enforcement mechanism (e.g., a projection operator, a consistency loss term, or a hard constraint on the flow trajectory), nor does it report a quantitative manifold-consistency metric between stage-1 and stage-2 outputs. Without such evidence, it remains possible that the refinement step simply relaxes the stage-1 prediction, undermining the central narrative.
- [Table 2] Table 2 (depth estimation results): The SOTA claim with 59K samples requires explicit side-by-side comparison against the strongest baselines trained on the same 59K subset (or with matched data budgets) rather than only against models trained on full large-scale datasets. If the table only reports numbers against full-data methods, the efficiency argument is not fully substantiated.
- [§4.3] §4.3 (Ablations): The incremental contribution of the LCM and the constrained refinement must be quantified against a direct single-stage fine-tuning baseline of the same pre-trained diffusion backbone. Current ablations appear to compare only internal variants; without the external baseline, it is unclear whether the two-stage protocol itself, rather than the choice of backbone, drives the reported gains.
minor comments (3)
- [Abstract] The abstract states 'highly competitive surface normal prediction' without naming the primary metrics (e.g., mean angular error) or the evaluation datasets; adding these specifics would improve clarity.
- [§3.2] Notation for the rectified-flow velocity field and the constraint operator should be introduced with an equation in §3.2 rather than only in prose.
- [Figure 3] Figure 3 (qualitative results) would benefit from zoomed insets highlighting the fine-grained geometry improvements claimed for the detail sharpener.
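For reference, the mean angular error mentioned in the first minor comment is usually computed as the average angle between predicted and ground-truth normal vectors. The sketch below is a generic version of that metric; the function name is hypothetical, and the datasets and masking rules used in the paper are not given in this excerpt.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, eps=1e-8):
    """Mean angle in degrees between two normal maps of shape (..., 3)."""
    # Normalize both maps so the dot product equals the cosine of the angle.
    pred = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    gt = gt / np.maximum(np.linalg.norm(gt, axis=-1, keepdims=True), eps)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

x_axis = np.array([[1.0, 0.0, 0.0]])
y_axis = np.array([[0.0, 1.0, 0.0]])
print(mean_angular_error_deg(x_axis, y_axis))
```

The clip guards against values slightly outside [-1, 1] caused by floating-point rounding, which would otherwise make `arccos` return NaN.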
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify key aspects of our two-stage adaptation framework. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements where they strengthen the presentation without altering the core claims.
- Referee: [§3.2] §3.2 (Detail Sharpener): The claim that refinement occurs 'strictly inside the manifold defined by the core predictor' is load-bearing for attributing gains to the diffusion prior rather than to additional supervised fitting. The manuscript does not specify the concrete enforcement mechanism (e.g., a projection operator, a consistency loss term, or a hard constraint on the flow trajectory), nor does it report a quantitative manifold-consistency metric between stage-1 and stage-2 outputs. Without such evidence, it remains possible that the refinement step simply relaxes the stage-1 prediction, undermining the central narrative.
  Authors: We thank the referee for this important observation. In §3.2 the enforcement is achieved by initializing the rectified-flow trajectory directly from the Stage-1 core predictor output and performing deterministic, noise-free flow-matching steps; because no stochastic noise is injected and the velocity field is conditioned only on the Stage-1 prediction, the trajectory remains inside the manifold by construction. To make this explicit and to provide quantitative support, we will add (i) a precise description of the initialization and conditioning procedure and (ii) a new manifold-consistency metric (mean L2 distance and SSIM between Stage-1 and Stage-2 outputs on the test set) in the revised §3.2 and §4.3. These additions will empirically confirm that Stage-2 refinement does not materially deviate from the core prediction.
  Revision: yes
- Referee: [Table 2] Table 2 (depth estimation results): The SOTA claim with 59K samples requires explicit side-by-side comparison against the strongest baselines trained on the same 59K subset (or with matched data budgets) rather than only against models trained on full large-scale datasets. If the table only reports numbers against full-data methods, the efficiency argument is not fully substantiated.
  Authors: We agree that matched-data-budget comparisons would further substantiate the efficiency argument. While the current Table 2 demonstrates competitiveness against published full-data models, we will add a new column (or supplementary table) reporting the performance of representative strong baselines (e.g., the best discriminative depth estimators) when retrained from scratch on the identical 59K-sample subset used for Lotus-2. This revision will allow direct attribution of gains to the diffusion-prior adaptation protocol rather than to data volume.
  Revision: yes
- Referee: [§4.3] §4.3 (Ablations): The incremental contribution of the LCM and the constrained refinement must be quantified against a direct single-stage fine-tuning baseline of the same pre-trained diffusion backbone. Current ablations appear to compare only internal variants; without the external baseline, it is unclear whether the two-stage protocol itself, rather than the choice of backbone, drives the reported gains.
  Authors: We appreciate the request for a stronger external baseline. The existing ablations isolate the contributions of LCM and the detail sharpener within our two-stage design. To directly address the referee’s concern, we will include an additional single-stage fine-tuning baseline that applies the identical pre-trained diffusion backbone with standard supervised regression (no LCM, no Stage-2 refinement). Results will be reported in the revised §4.3, enabling readers to isolate the benefit of the two-stage protocol itself.
  Revision: yes
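The manifold-consistency check proposed in the first response (mean L2 distance and SSIM between Stage-1 and Stage-2 outputs) could look roughly like the sketch below. Several choices here are assumptions, not the authors' definitions: "mean L2 distance" is read as the per-image root-mean-square difference averaged over images, SSIM is computed with a single global window rather than the usual sliding local windows, and all function names are hypothetical.

```python
import numpy as np

def rms_distance(a, b):
    """Root-mean-square difference between two maps of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def global_ssim(a, b, data_range=1.0):
    """SSIM formula applied once over the whole image (simplified variant)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

def consistency_report(stage1_maps, stage2_maps):
    """Average both scores over paired Stage-1 / Stage-2 outputs."""
    pairs = list(zip(stage1_maps, stage2_maps))
    l2 = np.mean([rms_distance(a, b) for a, b in pairs])
    ssim = np.mean([global_ssim(a, b) for a, b in pairs])
    return float(l2), float(ssim)

rng = np.random.default_rng(0)
stage1 = [rng.random((16, 16)) for _ in range(3)]
stage2 = [m + 0.01 * rng.standard_normal(m.shape) for m in stage1]
print(consistency_report(stage1, stage2))
```

A small distance and an SSIM near 1 would indicate that refinement stays close to the core prediction. Neither number alone shows that the output lies on any particular manifold, which is the gap the referee points to.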
Circularity Check
No significant circularity; derivation relies on external pre-trained priors and empirical adaptation
The paper's central claim rests on adapting an external pre-trained diffusion model via a two-stage deterministic framework (single-step clean-data predictor with LCM in stage 1, constrained rectified-flow refinement in stage 2). No equations or steps reduce the final geometric predictions or SOTA results to fitted parameters or self-defined quantities by construction. The performance with 59K samples is presented as an empirical outcome of exploiting the external prior, not as a mathematical identity or a renamed fit. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the provided abstract or description. The derivation chain is self-contained given the external generative prior and the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained diffusion models encode transferable geometric and semantic world priors from image-text data.
invented entities (2)
- lightweight local continuity module (LCM): no independent evidence
- detail sharpener: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) ... constrained multi-step rectified-flow refinement within the manifold defined by the core predictor
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: rectified-flow formulation ... z_t = t z_x + (1-t) z_y, v = z_x - z_y
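The rectified-flow expression quoted in the last entry can be unpacked in a few lines. This is a sketch derived from the quoted formula alone: which of z_x and z_y is the image latent and which is the annotation latent is not stated in this excerpt, and the learned velocity v_θ and the step count N are illustrative symbols, not the paper's notation.

```latex
\begin{align}
  z_t &= t\,z_x + (1-t)\,z_y, \qquad z_0 = z_y, \quad z_1 = z_x, \\
  \frac{\mathrm{d}z_t}{\mathrm{d}t} &= z_x - z_y = v, \\
  z_{t+\Delta t} &= z_t + \Delta t\, v_\theta(z_t, t), \qquad \Delta t = \tfrac{1}{N}.
\end{align}
```

The first line is a straight path between the two endpoints, so its time derivative in the second line is the constant v given in the quote. The third line is the Euler update a multi-step sampler would use once a network v_θ approximates v. If v_θ equals v exactly, N steps starting from z_y reach z_x for any N.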
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
  UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
- Image Generators are Generalist Vision Learners
  Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
  Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Reference graph
Works this paper leans on
-
[1]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
R. Wang, S. Xu, Y . Dong, Y . Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,”arXiv preprint arXiv:2507.02546, 2025. 1, 2, 3, 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847. 2
work page 2023
-
[3]
Animate anyone: Consistent and controllable image-to-video synthesis for character animation,
L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163. 2
work page 2024
-
[4]
2d gaussian splatting for geometrically accurate radiance fields,
B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” inACM SIGGRAPH 2024 conference papers, 2024, pp. 1–11. 2
work page 2024
-
[5]
Wonder3d: Single image to 3d using cross-domain diffusion,
X. Long, Y .-C. Guo, C. Lin, Y . Liu, Z. Dou, L. Liu, Y . Ma, S.-H. Zhang, M. Habermann, C. Theobaltet al., “Wonder3d: Single image to 3d using cross-domain diffusion,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9970–9980. 2
work page 2024
-
[6]
Dimer: Disentangled mesh reconstruction model,
L. Jiang, J. Lin, K. Chen, W. Ge, X. Yang, Y . Jiang, Y . Lyu, X. Zheng, Y . Li, and Y . Chen, “Dimer: Disentangled mesh reconstruction model,” arXiv preprint arXiv:2504.17670, 2025. 2
-
[7]
Fb-occ: 3d occupancy prediction based on forward-backward view transformation,
Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, and J. M. Alvarez, “Fb-occ: 3d occupancy prediction based on forward-backward view transformation,”arXiv preprint arXiv:2307.01492, 2023. 2
-
[8]
Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2
work page 2024
-
[9]
Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model
S. Gu, W. Yin, B. Jin, X. Guo, J. Wang, H. Li, Q. Zhang, and X. Long, “Dome: Taming diffusion model into high-fidelity controllable occupancy world model,”arXiv preprint arXiv:2410.10429, 2024. 2
-
[10]
Depth map prediction from a single image using a multi-scale deep network,
D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,”Advances in neural information processing systems, vol. 27, 2014. 2, 3
work page 2014
-
[11]
Neural window fully- connected crfs for monocular depth estimation,
W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully- connected crfs for monocular depth estimation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3916–3925. 2 14
work page 2022
-
[12]
Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,
A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 786–10 796. 2, 3
work page 2021
-
[13]
Depth anything: Unleashing the power of large-scale unlabeled data,
L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 10 371–10 381. 2, 3, 11
work page 2024
-
[14]
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”Advances in Neural Information Processing Sys- tems, vol. 37, pp. 21 875–21 911, 2024. 2, 3, 11
work page 2024
-
[15]
R. Wang, S. Xu, C. Dai, J. Xiang, Y . Deng, X. Tong, and J. Yang, “Moge: Unlocking accurate monocular geometry estimation for open- domain images with optimal training supervision,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5261–
work page 2025
-
[16]
Bi-tta: Bidirectional test-time adapter for remote physiological measurement,
H. Li, H. Lu, and Y .-C. Chen, “Bi-tta: Bidirectional test-time adapter for remote physiological measurement,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 356–374. 2
work page 2024
-
[17]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695. 2, 4
work page 2022
-
[18]
(2024, Aug.) Bfl.ai announces the flux.1 suite of models
BFL.ai. (2024, Aug.) Bfl.ai announces the flux.1 suite of models. [Online]. Available: https://bfl.ai/announcements/24-08-01-bfl 2, 4, 5, 10
work page 2024
-
[19]
Laion- 5b: An open large-scale dataset for training next generation image-text models,
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsmanet al., “Laion- 5b: An open large-scale dataset for training next generation image-text models,”Advances in neural information processing systems, vol. 35, pp. 25 278–25 294, 2022. 2, 3
work page 2022
-
[20]
Exploiting diffusion prior for generalizable dense prediction,
H.-Y . Lee, H.-Y . Tseng, and M.-H. Yang, “Exploiting diffusion prior for generalizable dense prediction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7861–7871. 2
work page 2024
-
[21]
Repurposing diffusion-based image generators for monoc- ular depth estimation,
B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monoc- ular depth estimation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9492–9502. 2, 4, 5, 11
work page 2024
-
[22]
Lotus: Diffusion-based visual foundation model for high-quality dense prediction
J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high- quality dense prediction,”arXiv preprint arXiv:2409.18124, 2024. 2, 4
-
[23]
Da 2: Depth anything in any direction,
H. Li, W. Zheng, J. He, Y . Liu, X. Lin, X. Yang, Y .-C. Chen, and C. Guo, “Da 2: Depth anything in any direction,”arXiv preprint arXiv:2509.26618, 2025. 2
-
[24]
Jasmine: Harnessing diffusion prior for self-supervised depth estimation,
J. Wang, C. Lin, C. Guan, L. Nie, J. He, H. Li, K. Liao, and Y . Zhao, “Jasmine: Harnessing diffusion prior for self-supervised depth estimation,”arXiv preprint arXiv:2503.15905, 2025. 2
-
[25]
X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image, inEuropean Conference on Computer Vision. Springer, 2024, pp. 241–258. 2, 4, 5, 11
work page 2024
-
[26]
C. Zhao, M. Liu, H. Zheng, M. Zhu, Z. Zhao, H. Chen, T. He, and Chunhua Shen. Diception: A generalist diffusion model for visual perceptual tasks,arXiv preprint, 2025. 2, 4
work page 2025
-
[27]
Shape and motion from image streams under orthography: a factorization method,
C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,”International journal of computer vision, vol. 9, no. 2, pp. 137–154, 1992. 3
work page 1992
-
[28]
Modeling the world from internet photo collections,
N. Snavely, S. M. Seitz, and R. Szeliski, “Modeling the world from internet photo collections,”International journal of computer vision, vol. 80, no. 2, pp. 189–210, 2008. 3
work page 2008
-
[29]
Photometric method for determining surface orien- tation from multiple images,
R. J. Woodham, “Photometric method for determining surface orien- tation from multiple images,”Optical engineering, vol. 19, no. 1, pp. 139–144, 1980. 3
work page 1980
-
[30]
A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,
D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,”International journal of computer vision, vol. 47, no. 1, pp. 7–42, 2002. 3
work page 2002
-
[31]
R. Hartley and A. Zisserman,Multiple view geometry in computer vision. Cambridge university press, 2003. 3
work page 2003
-
[32]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,
R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 3, pp. 1623–1637, 2020. 3, 11
work page 2020
-
[33]
Vision transformers for dense prediction,
R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 179–12 188. 3
work page 2021
-
[34]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[35]
Neural discrete representation learning,
A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30,
-
[36]
Generating diverse high-fidelity images with vq-vae-2,
A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,”Advances in neural information processing systems, vol. 32, 2019. 3
work page 2019
-
[37]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014. 3
work page 2014
-
[38]
Pixelfolder: An efficient progressive pixel synthesis network for image generation,
J. He, Y . Zhou, Q. Zhang, J. Peng, Y . Shen, X. Sun, C. Chen, and R. Ji, “Pixelfolder: An efficient progressive pixel synthesis network for image generation,”arXiv preprint arXiv:2204.00833, 2022. 3
-
[39]
A style-based generator architecture for generative adversarial networks,
T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–
work page 2019
-
[40]
Analyzing and improving the image quality of stylegan,
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119. 3
work page 2020
-
[41]
Alias-free generative adversarial networks,
T. Karras, M. Aittala, S. Laine, E. H ¨ark¨onen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,”Advances in Neural Information Processing Systems, vol. 34, pp. 852–863, 2021. 3
work page 2021
-
[42]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020. 3, 4
work page 2020
-
[43]
Denoising Diffusion Implicit Models
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[44]
Discene: Object decoupling and interaction modeling for complex scene generation,
X.-L. Li, H. Li, H.-X. Chen, T.-J. Mu, and S.-M. Hu, “Discene: Object decoupling and interaction modeling for complex scene generation,” in SIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–12. 3
work page 2024
-
[45]
Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,
Y . Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y . Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 6517–6526. 3
work page 2024
-
[46]
Advancing high- fidelity 3d and texture generation with 2.5 d latents,
X. Yang, J. Lin, Y . Xu, H. Li, and Y . Chen, “Advancing high- fidelity 3d and texture generation with 2.5 d latents,”arXiv preprint arXiv:2505.21050, 2025. 3
-
[47]
Disenvisioner: Disentangled and enriched visual prompt for customized image generation,
J. He, H. Li, Y . Hu, G. Shen, Y . Cai, W. Qiu, and Y .-C. Chen, “Disenvisioner: Disentangled and enriched visual prompt for customized image generation,”arXiv preprint arXiv:2410.02067, 2024. 3
-
[48]
Tartanair: A dataset to push the limits of visual slam,
W. Wang, D. Zhu, X. Wang, Y . Hu, Y . Qiu, C. Wang, Y . Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916. 4
work page 2020
-
[49]
Megadepth: Learning single-view depth predic- tion from internet photos,
Z. Li and N. Snavely, “Megadepth: Learning single-view depth predic- tion from internet photos,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2041–2050. 4
work page 2018
-
[50]
Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, and X. Chu, “Irs: A large naturalistic indoor robotics stereo dataset to train deep models for dis- parity and surface normal estimation,”arXiv preprint arXiv:1912.09678,
-
[51]
J. Cho, D. Min, Y . Kim, and K. Sohn, “Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes,”arXiv preprint arXiv:2110.11590, 2021. 4
-
[52]
Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,
Y . Yao, Z. Luo, S. Li, J. Zhang, Y . Ren, L. Zhou, T. Fang, and L. Quan, “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1790–1799. 4
work page 2020
-
[53]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952,
work page internal anchor Pith review Pith/arXiv arXiv
-
[54]
U-net: Convolutional networks for biomedical image segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international con- ference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241. 4 15
work page 2015
-
[55]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Play- ground v2. 5: Three insights towards enhancing aesthetic quality in text- to-image generation,”arXiv preprint arXiv:2402.17245, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y . Wu, Z. Wang, J. Kwok, P. Luo, H. Luet al., “Pixart-α: Fast training of diffusion transformer for photore- alistic text-to-image synthesis,”arXiv preprint arXiv:2310.00426, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Scalable diffusion models with transformers,
W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205. 4, 5
-
[58]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022. 4
-
[59]
Flow Matching for Generative Modeling
Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747,
-
[60]
Scaling rectified flow transformers for high-resolution image synthesis
P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024. 4
-
[61]
Introducing AuraFlow v0.1, an open exploration of large rectified flow models
cloneofsimo and Team Fal, “Introducing AuraFlow v0.1, an open exploration of large rectified flow models,” July 2024, accessed: 2025-02-25. [Online]. Available: https://blog.fal.ai/auraflow/ 4
-
[62]
Depthfm: Fast generative monocular depth estimation with flow matching
M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer, “Depthfm: Fast generative monocular depth estimation with flow matching,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3203–3211. 4
-
[63]
Fine-tuning image-conditional diffusion models is easier than you think
G. M. Garcia, K. Abou Zeid, C. Schmidt, D. De Geus, A. Hermans, and B. Leibe, “Fine-tuning image-conditional diffusion models is easier than you think,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 753–762. 4
-
[64]
Stablenormal: Reducing diffusion variance for stable and sharp normal
C. Ye, L. Qiu, X. Gu, Q. Zuo, Y. Wu, Z. Dong, L. Bo, Y. Xiu, and X. Han, “Stablenormal: Reducing diffusion variance for stable and sharp normal,” arXiv preprint arXiv:2406.16864, 2024. 4, 11
-
[65]
Gaussian Error Linear Units (GELUs)
D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016. 8
-
[66]
Lora: Low-rank adaptation of large language models
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022. 10
-
[67]
Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding
M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10912–10922. 10
-
[68]
Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020. 10
-
[69]
Indoor segmentation and support inference from rgbd images
N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–
-
[70]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839. 11
-
[71]
Vision meets robotics: The kitti dataset
A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013. 11
-
[72]
A multi-view stereo benchmark with high-resolution images and multi-camera videos
T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–
-
[73]
Diode: A dense indoor and outdoor depth dataset
I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter et al., “Diode: A dense indoor and outdoor depth dataset,” arXiv preprint arXiv:1908.00463,
-
[74]
Evaluation of cnn-based single-image depth estimation methods
T. Koch, L. Liebel, F. Fraundorfer, and M. Korner, “Evaluation of cnn-based single-image depth estimation methods,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0. 11
-
[75]
A naturalistic open source movie for optical flow evaluation
D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12. Springer, 2012, pp. 611–625. 11
-
[76]
Rethinking inductive biases for surface normal estimation
G. Bae and A. J. Davison, “Rethinking inductive biases for surface normal estimation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 11, 12
VII. BIOGRAPHY SECTION
Jing He is a Doctor of Philosophy student at AI Thrust of Hong Kong University of Science and Technology (Guangzhou). Her research interest lies in visual generative mode...