pith. sign in

arxiv: 2605.21116 · v1 · pith:K3FKK2CPnew · submitted 2026-05-20 · 📡 eess.IV

GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control

Pith reviewed 2026-05-21 01:45 UTC · model grok-4.3

classification 📡 eess.IV
keywords SAR image generationcontrollable synthesis3D CAD guidancediffusion modelsscattering centersazimuth generalizationautomatic target recognition
0
0 comments X

The pith

A 3D-guided framework decouples geometry from scattering to generate controllable SAR images across large azimuth gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for generating Synthetic Aperture Radar images where users specify azimuth angle, depression angle, and polarization mode in advance. It does so by rendering a Geometric-Electromagnetic Conditioning Map from a 3D CAD model and feeding that map plus text descriptions into a diffusion backbone. This approach matters because prior SAR synthesis methods could not reliably handle wide missing azimuth sectors or enforce physical consistency across multiple imaging parameters at once. If the method works, synthetic data could be produced that matches real measurements closely enough to improve training of downstream recognition systems without collecting new flights for every viewpoint.

Core claim

By deriving a Geometric-Electromagnetic Conditioning Map from real sparse-azimuth SAR images during training and from 3D CAD models during inference, the framework converts imaging parameters into text conditions while injecting the map through ControlNet for spatial guidance, thereby unifying geometric-electromagnetic control and parameter-aware generation in one diffusion process that generalizes stably across large viewpoint gaps.

What carries the argument

The Geometric-Electromagnetic Conditioning Map (GECM), a structured intermediate representation that encodes the target pose map and dominant scattering centers to separate macroscopic geometry from microscopic scattering responses.

If this is right

  • Users can specify any combination of azimuth, depression angle, and polarization and obtain a corresponding SAR image without retraining.
  • The model maintains usable output quality even when large contiguous azimuth sectors are missing from the training data.
  • Generated images improve fidelity and physical consistency metrics relative to earlier intensity-only conditioning approaches.
  • Downstream Automatic Target Recognition classifiers trained on the synthetic outputs achieve higher accuracy on real test imagery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling idea could be tested on other sensor modalities where 3D models exist, such as LiDAR or infrared.
  • If the GECM proves sufficient, future work might reduce reliance on full 3D CAD models by learning the map directly from sparse real data.
  • The approach opens the possibility of on-demand SAR simulation for operational planning where flight collection is impractical.

Load-bearing premise

That a map rendered from a 3D CAD model under chosen azimuth, depression angle, and polarization conditions accurately captures the dominant scattering centers and overall shape present in real SAR returns of the same target.

What would settle it

A direct comparison showing that images generated for a specific set of imaging parameters produce lower physical-consistency scores or lower Automatic Target Recognition accuracy than real SAR measurements of the identical target taken under those exact parameters.

Figures

Figures reproduced from arXiv: 2605.21116 by Fan Zhang, Fei Ma, Lingxiao Peng, Qiang Yin, Xuanting Wu, Yingbing Liu, Yongsheng Zhou.

Figure 1
Figure 1. Figure 1: Comprehensive comparison of representative generation paradigms across multiple capa [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Evolution of SAR image generation frameworks. (a) Traditional electromagnetic simula [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of the proposed 3D Model-Guided decoupled SAR image generation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Detailed pipeline of SAR-to-GECM derivation. From a real SAR image and its cor [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualized examples of ground-truth GECMs derived from real SAR images across various [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Detailed pipeline of the 3D Model-Guided GECM Generation. A 3D CAD mesh is [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of the 3D Pose Extraction results across varying radar viewpoints. As the [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of the intermediate processes within the Scattering Simulation branch. (a) [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Zero-shot generalization of the 3D-guided GECM generation across diverse aircraft CAD [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Detailed illustration of the dataset partitioning and content for both the simulated [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comprehensive zero-shot generation results of the Boeing 747 aircraft using the GeoDiff [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative visual comparison of generated SAR images across different methods under [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visualizations of downstream ATR performance on the simulated Boeing dataset. (a) [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Visualizations of downstream ATR performance on the real-world Shanxi dataset. (a) [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
read the original abstract

Existing Synthetic Aperture Radar (SAR) image generation methods still lack reliable controllability over key imaging parameters, particularly azimuth angle, depression angle, and polarization mode. Our preliminary GeoDiff-SAR supported limited azimuth completion, but remained ineffective for large missing azimuth sectors and did not provide unified control over multiple imaging conditions. To address this problem, we propose GeoDiff-SAR II, a 3D model-guided decoupled framework for controllable SAR image generation. The proposed framework imposes controllability through physically grounded geometric-electromagnetic cues rather than image intensity alone. We introduce a Geometric-Electromagnetic Conditioning Map (GECM), a structured intermediate representation that encodes the target pose map and dominant scattering centers, thereby decoupling macroscopic geometry from microscopic scattering responses. During training, GECMs are derived from real sparse-azimuth SAR images. During inference, the same representation is rendered directly from a 3D CAD model under specified azimuth, depression angle, and polarization conditions, enabling physically consistent control across large viewpoint gaps. The imaging parameters are further converted into text conditions, while the GECM is injected through ControlNet to provide explicit spatial guidance. Combined with Low-Rank Adaptation (LoRA) on a FLUX backbone, the proposed framework unifies geometric-electromagnetic conditioning and parameter-aware generation within a single process. Experiments on simulated and real datasets demonstrate controllable generation over key SAR imaging parameters, stable generalization across large azimuth gaps, and consistent improvements in image fidelity, physical consistency, and downstream Automatic Target Recognition (ATR) performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GeoDiff-SAR II, a 3D-driven diffusion framework for controllable SAR image generation. It defines a Geometric-Electromagnetic Conditioning Map (GECM) that encodes target pose and dominant scattering centers. During training GECMs are extracted from real sparse-azimuth SAR; at inference the same representation is rendered from 3D CAD models under user-specified azimuth, depression angle, and polarization. These maps are injected via ControlNet while imaging parameters are supplied as text prompts; the backbone is FLUX fine-tuned with LoRA. Experiments on simulated and real data are said to show controllable generation, stable performance across large azimuth gaps, and gains in fidelity, physical consistency, and downstream ATR accuracy.

Significance. If the GECM truly acts as a domain-invariant bridge between CAD-rendered and real-SAR-derived conditioning signals, the work would provide a practical route to parameter-controlled SAR synthesis that could reduce reliance on scarce real measurements for ATR training. The decoupled geometric-electromagnetic design and reuse of a foundation-model backbone with LoRA are technically attractive features.

major comments (2)
  1. [§3.2] §3.2 (GECM rendering and injection): The central claim of stable generalization across large azimuth gaps rests on the assumption that GECMs rendered from 3D CAD models lie within the distribution of GECMs extracted from real SAR at the same parameters. No quantitative comparison (e.g., Wasserstein distance on scattering-center locations/amplitudes or overlap statistics) is reported between the two sources of GECM; without this, the ControlNet may be queried outside its training support, directly undermining both controllability and ATR gains.
  2. [§4.3] §4.3 and Table 3 (ATR evaluation): The abstract asserts consistent improvements in downstream ATR performance, yet the manuscript provides neither the exact ATR metric values, the baseline methods, nor any ablation that isolates the contribution of the GECM versus text conditioning alone. This absence makes it impossible to verify whether the reported gains are load-bearing or merely incremental.
minor comments (2)
  1. [§3.1] The abstract states that GECMs are 'derived from real sparse-azimuth SAR images' but does not specify the extraction algorithm (e.g., peak detection thresholds or pose estimation method); this detail should be added to §3.1 for reproducibility.
  2. [Figure 4] Figure 4 caption refers to 'large azimuth gaps' without stating the exact angular ranges tested; explicit values would clarify the scope of the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment in detail below and outline the revisions we plan to implement.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (GECM rendering and injection): The central claim of stable generalization across large azimuth gaps rests on the assumption that GECMs rendered from 3D CAD models lie within the distribution of GECMs extracted from real SAR at the same parameters. No quantitative comparison (e.g., Wasserstein distance on scattering-center locations/amplitudes or overlap statistics) is reported between the two sources of GECM; without this, the ControlNet may be queried outside its training support, directly undermining both controllability and ATR gains.

    Authors: We acknowledge the validity of this observation. The manuscript currently does not include a quantitative comparison between GECMs from real SAR and CAD-rendered sources. To address this, we will add such an analysis in the revised manuscript, including metrics like Wasserstein distances on scattering center positions and amplitudes, as well as overlap statistics for matched imaging parameters. This will provide evidence that the rendered GECMs are within the training distribution. revision: yes

  2. Referee: [§4.3] §4.3 and Table 3 (ATR evaluation): The abstract asserts consistent improvements in downstream ATR performance, yet the manuscript provides neither the exact ATR metric values, the baseline methods, nor any ablation that isolates the contribution of the GECM versus text conditioning alone. This absence makes it impossible to verify whether the reported gains are load-bearing or merely incremental.

    Authors: We agree that more detailed reporting is necessary to substantiate the ATR improvements. In the revised version, we will provide the exact metric values (such as accuracy, precision, or other relevant measures), explicitly list the baseline methods compared against, and include an ablation study that isolates the impact of the GECM conditioning from the text prompts alone. This will clarify the contribution of each component. revision: yes

Circularity Check

0 steps flagged

No circularity: framework relies on external 3D CAD rendering and real SAR data without self-referential reductions

full rationale

The paper describes a decoupled diffusion framework where GECMs are extracted from real sparse-azimuth SAR images during training and rendered from independent 3D CAD models during inference, with text conditioning on imaging parameters and ControlNet injection. No equations, fitted parameters, or self-citations are shown that define a claimed prediction or performance gain in terms of the same quantities used for evaluation. The central claims of controllability and generalization rest on the external validity of the 3D-to-GECM rendering step rather than any internal loop or renaming of inputs as outputs. This is a standard methodological proposal whose correctness can be assessed against held-out real SAR measurements without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that 3D CAD models plus electromagnetic rendering produce GECMs that transfer to real SAR data; no free parameters are explicitly named in the abstract, but the LoRA adaptation and ControlNet injection implicitly introduce tunable components whose values are not reported.

axioms (1)
  • domain assumption 3D CAD models of targets accurately capture the dominant scattering centers and pose geometry relevant to real SAR imaging
    Invoked when GECMs are rendered from CAD during inference and compared to real sparse-azimuth SAR images during training
invented entities (1)
  • Geometric-Electromagnetic Conditioning Map (GECM) no independent evidence
    purpose: Structured intermediate representation that decouples macroscopic geometry from microscopic scattering responses
    Newly introduced in this work as the core conditioning signal injected via ControlNet

pith-pipeline@v0.9.0 · 5822 in / 1429 out tokens · 40201 ms · 2026-05-21T01:45:01.809625+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce a Geometric-Electromagnetic Conditioning Map (GECM), a structured intermediate representation that encodes the target pose map and dominant scattering centers... During inference, the same representation is rendered directly from a 3D CAD model under specified azimuth, depression angle, and polarization conditions

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

  1. [1]

    A tutorial on synthetic aperture radar,

    A. Moreira, P. Prats-Iraola, M. Younis, G. Krieger, I. Hajnsek, and K. P. Papathanassiou, “A tutorial on synthetic aperture radar,”IEEE Geoscience and Remote Sensing Magazine, vol. 1, no. 1, pp. 6–43, 2013

  2. [2]

    Deep learning meets SAR: Concepts, models, pitfalls, and perspectives,

    X. X. Zhu et al., “Deep learning meets SAR: Concepts, models, pitfalls, and perspectives,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 4, pp. 143–172, 2021

  3. [3]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, pp. 436–444, 2015. 20

  4. [4]

    Target classification using the deep convolutional networks for SAR images,

    S. Chen, H. Wang, F. Xu, and Y. Q. Jin, “Target classification using the deep convolutional networks for SAR images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4806–4817, 2016

  5. [5]

    Target recognition of SAR images via few-shot learning,

    B. Ding, G. Wen, C. Ma, X. Yang, and C. Qiu, “Target recognition of SAR images via few-shot learning,”Remote Sensing, vol. 12, no. 1, p. 165, 2020

  6. [6]

    SAR target recognition using zero-shot learning,

    F. Zhang, C. Hu, Q. Yin, W. He, C. Li, and N. M. Robertson, “SAR target recognition using zero-shot learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2021

  7. [7]

    Sparse-azimuth SAR target recognition via feature de- coupling,

    Y. Wang, C. Wang, and H. Zhang, “Sparse-azimuth SAR target recognition via feature de- coupling,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022

  8. [8]

    Few-shot sparse-azimuth SAR target recognition,

    H. Liu, X. Zhang, and Y. Li, “Few-shot sparse-azimuth SAR target recognition,”IEEE Trans- actions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023

  9. [9]

    Generative adversarial nets,

    I. Goodfellow et al., “Generative adversarial nets,” inAdvances in Neural Information Pro- cessing Systems, pp. 2672–2680, 2014

  10. [10]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh et al., “Hierarchical text-conditional image generation with CLIP latents,” arXiv:2204.06125, 2022

  11. [11]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

  12. [12]

    High-resolution image syn- thesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image syn- thesis with latent diffusion models,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 10684–10695

  13. [13]

    SAR image generation using generative adversarial networks: A comprehensive review,

    J. Guo et al., “SAR image generation using generative adversarial networks: A comprehensive review,”IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 2, pp. 110–130, 2021

  14. [14]

    Diffusion models for SAR image generation: A promising paradigm,

    Y. Ji, X. Sun, and M. Wang, “Diffusion models for SAR image generation: A promising paradigm,”IEEE Geoscience and Remote Sensing Letters, vol. 20, pp. 1–5, 2023

  15. [15]

    I. G. Cumming and J. R. Wong,Digital Processing of Synthetic Aperture Radar Data: Algo- rithms and Implementation. Norwood, MA, USA: Artech House, 2005

  16. [16]

    Physical-aware SAR image generation via geometric priors,

    Z. Zhao, Q. Li, and J. Zhang, “Physical-aware SAR image generation via geometric priors,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 200, pp. 120–135, 2023

  17. [17]

    Mitigating hallucinations in SAR image generation under sparse observation,

    M. Xia, Z. Chen, and L. Huang, “Mitigating hallucinations in SAR image generation under sparse observation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  18. [18]

    Hallucination in conditional diffusion models,

    L. Wang et al., “Hallucination in conditional diffusion models,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1234–1243, 2023

  19. [19]

    GeoDiff-SAR: A geometric prior guided diffusion model for SAR image generation,

    F. Zhang, X. Wu, F. Ma, Q. Yin, and Y. Hu, “GeoDiff-SAR: A geometric prior guided diffusion model for SAR image generation,”arXiv:2601.03499, 2026

  20. [20]

    3D-aware image generation,

    J. Kim et al., “3D-aware image generation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4567–4576, 2020. 21

  21. [21]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 3836–3847, 2023

  22. [22]

    [Online]

    Black Forest Labs, “FLUX.1,” 2024. [Online]. Available:https://blackforestlabs.ai

  23. [23]

    SAR raw signal simulation of complex scenes,

    G. Franceschetti, M. Migliaccio, D. Riccio, and G. Schirinzi, “SAR raw signal simulation of complex scenes,”IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 1, pp. 323–331, 2003

  24. [24]

    Ray tracing simulation techniques for understanding high- resolution SAR images,

    S. Auer, S. Hinz, and R. Bamler, “Ray tracing simulation techniques for understanding high- resolution SAR images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 3, pp. 1445–1456, 2009

  25. [25]

    Scattering center analysis of man-made targets,

    E. Schreiber and I. J. Gupta, “Scattering center analysis of man-made targets,”IEEE Trans- actions on Antennas and Propagation, vol. 53, no. 1, pp. 178–184, 2005

  26. [26]

    Generative adversarial networks for SAR image realism,

    B. Lewis et al., “Generative adversarial networks for SAR image realism,” inProc. SPIE, vol. 10987, 2019, Art. no. 109870V

  27. [27]

    Image-to-image translation with conditional adversarial networks,

    P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1125–1134

  28. [28]

    Unpaired image-to-image translation using cycle-consistent adversarial networks,

    J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” inProc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2223–2232

  29. [29]

    Wasserstein GAN,

    M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” inProc. Int. Conf. Mach. Learn. (ICML), 2017, pp. 214–223

  30. [30]

    Taming transformers for high-resolution image syn- thesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image syn- thesis,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 12873– 12883

  31. [31]

    Physics-informed machine learning,

    G. E. Karniadakis et al., “Physics-informed machine learning,”Nature Reviews Physics, vol. 3, no. 6, pp. 422–440, 2021

  32. [32]

    Improving SAR automatic target recognition using simulated data,

    D. Malmgren-Hansen, A. Kusk, J. Dall, et al., “Improving SAR automatic target recognition using simulated data,” inProc. IEEE Radar Conf., 2017, pp. 1150–1153

  33. [33]

    Scattering characteristics guided network for ISAR space target component segmentation,

    F. Zhong, F. Gao, T. Liu, et al., “Scattering characteristics guided network for ISAR space target component segmentation,”IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2025

  34. [34]

    FiLM: Visual reasoning with a general conditioning layer,

    E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” inProc. AAAI Conf. Artif. Intell., 2018

  35. [35]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” inProc. Int. Conf. Learn. Representations (ICLR), 2022

  36. [36]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 4199–4209. 22

  37. [37]

    VQGAN-CLIP: Open domain image generation and editing with natural language guidance

    K. Crowson, S. Biderman, D. Hall, E. D. Tow, and C. Lawrence, “VQGAN-CLIP: Open domain image generation and editing with natural language guidance,”arXiv:2204.08583, 2022

  38. [38]

    IF-I-XL-v1.0,

    DeepFloyd, “IF-I-XL-v1.0,” 2023. [Online]. Available:https://huggingface.co/DeepFloyd/ IF-I-XL-v1.0

  39. [39]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    J. Chen et al., “PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,”arXiv:2310.00426, 2023

  40. [40]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell et al., “SDXL: Improving latent diffusion models for high-resolution image synthesis,” arXiv:2307.01952, 2023

  41. [41]

    Stable Diffusion 3.5 Medium,

    Stability AI, “Stable Diffusion 3.5 Medium,” 2024. [Online]. Available:https:// huggingface.co/stabilityai/stable-diffusion-3.5-medium. 23