pith. machine review for the scientific record.

arxiv: 2604.11521 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.CV

Recognition: unknown

Continuous Adversarial Flow Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:22 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continuous flow models · adversarial training · flow matching · image generation · FID evaluation · post-training · discriminator · generative modeling

The pith

Training continuous flow models with a learned discriminator rather than mean-squared error produces samples better aligned with the target data distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces continuous adversarial flow models, a variant of continuous-time flow models that replaces the standard mean-squared-error objective of flow matching with an adversarial loss from a learned discriminator. This shift in training criterion changes the induced distribution and yields samples that more closely match the target data, as demonstrated through post-training of existing models. The method delivers large gains on ImageNet 256px (256×256) generation for both latent-space and pixel-space flow models, and extends to text-to-image tasks with improved benchmark results. Readers working in generative modeling should care because flow-based approaches are a major alternative to diffusion models, and any objective that tightens distribution alignment directly affects output quality.

Core claim

Continuous adversarial flow models are continuous-time flow models trained with an adversarial objective supplied by a learned discriminator instead of the fixed mean-squared-error criterion used in flow matching. This change induces a different generalized distribution that empirically aligns better with the target data distribution. The approach is proposed primarily for post-training existing flow-matching models such as SiT and JiT, although it can also be used to train models from scratch, and is validated by substantial FID reductions on ImageNet 256px generation together with gains on GenEval and DPG for text-to-image tasks.

What carries the argument

A learned discriminator that replaces the fixed mean-squared-error loss and supplies an adversarial training signal for the continuous-time flow model.
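The contrast between the fixed regression target of flow matching and a discriminator-supplied signal can be sketched in a few lines. This is a minimal numpy illustration, not the paper's actual objective: the interpolant and velocity target follow standard flow matching, while `adversarial_loss` uses a generic non-saturating GAN form, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup: x0 ~ noise, x1 ~ "data"; the linear interpolant of flow matching.
x0 = rng.normal(size=1000)            # noise samples
x1 = rng.normal(loc=3.0, size=1000)   # data samples
t = rng.uniform(size=1000)
xt = (1.0 - t) * x0 + t * x1          # interpolated points x_t
v_target = x1 - x0                    # conditional velocity target

def fm_loss(v_pred, v_target):
    """Flow matching: a fixed mean-squared-error regression onto the velocity."""
    return np.mean((v_pred - v_target) ** 2)

def adversarial_loss(d_on_generated):
    """Non-saturating generator loss from a learned discriminator's logits.
    The discriminator replaces the fixed MSE criterion: the training signal is
    whatever D currently finds distinguishable, not a pointwise target."""
    return np.mean(np.log1p(np.exp(-d_on_generated)))  # softplus(-D)

v_pred = np.full_like(v_target, 3.0)  # a constant-velocity guess
print(fm_loss(v_pred, v_target))
print(adversarial_loss(rng.normal(size=1000)))
```

The key difference the pith points at: `fm_loss` is minimized at a fixed regression target, while `adversarial_loss` depends on a co-trained D, so the induced distribution can differ.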

Load-bearing premise

The learned discriminator must supply a stable, non-collapsing training signal that genuinely improves alignment with the target distribution rather than introducing new artifacts or instability.
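One concrete way to check this premise is to monitor the discriminator's gradient norm (the paper's Figure 12 plots exactly that) and, optionally, regularize it. The sketch below uses a finite-difference gradient on a toy 1-D discriminator plus an R1-style penalty; whether the paper uses R1 or any gradient penalty is an assumption here, not something stated above.

```python
import numpy as np

def grad_norm(d, x, eps=1e-5):
    """Central-difference estimate of |dD/dx| for a scalar discriminator d
    on a batch of 1-D inputs; in practice this would come from autograd."""
    return np.abs((d(x + eps) - d(x - eps)) / (2 * eps))

def r1_penalty(d, x_real):
    """R1-style regularizer: mean squared gradient norm on real samples.
    A standard way to keep a discriminator's signal from blowing up."""
    return np.mean(grad_norm(d, x_real) ** 2)

d = lambda x: np.tanh(2.0 * x)   # stand-in discriminator
x_real = np.linspace(-1, 1, 101)
print(r1_penalty(d, x_real))     # bounded value => no exploding gradient signal
```

A flat gradient-norm curve over training is evidence for a stable, non-collapsing signal; a diverging one is the failure mode the premise rules out.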

What would settle it

Applying the post-training procedure to a flow model such as SiT on ImageNet and observing either no reduction in guidance-free FID or outright training collapse would indicate that the adversarial objective does not deliver the claimed alignment benefit.
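The settling metric here is guidance-free FID: the Fréchet distance between Gaussians fitted to feature statistics of generated and reference images. A minimal sketch under a simplifying assumption of diagonal covariances (real FID uses full Inception-feature covariances and a matrix square root):

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets,
    simplified to diagonal covariances:
        ||mu_a - mu_b||^2 + sum(var_a + var_b - 2*sqrt(var_a * var_b))
    Lower means the two distributions are better aligned."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    va, vb = feats_a.var(0), feats_b.var(0)
    return float(np.sum((mu_a - mu_b) ** 2) + np.sum(va + vb - 2 * np.sqrt(va * vb)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(10000, 8))
close = rng.normal(0.1, 1.0, size=(10000, 8))   # slightly misaligned samples
far = rng.normal(1.0, 2.0, size=(10000, 8))     # badly misaligned samples
print(fid_diagonal(real, close) < fid_diagonal(real, far))  # lower = better aligned
```

The paper's reported drops (e.g., SiT guidance-free FID 8.26 → 3.63) are exactly this kind of distance shrinking under the changed objective.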

Figures

Figures reproduced from arXiv: 2604.11521 by Ceyuan Yang, Hao Chen, Haoqi Fan, Shanchuan Lin, Zhijie Lin.

Figure 1
Figure 1: Generation without guidance. Our method yields better generalization.
Figure 2
Figure 2: Visualization of the training dynamic. Top: learned G(xt, t) trajectories over the probability flow. Bottom: corresponding −D(xt, t) values at all xt. −D is taken for more intuitive visualization, as the generation process runs backward in time. (a) shows that if D is trained only with v̄t as positives, without G as negatives, D degenerates to a uniform gradient. (b,c) show how D reacts to G during training. (d) show…
Figure 3
Figure 3: Curated text-to-image samples on GenEval prompts, without PE and CFG, to show the most diverse range of samples. Left is FM; right is CAFM. More visualizations are in…
Figure 4
Figure 4: In Fig. 4c, we experiment with lowering λot to 1 from 160 epochs onward and see further FID improvement, while λot = 4 eventually plateaus. This shows the importance of decreasing λot over training, concurring with the findings of AFM. In Fig. 4d, we further increase N to 8 at 700 epochs and see faster convergence at the later stage. Note that we have swept other changes during training, including decreasing the learnin…
Figure 6
Figure 6: SiT-XL/2 guidance-free, latent-space ImageNet 256px generation. Top is FM (FID 8.26); bottom is CAFM (FID 3.63). Uncurated. We highlight samples with visible improvements in red.
Figure 7
Figure 7: SiT-XL/2 guided, latent-space ImageNet 256px generation. Top is FM (CFG 1.5, FID 2.06); bottom is CAFM (CFG 1.3, FID 1.53). Uncurated.
Figure 8
Figure 8: JiT-H/16 guidance-free, pixel-space ImageNet 256px generation. Top is FM (FID 7.17); bottom is CAFM (FID 3.57). Uncurated. We highlight samples with visible improvements in red.
Figure 9
Figure 9: JiT-H/16 guided, pixel-space ImageNet 256px generation. Top is FM (CFG 2.2, FID 1.86); bottom is CAFM (CFG 1.8, FID 1.80). Uncurated.
Figure 10
Figure 10: Curated text-to-image comparisons on DPG benchmark prompts. Prompts are shortened for paper presentation. (4 parts)
Figure 11
Figure 11: Failure cases for guidance-free text-to-image generation.
Figure 12
Figure 12: Discriminator gradient norm.
Original abstract

We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces continuous adversarial flow models, a variant of continuous-time flow models trained via an adversarial objective using a learned discriminator rather than the fixed MSE loss of flow matching. The method is primarily intended for post-training existing flow-matching models (e.g., SiT and JiT) to induce a different generalized distribution that better aligns with the target data. Empirical results on ImageNet 256px report substantial FID reductions (guidance-free: SiT 8.26→3.63, JiT 7.17→3.57; guided: SiT 2.06→1.53, JiT 1.86→1.80) and improved scores on GenEval and DPG for text-to-image generation.

Significance. If the reported FID gains are causally attributable to the adversarial objective rather than additional optimization steps, the approach could provide a practical post-training refinement technique for flow-based generative models, potentially improving sample quality without relying on classifier-free guidance. The empirical gains on standard benchmarks are notable, but the absence of controls for training duration leaves the mechanistic contribution of the discriminator unestablished.

major comments (2)
  1. [Experimental results on ImageNet (post-training protocol)] The central claim that the adversarial discriminator induces a better-aligned generalized distribution (and thus the observed FID drops) is not supported by a necessary control: an ablation continuing the identical base model (SiT or JiT) for the same number of post-training steps using only the original flow-matching/MSE objective, identical optimizer, schedule, and batch size. Without this, the improvements (e.g., SiT guidance-free FID 8.26 to 3.63) could arise from extra gradient steps alone, rendering the discriminator incidental.
  2. [Method and training details] The manuscript provides insufficient training details and ablations for the discriminator (architecture, training schedule relative to the flow model, loss weighting, stability measures) and for the post-training procedure itself. This makes it impossible to assess whether the learned discriminator supplies a stable, non-collapsing signal or introduces new artifacts, directly undermining evaluation of the weakest assumption in the central claim.
minor comments (2)
  1. [Method] Notation for the continuous-time flow and adversarial objective should be clarified with explicit equations distinguishing the discriminator-augmented loss from standard flow matching.
  2. [Text-to-image experiments] The text-to-image results on GenEval and DPG would benefit from reporting the exact base models, guidance scales, and number of post-training steps for direct comparison.
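The matched-budget control demanded in major comment 1 can be written down as an experiment manifest: two runs sharing every hyperparameter except the loss. All names and values below are hypothetical; this is a sketch of the protocol, not the authors' tooling.

```python
# Matched-compute control: same base checkpoint, optimizer, schedule, and step
# count; only the training objective differs between the two runs.
base = dict(init="sit_xl2_fm.ckpt", optimizer="adamw", lr=1e-4,
            batch_size=256, post_train_steps=50_000)

runs = {
    "control_fm_continue": {**base, "loss": "flow_matching_mse"},
    "cafm_post_train":     {**base, "loss": "adversarial"},
}

# Sanity check: the runs differ in nothing but the loss, so any FID gap is
# attributable to the objective rather than to extra gradient steps.
for cfg in runs.values():
    assert {k: v for k, v in cfg.items() if k != "loss"} == base
print(sorted(runs))
```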

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that additional controls and details are needed to strengthen the claims regarding the contribution of the adversarial objective. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Experimental results on ImageNet (post-training protocol)] The central claim that the adversarial discriminator induces a better-aligned generalized distribution (and thus the observed FID drops) is not supported by a necessary control: an ablation continuing the identical base model (SiT or JiT) for the same number of post-training steps using only the original flow-matching/MSE objective, identical optimizer, schedule, and batch size. Without this, the improvements (e.g., SiT guidance-free FID 8.26 to 3.63) could arise from extra gradient steps alone, rendering the discriminator incidental.

    Authors: We agree that this control is necessary to establish causality. In the revised manuscript, we will add results from continuing training of the base SiT and JiT models for the same number of post-training steps using only the original flow-matching MSE objective, with identical optimizer, learning rate schedule, and batch size. These results will be reported alongside the adversarial post-training outcomes to isolate the effect of the discriminator. revision: yes

  2. Referee: [Method and training details] The manuscript provides insufficient training details and ablations for the discriminator (architecture, training schedule relative to the flow model, loss weighting, stability measures) and for the post-training procedure itself. This makes it impossible to assess whether the learned discriminator supplies a stable, non-collapsing signal or introduces new artifacts, directly undermining evaluation of the weakest assumption in the central claim.

    Authors: We acknowledge that the current manuscript lacks sufficient implementation details. In the revision, we will expand the Methods section with the discriminator architecture, the precise training schedule (including how it interleaves with the flow model), loss weighting coefficients, and any regularization or stability techniques used. We will also include targeted ablations on these hyperparameters to demonstrate that the discriminator provides a stable training signal without introducing artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical claims or derivation

full rationale

The paper introduces an adversarial objective for continuous flow models and supports its claims through direct empirical comparisons of FID scores on ImageNet 256px and other benchmarks (e.g., SiT guidance-free FID dropping from 8.26 to 3.63). These are measured outcomes against external baselines rather than quantities derived from fitted parameters, self-referential equations, or load-bearing self-citations. No self-definitional steps, fitted-input predictions, uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described method; the central result is an observed improvement from the changed training objective, which remains independently falsifiable via the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard assumptions from generative modeling literature with no new free parameters, axioms, or invented entities explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1128 out tokens · 23523 ms · 2026-05-10T15:22:20.482489+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

86 extracted references · 41 canonical work pages · 18 internal anchors

  1. Ai, Y., Han, J., Zhuang, S., Mao, W., Hu, X., Yang, Z., Yang, Z., Huang, H., Yue, X., Chen, H.: Bitdance: Scaling autoregressive generative models with binary tokens. arXiv preprint arXiv:2602.14041 (2026)
  2. Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
  3. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning. pp. 214–223. PMLR (2017)
  4. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
  5. Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025). https://bfl.ai/research/representation-comparison
  6. Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025)
  7. Chen, R.T., Lipman, Y.: Flow matching on general geometries. arXiv preprint arXiv:2302.03660 (2023)
  8. Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. Advances in Neural Information Processing Systems 31 (2018)
  9. Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: PixelFlow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)
  10. Chen, T., Xu, B., Zhang, C., Guestrin, C.: Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)
  11. Choudhury, R., Lin, S., Wang, J., Chen, H., Zhao, Q., Cheng, F., Jiang, L., Kitani, K., Jeni, L.A.: SkipSR: Faster super resolution with token skipping. arXiv preprint arXiv:2510.08799 (2025)
  12. De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y.W., Doucet, A.: Riemannian score-based generative modelling. Advances in Neural Information Processing Systems 35, 2406–2422 (2022)
  13. Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
  14. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
  15. Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)
  16. Ghosh, D., Hajishirzi, H., Schmidt, L.: GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems 36, 52132–52152 (2023)
  17. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014)
  18. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
  19. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems 30 (2017)
  20. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
  21. Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021)
  22. Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: End-to-end diffusion for high resolution images. In: International Conference on Machine Learning. pp. 13213–13232. PMLR (2023)
  23. Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., Salimans, T.: Simpler diffusion: 1.5 FID on ImageNet512 with pixel-space diffusion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18062–18071 (2025)
  24. Hu, V.T., Chen, Y., Caron, M., Asano, Y.M., Snoek, C.G., Ommer, B.: Guided diffusion from self-supervised diffusion features. arXiv preprint arXiv:2312.08825 (2023)
  25. Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)
  26. Huang, N., Gokaslan, A., Kuleshov, V., Tompkin, J.: The GAN is dead; long live the GAN! A modern GAN baseline. Advances in Neural Information Processing Systems 37, 44177–44215 (2024)
  27. Hudson, D.A., Zitnick, L.: Generative adversarial transformers. In: International Conference on Machine Learning. pp. 4487–4499. PMLR (2021)
  28. Hyun, S., Lee, M., Heo, J.P.: Scalable GANs with transformers. arXiv preprint arXiv:2509.24935 (2025)
  29. Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734 (2018)
  30. Kang, M., Zhang, R., Barnes, C., Paris, S., Kwak, S., Park, J., Shechtman, E., Zhu, J.Y., Park, T.: Distilling diffusion models into conditional GANs. In: European Conference on Computer Vision. pp. 428–447. Springer (2024)
  31. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018)
  32. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., Aila, T.: Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems 33, 12104–12114 (2020)
  33. Karras, T., Aittala, M., Kynkäänniemi, T., Lehtinen, J., Aila, T., Laine, S.: Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems 37, 52996–53021 (2024)
  34. Kim, D., Kim, Y., Kwon, S.J., Kang, W., Moon, I.C.: Refining generative process with discriminator guidance in score-based diffusion models. In: International Conference on Machine Learning. pp. 16567–16598. PMLR (2023)
  35. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  36. Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., et al.: PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020)
  37. Li, T., He, K.: Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720 (2025)
  38. Lim, J.H., Ye, J.C.: Geometric GAN. arXiv preprint arXiv:1705.02894 (2017)
  39. Lin, S., Liu, B., Li, J., Yang, X.: Common diffusion noise schedules and sample steps are flawed. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5404–5411 (2024)
  40. Lin, S., Wang, A., Yang, X.: SDXL-Lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929 (2024)
  41. Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. In: Forty-second International Conference on Machine Learning (2025)
  42. Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., Jiang, L.: Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350 (2025)
  43. Lin, S., Yang, C., Lin, Z., Chen, H., Fan, H.: Adversarial flow models. arXiv preprint arXiv:2511.22475 (2025)
  44. Lin, S., Yang, X.: Diffusion model with perceptual loss. arXiv preprint arXiv:2401.00110 (2023)
  45. Lin, S., Yang, X.: AnimateDiff-Lightning: Cross-model diffusion distillation. arXiv preprint arXiv:2403.12706 (2024)
  46. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023)
  47. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023)
  48. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  49. Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081 (2024)
  50. Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
  51. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802 (2017)
  52. Mathieu, E., Nickel, M.: Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems 33, 2503–2515 (2020)
  53. Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for GANs do actually converge? In: International Conference on Machine Learning. pp. 3481–3490. PMLR (2018)
  54. Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems 29 (2016)
  55. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  56. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
  57. Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., Xiao, X.: Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems 37, 117340–117362 (2024)
  58. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
  59. Roth, K., Lucchi, A., Nowozin, S., Hofmann, T.: Stabilizing training of generative adversarial networks through regularization. Advances in Neural Information Processing Systems 30 (2017)
  60. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
  61. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Advances in Neural Information Processing Systems 29 (2016)
  62. Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P., Rombach, R.: Fast high-resolution image synthesis with latent adversarial diffusion distillation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
  63. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)
  64. Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–10 (2022)
  65. Seawead, T., Yang, C., Lin, Z., Zhao, Y., Lin, S., Ma, Z., Guo, H., Chen, H., Qi, L., Wang, S., et al.: Seaweed-7B: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685 (2025)
  66. Seedance, T., Chen, H., Chen, S., Chen, X., Chen, Y., Chen, Y., Chen, Z., Cheng, F., Cheng, T., Cheng, X., et al.: Seedance 1.5 Pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507 (2025)
  67. Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
  68. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021)
  69. Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun, Y., Xie, S.: Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208 (2026)
  70. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  71. Wang, F.Y., Huang, Z., Bergman, A.W., Shen, D., Gao, P., Lingelbach, M., Sun, K., Bian, W., Song, G., Liu, Y., et al.: Phased consistency models. Advances in Neural Information Processing Systems 37, 83951–84009 (2024)
  72. Wang, J., Lin, S., Lin, Z., Ren, Y., Wei, M., Yue, Z., Zhou, S., Chen, H., Zhao, Y., Yang, C., et al.: SeedVR2: One-step video restoration via diffusion adversarial post-training. arXiv preprint arXiv:2506.05301 (2025)
  73. Wang, R., He, K.: Diffuse and disperse: Image generation with representation regularization. arXiv preprint arXiv:2506.09027 (2025)
  74. Wang, S., Gao, Z., Zhu, C., Huang, W., Wang, L.: PixNerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268 (2025)
  75. Wang, S., Tian, Z., Huang, W., Wang, L.: DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741 (2025)
  76. Xu, Y., Wu, Y., Park, S., Zhou, Z., Tulsiani, S.: Temporal score rescaling for temperature sampling in diffusion and flow models. arXiv preprint arXiv:2510.01184 (2025)
  77. Xu, Y., Zhao, Y., Xiao, Z., Hou, T.: UFOGen: You forward once large scale text-to-image generation via diffusion GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8196–8206 (2024)
  78. Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
  79. Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)
  80. Zhang, B., Sennrich, R.: Root mean square layer normalization. Advances in Neural Information Processing Systems 32 (2019)

Showing first 80 references.