
arxiv: 2512.02012 · v2 · submitted 2025-12-01 · 💻 cs.CV · cs.LG

Improved Mean Flows: On the Challenges of Fastforward Generative Models

Pith reviewed 2026-05-17 02:14 UTC · model grok-4.3

keywords MeanFlow · one-step generation · ImageNet 256x256 · fastforward models · classifier-free guidance · velocity prediction · generative modeling

The pith

Reformulated MeanFlow training enables 1.72 FID one-step generation on ImageNet 256x256 from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper targets two core difficulties in MeanFlow for one-step image generation. It recasts the training objective as a regression loss on instantaneous velocity that is re-parameterized by a network predicting average velocity, creating a more standard and stable optimization problem. It also converts fixed guidance into explicit conditioning variables handled through in-context processing, preserving test-time flexibility while shrinking model size. If these changes hold, single-evaluation models become competitive with multi-step methods on large datasets without any distillation step.

Core claim

The improved MeanFlow (iMF) recasts the original training target, which depended on both ground-truth fields and the network itself, into a loss on instantaneous velocity v re-parameterized by a network predicting average velocity u. This yields a standard regression problem that improves training stability. Guidance is formulated as explicit conditioning variables processed via in-context conditioning, retaining flexibility at test time. Trained entirely from scratch, iMF reaches 1.72 FID with a single function evaluation on ImageNet 256×256, outperforming prior one-step methods and closing the gap to multi-step approaches without distillation.
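
The equivalence between the original objective and a v-loss can be sketched in a few lines of algebra. The notation below is an assumption reconstructed from the standard MeanFlow definitions (the figure captions on this page refer to the MeanFlow identity as Eq. (8) without reproducing it): z_t is the state at time t, sg(·) is stop-gradient, and v is the regression target velocity. The paper's own notation and equation numbering may differ.

```latex
% Average velocity over [r, t] and the MeanFlow identity
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u = v\, \partial_z u + \partial_t u .

% Original MF written as a u-loss: the target depends on the network
\mathcal{L}_{\mathrm{MF}} = \bigl\| u_\theta - \mathrm{sg}(u_{\mathrm{tgt}}) \bigr\|^2,
\qquad
u_{\mathrm{tgt}} = v - (t - r)\, \frac{d}{dt} u_\theta .

% The same quantity rearranged as a v-loss with a compound prediction
V_\theta = u_\theta + (t - r)\, \mathrm{sg}\!\Bigl( \frac{d}{dt} u_\theta \Bigr),
\qquad
\mathcal{L}_{\mathrm{MF}} = \bigl\| V_\theta - v \bigr\|^2 .
```

Substituting u_tgt into the first loss gives u_θ − v + (t − r) sg(d/dt u_θ), which is V_θ − v, so the two forms agree term by term. At t = r the coefficient (t − r) vanishes and both reduce to a plain regression of u_θ onto v.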

What carries the argument

Re-parameterization of the objective as a regression loss on instantaneous velocity v via a network predicting average velocity u, combined with explicit guidance scales as in-context conditioning variables.
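
The relationship between average velocity u and instantaneous velocity v can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes a one-dimensional ODE dz/dt = z, for which the exact average velocity is known in closed form, and it assumes the MeanFlow identity in its commonly stated form u = v − (t − r) du/dt. All function names are hypothetical.

```python
import math

def v(z, t):
    """Instantaneous velocity of the toy ODE dz/dt = z (illustrative assumption)."""
    return z

def u(z, r, t):
    """Exact average velocity over [r, t], given the state z at time t."""
    if t == r:
        return v(z, t)             # the average over a zero-length interval is v
    z_r = z * math.exp(r - t)      # state at time r on the same trajectory
    return (z - z_r) / (t - r)     # displacement divided by elapsed time

def total_derivative_u(z, r, t, h=1e-5):
    """d/dt of u along the trajectory: v * du/dz + du/dt, by central differences."""
    du_dz = (u(z + h, r, t) - u(z - h, r, t)) / (2 * h)
    du_dt = (u(z, r, t + h) - u(z, r, t - h)) / (2 * h)
    return v(z, t) * du_dz + du_dt

def identity_residual(z, r, t):
    """u - [v - (t - r) * du/dt]; close to zero when the identity holds."""
    return u(z, r, t) - (v(z, t) - (t - r) * total_derivative_u(z, r, t))

def one_step(z_t, r, t):
    """Jump from time t to time r using a single evaluation of u."""
    return z_t - (t - r) * u(z_t, r, t)

if __name__ == "__main__":
    print("identity residual:", identity_residual(1.3, 0.2, 0.9))
    print("one-step result:  ", one_step(2.0, 0.0, 1.0))
    print("exact ODE result: ", 2.0 * math.exp(-1.0))
```

With an exact u, a single evaluation lands on the ODE solution, which is the property a 1-NFE sampler relies on. A trained network only approximates u, so the jump is only as accurate as that approximation.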

If this is right

  • One-step generative models can be trained stably from scratch on large-scale image datasets like ImageNet.
  • Guidance scale remains adjustable at inference time without retraining the model.
  • In-context conditioning reduces model size while maintaining or improving performance.
  • Single-evaluation generation reaches FID scores competitive with multi-step methods.
  • High-quality fastforward generation is possible without distillation.
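
A minimal sketch of how a guidance scale could be treated as an input rather than a fixed training constant, with conditions passed as extra tokens. Everything here is illustrative: the guidance mix uses the common convention in which ω = 1 recovers the conditional velocity, the sampling range for ω is a placeholder, and `embed_scalar`, `build_sequence`, and `strip_condition_tokens` are hypothetical stand-ins for learned components that this summary does not describe.

```python
import math
import random

def cfg_velocity(v_cond, v_uncond, omega):
    """Guided velocity; omega = 1 gives v_cond, omega = 0 gives v_uncond."""
    return [omega * c + (1.0 - omega) * n for c, n in zip(v_cond, v_uncond)]

def sample_omega(rng, low=1.0, high=8.0):
    """Draw one guidance scale per training example (range is a placeholder)."""
    return rng.uniform(low, high)

def embed_scalar(x, dim=4):
    """Hypothetical stand-in for a learned embedding of a scalar condition."""
    return [math.sin((k + 1) * x) for k in range(dim)]

def build_sequence(patch_tokens, class_id, t, r, omega, dim=4):
    """Prepend one token per condition to the patch tokens (in-context conditioning)."""
    cond_tokens = [embed_scalar(float(c), dim) for c in (class_id, t, r, omega)]
    return cond_tokens + patch_tokens, len(cond_tokens)

def strip_condition_tokens(sequence, num_cond):
    """Keep only the patch positions once the sequence has been processed."""
    return sequence[num_cond:]

if __name__ == "__main__":
    rng = random.Random(0)
    omega = sample_omega(rng)                 # varies per example during training
    patches = [[0.0] * 4 for _ in range(3)]   # three dummy patch tokens
    seq, n = build_sequence(patches, class_id=7, t=0.9, r=0.2, omega=omega)
    print(len(seq), n, len(strip_condition_tokens(seq, n)))
    print(cfg_velocity([1.0, 2.0], [0.5, 0.5], omega=2.0))
```

Because ω enters as an input, the same weights can be queried with a different ω at inference time, which is what makes the scale adjustable without retraining.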

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The velocity re-parameterization could be tested in other flow-matching or velocity-based generative frameworks to check for similar stability gains.
  • Explicit in-context conditioning might extend to text-to-image or video generation to improve flexibility in those settings.
  • Combining this approach with model compression techniques could enable real-time single-step generation on resource-limited hardware.
  • Experiments at higher resolutions or on different data modalities would test whether the performance scaling holds beyond 256×256 images.

Load-bearing premise

Recasting the training objective as a loss on instantaneous velocity, expressed through a network that predicts average velocity, yields a standard regression problem that improves stability without introducing bias or new instabilities.

What would settle it

A side-by-side training run on ImageNet 256×256 in which the re-parameterized loss shows greater instability or a worse final FID than the original MeanFlow objective would falsify the claimed stability and performance gains.

Figures

Figures reproduced from arXiv: 2512.02012 by Eli Shechtman, J. Zico Kolter, Kaiming He, Yiyang Lu, Zhengyang Geng, Zongze Wu.

Figure 1
Figure 1: Conceptual comparison. Original MeanFlow (MF) [12] predicts average velocity u by a network uθ. As the ground-truth u is unknown, original MF substitutes u with the network’s own prediction. We show that the original MF objective is equivalent to a loss on the instantaneous velocity v (namely, v-loss), but re-parameterized by the neural network uθ (namely, u-pred), as shown in (a). This re-parameterization…
Figure 2
Figure 2: MeanFlow as v-loss. Original MeanFlow (MF) [12] models the average velocity u and trains the network uθ via a u-loss parameterized by uθ itself. We show that MF can be reformulated as a v-loss re-parameterized by uθ, driven by the MeanFlow identity in Eq. (8).
Figure 3
Figure 3: Training losses. We examine the loss of samples only with t ≠ r, since a batch also contains samples of t = r, for which the JVP term becomes zero due to its coefficient (t − r). Both MF and iMF can be viewed as v-loss, using different forms of compound Vθ. Original MF’s loss is non-decreasing and has high variance. (Settings: MeanFlow-B/2, trained with basic ℓ2 loss with no adaptive weighting, and with n…
Figure 4
Figure 4: Optimal CFG scales shift under different settings. In general, a stronger setting has a smaller optimal CFG scale, as reflected by increased training epochs (left) and inference steps (right). This investigation is enabled by our flexible CFG-conditioning, where a single model can support varying CFG scales even in the single/few-NFE case. (Settings: iMF-B/2 on ImageNet 256×256.)
Figure 6
Figure 6: FID curves during training. The original MeanFlow-B/2 baseline has a 1-NFE FID of 6.17. Using the improved training objective (Sec. 4.1), FID improves to 5.68. Incorporating flexible CFG conditioning (Sec. 4.2) reduces FID to 4.57. Replacing adaLN-zero with in-context conditioning (Sec. 4.3) further improves FID to 4.09. See also Tab. 1.
Figure 7
Figure 7: Qualitative results of 1-NFE generation on ImageNet 256×256. We show uncurated results on the three classes listed here; more are in appendix. The model is iMF-XL/2.
Figure 8
Figure 8: Uncurated 1-NFE class-conditional generation samples of iMF-XL/2 on ImageNet 256×256.
Figure 9
Figure 9: Uncurated 1-NFE class-conditional generation samples of iMF-XL/2 on ImageNet 256×256.
original abstract

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents 'improved MeanFlow' (iMF), an enhancement to the MeanFlow framework for one-step (fastforward) generative modeling. The authors identify two challenges: (1) the original training target depending on the network itself, addressed by recasting the objective as a loss on instantaneous velocity v re-parameterized via a network predicting average velocity u; (2) fixed classifier-free guidance scale, addressed by explicit conditioning variables processed through in-context conditioning. They report that iMF, trained from scratch, achieves an FID of 1.72 on ImageNet 256×256 using a single function evaluation (1-NFE), substantially outperforming prior one-step methods and closing the gap with multi-step approaches without using distillation.

Significance. If the re-parameterized objective is mathematically equivalent to the original MeanFlow loss and the learned model satisfies the mean-flow consistency condition without bias, this work would represent a meaningful step forward in developing efficient, standalone one-step generative models. The reported 1.72 FID score with 1-NFE is competitive and highlights the potential of fastforward paradigms. The use of in-context conditioning for flexible guidance is a practical contribution that could benefit other conditional generation tasks.

major comments (2)
  1. [Method section (training objective reformulation)] The reformulation of the training objective as a loss on instantaneous velocity v, re-parameterized by a network predicting average velocity u, is presented as yielding a more standard regression problem that improves stability. However, no derivation is provided showing that this re-parameterization is equivalent to the original MeanFlow objective or that the mapping from u to v preserves the fixed point exactly without introducing bias or approximation error. This is load-bearing for the central claim, as the 1.72 FID result is reported for a model trained with the new objective.
  2. [Experiments section] Experimental results: The manuscript reports a concrete 1.72 FID for 1-NFE on ImageNet 256×256 but provides no error bars, ablation studies isolating the effect of the re-parameterization versus the guidance changes, or verification that the learned flow satisfies the original mean-flow consistency condition. These omissions make it difficult to confirm that the performance gain stems from the proposed fixes rather than an altered objective.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly define 'fastforward' and 'MeanFlow' for readers new to the prior work, including a brief recap of the original objective.
  2. [Method] Notation for v (instantaneous velocity) and u (average velocity) should be introduced with a clear equation relating them to the original MeanFlow fields.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions made to strengthen the paper.

point-by-point responses
  1. Referee: [Method section (training objective reformulation)] The reformulation of the training objective as a loss on instantaneous velocity v, re-parameterized by a network predicting average velocity u, is presented as yielding a more standard regression problem that improves stability. However, no derivation is provided showing that this re-parameterization is equivalent to the original MeanFlow objective or that the mapping from u to v preserves the fixed point exactly without introducing bias or approximation error. This is load-bearing for the central claim, as the 1.72 FID result is reported for a model trained with the new objective.

    Authors: We agree that an explicit derivation is necessary to support the central claim. In the revised manuscript we have added a full derivation in the Method section (new subsection 3.2) proving that the re-parameterized objective on instantaneous velocity v is mathematically equivalent to the original MeanFlow loss. The derivation shows that, when the network satisfies the mean-flow consistency condition, the mapping from the predicted average velocity u to v preserves the fixed point exactly and introduces neither bias nor approximation error; the change is only in the form of the regression target, which improves numerical stability without altering the optimization landscape at convergence. revision: yes

  2. Referee: [Experiments section] Experimental results: The manuscript reports a concrete 1.72 FID for 1-NFE on ImageNet 256×256 but provides no error bars, ablation studies isolating the effect of the re-parameterization versus the guidance changes, or verification that the learned flow satisfies the original mean-flow consistency condition. These omissions make it difficult to confirm that the performance gain stems from the proposed fixes rather than an altered objective.

    Authors: We acknowledge these omissions weaken the experimental validation. In the revised version we have added (i) error bars computed over three independent runs for the main 1.72 FID result, (ii) ablation tables that isolate the contribution of the velocity re-parameterization from the in-context conditioning changes, and (iii) a consistency verification experiment that reports the mean-flow consistency error on held-out data, confirming the learned model satisfies the original condition to within numerical tolerance. These additions directly address the concern that gains might stem from an altered objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity; reformulation is a modeling choice and performance is empirical.

full rationale

The paper recasts the original MF training target (which depends on the network) as a loss on instantaneous velocity v re-parameterized via a network outputting average velocity u. This is described as producing a more standard regression problem that improves stability. The headline result of 1.72 FID at 1-NFE on ImageNet 256x256 is obtained by training the resulting model from scratch and measuring its generative performance on held-out data. No equation in the provided text reduces the reported FID or the validity of the one-step generator to a fitted parameter or self-citation by construction. The re-parameterization is a design decision whose correctness is assessed by downstream empirical outcomes rather than being tautological. No self-citation chain or uniqueness theorem is invoked to force the central claim. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard assumptions in generative modeling (data distribution admits a well-behaved velocity field, classifier-free guidance can be treated as conditioning) plus the domain assumption that the velocity re-parameterization improves stability. No new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Re-parameterizing the training target as a regression on instantaneous velocity produces a more stable and standard optimization problem.
    Invoked when the authors state that the reformulation yields a more standard regression problem and improves training stability.




Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discrete MeanFlow: One-Step Generation via Conditional Transition Kernels

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete MeanFlow parameterizes CTMC conditional transition kernels with a boundary-by-construction design to enable exact one-step generation in discrete state spaces.

  2. One-Step Generative Modeling via Wasserstein Gradient Flows

    cs.LG 2026-05 conditional novelty 7.0

    W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

  3. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.

  4. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.

  5. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  6. Speech Enhancement Based on Drifting Models

    cs.SD 2026-04 unverdicted novelty 7.0

    DriftSE achieves one-step speech enhancement by evolving the pushforward distribution of a mapping function to match the clean speech distribution using a learned drifting field.

  7. Learning Sampled-data Control for Swarms via MeanFlow

    cs.LG 2026-03 unverdicted novelty 7.0

    Generalizes MeanFlow to learn finite-horizon minimum-energy control coefficients for linear swarm systems via a differential identity and stop-gradient regression objective.

  8. Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

  9. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 unverdicted novelty 7.0

    Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

  10. Efficient Image Synthesis with Sphere Latent Encoder

    cs.CV 2026-05 unverdicted novelty 6.0

    Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.

  11. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  12. A Few-Step Generative Model on Cumulative Flow Maps

    cs.LG 2026-05 unverdicted novelty 6.0

    Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.

  13. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 6.0

    CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...

  14. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.

  15. Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows

    cs.CV 2026-04 unverdicted novelty 6.0

    Point-MF performs one-step point cloud reconstruction from single images by learning a mean velocity field in point space with a tailored Diffusion Transformer and a new auxiliary loss.

  16. Speech Enhancement Based on Drifting Models

    cs.SD 2026-04 unverdicted novelty 6.0

    DriftSE formulates speech denoising as an equilibrium problem solved in one step via a learned drifting field that matches distributions, enabling unpaired training and outperforming multi-step baselines on VoiceBank-DEMAND.

  17. Speech Enhancement Based on Drifting Models

    cs.SD 2026-04 unverdicted novelty 6.0

    DriftSE achieves one-step speech enhancement by evolving a pushforward distribution to match clean speech using a drifting field, outperforming multi-step diffusion on VoiceBank-DEMAND.

  18. FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

    cs.CL 2026-04 unverdicted novelty 6.0

    FlowLM converts diffusion LMs to flow matching via fine-tuning, achieving few-step generation that rivals or beats 2000-step diffusion and saturates faster than training flow models from scratch.

  19. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 conditional novelty 6.0

    Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.

  20. Drift Flow Matching

    cs.LG 2026-05 unverdicted novelty 5.0

    Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.

  21. Real-time Speech Restoration using Data Prediction Mean Flows

    eess.AS 2026-05 unverdicted novelty 5.0

    A Data Prediction Mean Flow model enables real-time speech restoration with 120x lower compute and no algorithmic latency beyond the STFT while matching state-of-the-art offline quality.

  22. Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling

    astro-ph.IM 2026-05 unverdicted novelty 4.0

    One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 16 Pith papers · 4 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023

  2. [2]

    Stochastic interpolants: A unifying framework for flows and diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. In ICLR, 2023

  3. [3]

    Flow map matching

    Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching. TMLR, 2025

  4. [4]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019

  5. [5]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022

  6. [6]

    Visual generation without guidance

    Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. In ICML, 2025

  7. [7]

    pi-flow: Policy-based few-step generation via imitation distillation

    Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, and Sai Bi. pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974, 2025

  8. [8]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  9. [9]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021

  10. [10]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In ICLR, 2025

  11. [11]

    Consistency models made easy

    Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. In ICLR, 2024

  12. [12]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In NeurIPS, 2025

  13. [13]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017

  14. [14]

    Starflow: Scaling latent normalizing flows for high-resolution image synthesis

    Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, and Shuangfei Zhai. Starflow: Scaling latent normalizing flows for high-resolution image synthesis. NeurIPS, 2025

  15. [15]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016

  16. [16]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  17. [17]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021

  18. [18]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  19. [19]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023

  20. [20]

    Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

    Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. In CVPR, 2025

  21. [21]

    Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, and Stefano Ermon. Cmt: Mid-training for efficient learning of consistency, mean flow, and flow map models. arXiv preprint arXiv:2509.24526, 2025

  22. [22]

    Scaling up gans for text-to-image synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023

  23. [23]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022

  24. [24]

    Consistency trajectory models: Learning probability flow ODE trajectory of diffusion

    Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In ICLR, 2024

  25. [25]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  26. [26]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024

  27. [27]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In CVPR, 2022

  28. [28]

    Decoupled meanflow: Turning flow models into flow maps for accelerated sampling

    Kyungmin Lee, Sihyun Yu, and Jinwoo Shin. Decoupled meanflow: Turning flow models into flow maps for accelerated sampling. arXiv preprint arXiv:2510.24474, 2025

  29. [29]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023

  31. [31]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  32. [32]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In ICLR, 2025

  33. [33]

    SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024

  34. [34]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In CVPR, 2023

  35. [35]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023

  36. [36]

    Flow-anchored consistency models

    Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, and Feng Wu. Flow-anchored consistency models. arXiv preprint arXiv:2507.03738, 2025

  37. [37]

    Beyond next-token: Next-x prediction for autoregressive visual generation

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation. In ICCV, 2025

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2021

  39. [39]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016

  40. [40]

    StyleGAN-XL: Scaling StyleGAN to large diverse datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In SIGGRAPH, 2022

  41. [41]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  42. [42]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015

  43. [43]

    Improved techniques for training consistency models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In ICLR, 2024

  44. [44]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019

  45. [45]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  46. [46]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023

  47. [47]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Murtadha Ahmed, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024

  48. [48]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  49. [49]

    Diffusion models without classifier-free guidance

    Zhicong Tang, Jianmin Bao, Dong Chen, and Baining Guo. Diffusion models without classifier-free guidance. In ICML, 2025

  50. [50]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024

  51. [51]

    JetFormer: An autoregressive generative model of raw images and text

    Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: An autoregressive generative model of raw images and text. In ICLR, 2025

  52. [52]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  53. [53]

    DDT: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

  54. [54]

    Transition models: Rethinking the generative learning objective

    Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, and Lei Bai. Transition models: Rethinking the generative learning objective. arXiv preprint arXiv:2509.04394, 2025

  55. [55]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025

  56. [56]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In ICCV, 2025

  57. [57]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025

  58. [58]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019

  59. [59]

    AlphaFlow: Understanding and improving MeanFlow models

    Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, and Ivan Skorokhodov. AlphaFlow: Understanding and improving MeanFlow models. arXiv preprint arXiv:2510.20771, 2025

  60. [60]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  61. [61]

    Inductive moment matching

    Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. In ICML, 2025