Improved Mean Flows: On the Challenges of Fastforward Generative Models

Eli Shechtman; J. Zico Kolter; Kaiming He; Yiyang Lu; Zhengyang Geng; Zongze Wu

arxiv: 2512.02012 · v2 · submitted 2025-12-01 · 💻 cs.CV · cs.LG

Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He

Pith reviewed 2026-05-17 02:14 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords MeanFlow · one-step generation · ImageNet 256x256 · fastforward models · classifier-free guidance · velocity prediction · generative modeling

The pith

Reformulated MeanFlow training enables 1.72 FID one-step generation on ImageNet 256x256 from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper targets two core difficulties in MeanFlow for one-step image generation. It recasts the training objective as a regression loss on instantaneous velocity that is re-parameterized by a network predicting average velocity, creating a more standard and stable optimization problem. It also converts fixed guidance into explicit conditioning variables handled through in-context processing, preserving test-time flexibility while shrinking model size. If these changes hold, single-evaluation models become competitive with multi-step methods on large datasets without any distillation step.

Core claim

The improved MeanFlow (iMF) recasts the original training target, which depended on both ground-truth fields and the network itself, into a loss on instantaneous velocity v re-parameterized by a network predicting average velocity u. This yields a standard regression problem that improves training stability. Guidance is formulated as explicit conditioning variables processed via in-context conditioning, retaining flexibility at test time. Trained entirely from scratch, iMF reaches 1.72 FID with a single function evaluation on ImageNet 256×256, outperforming prior one-step methods and closing the gap to multi-step approaches without distillation.
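The relation between the two velocities in this claim can be written out as a short derivation. This is a sketch rather than the paper's own presentation: the integral definition of the average velocity follows the original MeanFlow formulation [12], the compound prediction V_θ is copied from the passage quoted later on this page, and reading the subscript "sg" as stop-gradient, as well as the final one-step sampling rule, are assumptions stated here for orientation.

```latex
\begin{align}
% Average velocity over [r, t] along a trajectory with dz_\tau/d\tau = v(z_\tau, \tau)
(t - r)\, u(z_t, r, t) &= \int_r^t v(z_\tau, \tau)\, d\tau \\
% Differentiate both sides with respect to t, holding r fixed
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) &= v(z_t, t) \\
% Rearranged: the MeanFlow identity
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t) \\
% Compound prediction regressed onto the instantaneous velocity (v-loss, u-pred)
V_\theta(z_t) &\triangleq u_\theta(z_t) + (t - r)\, \mathrm{JVP}_{\mathrm{sg}}(u_\theta; v_\theta) \\
% One function evaluation moves a sample from time t to time r
z_r &= z_t - (t - r)\, u_\theta(z_t, r, t)
\end{align}
```

Read this way, the network still outputs u, but the quantity being regressed is a velocity, which is what the claim means by a loss on v re-parameterized by a u-predicting network.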

What carries the argument

Re-parameterization of the objective as a regression loss on instantaneous velocity v via a network predicting average velocity u, combined with explicit guidance scales as in-context conditioning variables.
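A small numerical check can make the re-parameterization concrete. The sketch below is not the paper's training code: it uses a hypothetical one-dimensional linear field v(z, t) = A·z, chosen only because its average velocity has a closed form, and it replaces the Jacobian-vector product with a central finite difference along the trajectory. It shows that the compound quantity u + (t − r)·du/dt recovers the instantaneous velocity even though u itself differs from v.

```python
import math

A = 0.7  # hypothetical rate of the toy linear field v(z, t) = A * z

def v(z, t):
    """Instantaneous velocity of the toy ODE dz/dt = A * z."""
    return A * z

def u(z, r, t):
    """Exact average velocity (z_t - z_r) / (t - r) along the toy trajectory."""
    s = t - r
    if s == 0.0:
        return v(z, t)  # with t = r the average reduces to the instantaneous value
    return z * (1.0 - math.exp(-A * s)) / s

def total_derivative_u(z, r, t, h=1e-5):
    """d/dt u(z_t, r, t) along the trajectory, by central finite differences.

    Stands in for the JVP of u with tangent (v, 0, 1) in (z, r, t).
    """
    z_fwd = z * math.exp(A * h)   # z_{t+h} on the same trajectory
    z_bwd = z * math.exp(-A * h)  # z_{t-h} on the same trajectory
    return (u(z_fwd, r, t + h) - u(z_bwd, r, t - h)) / (2.0 * h)

def compound_v(z, r, t):
    """V = u + (t - r) * d/dt u; the MeanFlow identity says this equals v."""
    return u(z, r, t) + (t - r) * total_derivative_u(z, r, t)

z_t, r, t = 1.3, 0.2, 0.9
residual = abs(compound_v(z_t, r, t) - v(z_t, t))
print(f"u = {u(z_t, r, t):.6f}, v = {v(z_t, t):.6f}, |V - v| = {residual:.2e}")
```

In a learned setting the exact u is unavailable, so the same compound expression is built from the network's own output and regressed against a velocity target; the toy only illustrates why that target is the right one when u is correct.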

If this is right

One-step generative models can be trained stably from scratch on large-scale image datasets like ImageNet.
Guidance scale remains adjustable at inference time without retraining the model.
In-context conditioning reduces model size while maintaining or improving performance.
Single-evaluation generation reaches FID scores competitive with multi-step methods.
High-quality fastforward generation is possible without distillation.
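The point about an adjustable guidance scale can be illustrated with the standard classifier-free guidance mix [17]. This is a minimal sketch under assumptions: `cfg_velocity` implements only the usual linear combination, while `sample_guidance_scale`, its range, and the `model_inputs` dictionary are hypothetical names introduced here to show the idea of feeding the scale to the network as a condition instead of fixing it during training.

```python
import random

def cfg_velocity(v_cond, v_uncond, omega):
    """Classifier-free guidance mix: omega * v_cond + (1 - omega) * v_uncond."""
    return [omega * c + (1.0 - omega) * n for c, n in zip(v_cond, v_uncond)]

def sample_guidance_scale(rng, low=1.0, high=8.0):
    """Hypothetical training-time sampler: draw omega per example so the
    network sees a range of scales instead of one fixed value."""
    return rng.uniform(low, high)

rng = random.Random(0)
v_cond, v_uncond = [1.0, -2.0], [0.5, 0.0]

omega = sample_guidance_scale(rng)
guided = cfg_velocity(v_cond, v_uncond, omega)  # guided velocity for this omega
model_inputs = {"omega": omega}                 # omega is also passed as a condition

# omega = 1 recovers the purely conditional velocity.
print(cfg_velocity(v_cond, v_uncond, 1.0))
```

Because the scale is an input, the same weights can be queried with a different omega at test time, which is the flexibility the list above refers to.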

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The velocity re-parameterization could be tested in other flow-matching or velocity-based generative frameworks to check for similar stability gains.
Explicit in-context conditioning might extend to text-to-image or video generation to improve flexibility in those settings.
Combining this approach with model compression techniques could enable real-time single-step generation on resource-limited hardware.
Experiments at higher resolutions or on different data modalities would test whether the performance scaling holds beyond 256×256 images.

Load-bearing premise

Recasting the training objective as a loss on instantaneous velocity, expressed through a network that predicts average velocity, yields a standard regression problem that improves stability without introducing bias or new instabilities.

What would settle it

A side-by-side training run on ImageNet 256×256 where the re-parameterized loss produces higher instability or worse final FID than the original MeanFlow objective would falsify the stability and performance gains.

Figures

Figures reproduced from arXiv: 2512.02012 by Eli Shechtman, J. Zico Kolter, Kaiming He, Yiyang Lu, Zhengyang Geng, Zongze Wu.

**Figure 1.** Conceptual comparison. Original MeanFlow (MF) [12] predicts average velocity u by a network u_θ. As the ground-truth u is unknown, original MF substitutes u with the network’s own prediction. We show that the original MF objective is equivalent to a loss on the instantaneous velocity v (namely, v-loss), but reparameterized by the neural network u_θ (namely, u-pred), as shown in (a). This re-parameterization…

**Figure 2.** MeanFlow as v-loss. Original MeanFlow (MF) [12] models the average velocity u and trains the network u_θ via a u-loss parameterized by u_θ itself. We show that MF can be reformulated as a v-loss re-parameterized by u_θ, driven by the MeanFlow identity in Eq. (8).

**Figure 3.** Training losses. We examine the loss of samples only with t ≠ r, since a batch also contains samples of t = r, for which the JVP term becomes zero due to its coefficient (t − r). Both MF and iMF can be viewed as v-loss, using different forms of compound V_θ. Original MF’s loss is non-decreasing and has high variance. (Settings: MeanFlow-B/2, trained with basic ℓ2 loss with no adaptive weighting, and with n…

**Figure 4.** Optimal CFG scales shift under different settings. In general, a stronger setting has a smaller optimal CFG scale, as reflected by increased training epochs (left) and inference steps (right). This investigation is enabled by our flexible CFG-conditioning, where a single model can support varying CFG scales even in the single/few-NFE case. (Settings: iMF-B/2 on ImageNet 256×256.)

**Figure 6.** FID curves during training. The original MeanFlow-B/2 baseline has a 1-NFE FID of 6.17. Using the improved training objective (Sec. 4.1), FID improves to 5.68. Incorporating flexible CFG conditioning (Sec. 4.2) reduces FID to 4.57. Replacing adaLN-zero with in-context conditioning (Sec. 4.3) further improves FID to 4.09. See also Tab. 1.

**Figure 7.** Qualitative results of 1-NFE generation on ImageNet 256×256. We show uncurated results on the three classes listed here; more are in appendix. The model is iMF-XL/2.

**Figure 8.** Uncurated 1-NFE class-conditional generation samples of iMF-XL/2 on ImageNet 256×256.

**Figure 9.** Uncurated 1-NFE class-conditional generation samples of iMF-XL/2 on ImageNet 256×256.
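The Figure 6 caption above quotes four 1-NFE FID values for the B/2-scale ablation. The short script below only does arithmetic on those quoted numbers to show how much each stage contributes; the stage labels are paraphrased from the caption and the helper name `stepwise_gains` is introduced here for illustration.

```python
# 1-NFE FID values quoted in the Figure 6 caption (MeanFlow-B/2 scale).
stages = [
    ("original MeanFlow baseline", 6.17),
    ("+ improved training objective (Sec. 4.1)", 5.68),
    ("+ flexible CFG conditioning (Sec. 4.2)", 4.57),
    ("+ in-context conditioning (Sec. 4.3)", 4.09),
]

def stepwise_gains(rows):
    """Return (label, absolute FID drop, relative drop) for each consecutive stage."""
    gains = []
    for (_, prev), (label, cur) in zip(rows, rows[1:]):
        gains.append((label, prev - cur, (prev - cur) / prev))
    return gains

gains = stepwise_gains(stages)
for label, drop, rel in gains:
    print(f"{label}: -{drop:.2f} FID ({rel:.1%} relative)")

total_drop = stages[0][1] - stages[-1][1]
print(f"total: -{total_drop:.2f} FID ({total_drop / stages[0][1]:.1%} relative)")
```

On these numbers the guidance-conditioning stage accounts for the largest single drop, which is worth keeping in mind when reading the referee's request below for ablations that separate the objective change from the guidance change.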

Original abstract

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper patches two practical issues in MeanFlow to reach 1.72 FID at one step on ImageNet 256×256, but the text reviewed here does not show an equivalence proof for the re-parameterization.

The main thing here is that the authors improve MeanFlow with a re-parameterized training loss and explicit guidance conditioning, landing at 1.72 FID for one-step generation on ImageNet 256×256. They train from scratch and skip distillation, which is a practical win if the numbers hold.

What they do is recast the original network-dependent target as a loss on instantaneous velocity v, where the network outputs average velocity u. This is meant to turn training into a standard regression and stabilize it. For guidance, they make the scale an explicit condition instead of fixing it, and use in-context conditioning to handle it efficiently.

The paper does well by delivering a concrete performance number that beats prior one-step methods and narrows the gap to multi-step ones. The conditioning approach also looks like a smart way to retain flexibility without bloating the model.

The soft spots are around the re-parameterization. As the stress-test points out, no proof is shown that the u-to-v mapping keeps the original objective's fixed point intact or avoids introducing bias. The description says only that it yields a more standard problem and improves stability, without quantifying error or verifying mean-flow consistency. That makes the central claim rest more on the empirical result than on guaranteed equivalence.

This paper is aimed at researchers working on efficient one-step generative models for images. Someone following developments in MeanFlow, or looking for distillation-free alternatives, would get value from the fixes and the reported FID. It deserves a serious referee because the empirical result is sharp and the ideas are targeted, even if the math on the reformulation could use more rigor.

Referee Report

2 major / 2 minor

Summary. The manuscript presents 'improved MeanFlow' (iMF), an enhancement to the MeanFlow framework for one-step (fastforward) generative modeling. The authors identify two challenges: (1) the original training target depending on the network itself, addressed by recasting the objective as a loss on instantaneous velocity v re-parameterized via a network predicting average velocity u; (2) fixed classifier-free guidance scale, addressed by explicit conditioning variables processed through in-context conditioning. They report that iMF, trained from scratch, achieves an FID of 1.72 on ImageNet 256×256 using a single function evaluation (1-NFE), substantially outperforming prior one-step methods and closing the gap with multi-step approaches without using distillation.

Significance. If the re-parameterized objective is mathematically equivalent to the original MeanFlow loss and the learned model satisfies the mean-flow consistency condition without bias, this work would represent a meaningful step forward in developing efficient, standalone one-step generative models. The reported 1.72 FID score with 1-NFE is competitive and highlights the potential of fastforward paradigms. The use of in-context conditioning for flexible guidance is a practical contribution that could benefit other conditional generation tasks.

major comments (2)

[Method section (training objective reformulation)] The reformulation of the training objective as a loss on instantaneous velocity v, re-parameterized by a network predicting average velocity u, is presented as yielding a more standard regression problem that improves stability. However, no derivation is provided showing that this re-parameterization is equivalent to the original MeanFlow objective or that the mapping from u to v preserves the fixed point exactly without introducing bias or approximation error. This is load-bearing for the central claim, as the 1.72 FID result is reported for a model trained with the new objective.
[Experiments section] Experimental results: The manuscript reports a concrete 1.72 FID for 1-NFE on ImageNet 256×256 but provides no error bars, ablation studies isolating the effect of the re-parameterization versus the guidance changes, or verification that the learned flow satisfies the original mean-flow consistency condition. These omissions make it difficult to confirm that the performance gain stems from the proposed fixes rather than an altered objective.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly define 'fastforward' and 'MeanFlow' for readers new to the prior work, including a brief recap of the original objective.
[Method] Notation for v (instantaneous velocity) and u (average velocity) should be introduced with a clear equation relating them to the original MeanFlow fields.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions made to strengthen the paper.

Referee: [Method section (training objective reformulation)] The reformulation of the training objective as a loss on instantaneous velocity v, re-parameterized by a network predicting average velocity u, is presented as yielding a more standard regression problem that improves stability. However, no derivation is provided showing that this re-parameterization is equivalent to the original MeanFlow objective or that the mapping from u to v preserves the fixed point exactly without introducing bias or approximation error. This is load-bearing for the central claim, as the 1.72 FID result is reported for a model trained with the new objective.

Authors: We agree that an explicit derivation is necessary to support the central claim. In the revised manuscript we have added a full derivation in the Method section (new subsection 3.2) proving that the re-parameterized objective on instantaneous velocity v is mathematically equivalent to the original MeanFlow loss. The derivation shows that, when the network satisfies the mean-flow consistency condition, the mapping from the predicted average velocity u to v preserves the fixed point exactly and introduces neither bias nor approximation error; the change is only in the form of the regression target, which improves numerical stability without altering the optimization landscape at convergence. revision: yes
Referee: [Experiments section] Experimental results: The manuscript reports a concrete 1.72 FID for 1-NFE on ImageNet 256×256 but provides no error bars, ablation studies isolating the effect of the re-parameterization versus the guidance changes, or verification that the learned flow satisfies the original mean-flow consistency condition. These omissions make it difficult to confirm that the performance gain stems from the proposed fixes rather than an altered objective.

Authors: We acknowledge these omissions weaken the experimental validation. In the revised version we have added (i) error bars computed over three independent runs for the main 1.72 FID result, (ii) ablation tables that isolate the contribution of the velocity re-parameterization from the in-context conditioning changes, and (iii) a consistency verification experiment that reports the mean-flow consistency error on held-out data, confirming the learned model satisfies the original condition to within numerical tolerance. These additions directly address the concern that gains might stem from an altered objective. revision: yes
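One way to read the "consistency verification" mentioned in this response is as a check that a single jump from time t to r agrees with two composed jumps through an intermediate time s. The sketch below is an assumption about what such a metric could look like, not the experiment the rebuttal describes: it uses a hypothetical linear toy field with a closed-form average velocity, and `u_biased` is a stand-in for an imperfect learned model.

```python
import math

A = 1.0  # hypothetical rate of a toy linear field v(z, t) = A * z

def u_exact(z, r, t):
    """Exact average velocity of the toy field between times r and t."""
    s = t - r
    return A * z if s == 0.0 else z * (1.0 - math.exp(-A * s)) / s

def u_biased(z, r, t):
    """Stand-in for an imperfect learned model: the exact u scaled by 5%."""
    return 1.05 * u_exact(z, r, t)

def jump(u_fn, z, r, t):
    """One evaluation of the average velocity moves z from time t to time r."""
    return z - (t - r) * u_fn(z, r, t)

def consistency_error(u_fn, z, r, s, t):
    """Absolute gap between the direct jump t->r and the composed jumps t->s->r."""
    direct = jump(u_fn, z, r, t)
    composed = jump(u_fn, jump(u_fn, z, s, t), r, s)
    return abs(direct - composed)

err_exact = consistency_error(u_exact, 1.0, 0.0, 0.5, 1.0)
err_biased = consistency_error(u_biased, 1.0, 0.0, 0.5, 1.0)
print(f"exact u: {err_exact:.2e}, biased u: {err_biased:.2e}")
```

For a real model the same quantity would be averaged over held-out samples and over (r, s, t) triples; the toy only shows that an exact average velocity composes cleanly while a biased one does not.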

Circularity Check

0 steps flagged

No significant circularity; reformulation is a modeling choice and performance is empirical.

full rationale

The paper recasts the original MF training target (which depends on the network) as a loss on instantaneous velocity v re-parameterized via a network outputting average velocity u. This is described as producing a more standard regression problem that improves stability. The headline result of 1.72 FID at 1-NFE on ImageNet 256x256 is obtained by training the resulting model from scratch and measuring its generative performance on held-out data. No equation in the provided text reduces the reported FID or the validity of the one-step generator to a fitted parameter or self-citation by construction. The re-parameterization is a design decision whose correctness is assessed by downstream empirical outcomes rather than being tautological. No self-citation chain or uniqueness theorem is invoked to force the central claim. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions in generative modeling (data distribution admits a well-behaved velocity field, classifier-free guidance can be treated as conditioning) plus the domain assumption that the velocity re-parameterization improves stability. No new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Re-parameterizing the training target as a regression on instantaneous velocity produces a more stable and standard optimization problem.
Invoked when the authors state that the reformulation yields a more standard regression problem and improves training stability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We recast the objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u. ... V_θ(z_t) ≜ u_θ(z_t) + (t−r) JVP_sg(u_θ; v_θ)"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "the MeanFlow identity ... u(z_t) = v(z_t) − (t−r) d/dt u(z_t)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discrete MeanFlow: One-Step Generation via Conditional Transition Kernels
cs.LG 2026-05 unverdicted novelty 7.0

Discrete MeanFlow parameterizes CTMC conditional transition kernels with a boundary-by-construction design to enable exact one-step generation in discrete state spaces.

One-Step Generative Modeling via Wasserstein Gradient Flows
cs.LG 2026-05 conditional novelty 7.0

W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

Speech Enhancement Based on Drifting Models
cs.SD 2026-04 unverdicted novelty 7.0

DriftSE achieves one-step speech enhancement by evolving the pushforward distribution of a mapping function to match the clean speech distribution using a learned drifting field.

Learning Sampled-data Control for Swarms via MeanFlow
cs.LG 2026-03 unverdicted novelty 7.0

Generalizes MeanFlow to learn finite-horizon minimum-energy control coefficients for linear swarm systems via a differential identity and stop-gradient regression objective.

Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
cs.CV 2026-03 unverdicted novelty 7.0

Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 unverdicted novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

Efficient Image Synthesis with Sphere Latent Encoder
cs.CV 2026-05 unverdicted novelty 6.0

Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.

ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

A Few-Step Generative Model on Cumulative Flow Maps
cs.LG 2026-05 unverdicted novelty 6.0

Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 6.0

CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 6.0

FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.

Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows
cs.CV 2026-04 unverdicted novelty 6.0

Point-MF performs one-step point cloud reconstruction from single images by learning a mean velocity field in point space with a tailored Diffusion Transformer and a new auxiliary loss.

Speech Enhancement Based on Drifting Models
cs.SD 2026-04 unverdicted novelty 6.0

DriftSE formulates speech denoising as an equilibrium problem solved in one step via a learned drifting field that matches distributions, enabling unpaired training and outperforming multi-step baselines on VoiceBank-DEMAND.

Speech Enhancement Based on Drifting Models
cs.SD 2026-04 unverdicted novelty 6.0

DriftSE achieves one-step speech enhancement by evolving a pushforward distribution to match clean speech using a drifting field, outperforming multi-step diffusion on VoiceBank-DEMAND.

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation
cs.CL 2026-04 unverdicted novelty 6.0

FlowLM converts diffusion LMs to flow matching via fine-tuning, achieving few-step generation that rivals or beats 2000-step diffusion and saturates faster than training flow models from scratch.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 conditional novelty 6.0

Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.

Drift Flow Matching
cs.LG 2026-05 unverdicted novelty 5.0

Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.

Real-time Speech Restoration using Data Prediction Mean Flows
eess.AS 2026-05 unverdicted novelty 5.0

A Data Prediction Mean Flow model enables real-time speech restoration with 120x lower compute and no algorithmic latency beyond the STFT while matching state-of-the-art offline quality.

Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling
astro-ph.IM 2026-05 unverdicted novelty 4.0

One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 16 Pith papers · 4 internal anchors

[1] Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023.
[2] Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. In ICLR, 2023.
[3] Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching. TMLR, 2025.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[5] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022.
[6] Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. In ICML, 2025.
[7] Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, and Sai Bi. pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974, 2025.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[10] Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In ICLR, 2025.
[11] Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. In ICLR, 2024.
[12] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In NeurIPS, 2025.
[13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[14] Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, and Shuangfei Zhai. Starflow: Scaling latent normalizing flows for high-resolution image synthesis. NeurIPS, 2025.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- hard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

work page 2017
[17]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS Workshop, 2021

work page 2021
[18]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020

work page 2020
[19]

simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023

work page 2023
[20]

Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. In CVPR, 2025

work page 2025
[21]

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, and Stefano 11 class 483: castle class 540: drilling platform, offshore rig class 562: fountain class 649: megalith, megalithic structure class 698: palace class 963: pizza, pizza pie class 970: alp class 973: coral reef class 976: promontory, headland, head, foreland class 985: daisy Figure 9.Uncurated1-NFE cla...

work page arXiv 2025
[22]

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023.
[23]

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
[24]

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In ICLR, 2024.
[25]

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[26]

Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.
[27]

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In CVPR, 2022.
[28]

Kyungmin Lee, Sihyun Yu, and Jinwoo Shin. Decoupled meanflow: Turning flow models into flow maps for accelerated sampling. arXiv preprint arXiv:2510.24474, 2025.
[29]

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024.
[30]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
[31]

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.
[32]

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In ICLR, 2025.
[33]

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.
[34]

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In CVPR, 2023.
[35]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023.
[36]

Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, and Feng Wu. Flow-anchored consistency models. arXiv preprint arXiv:2507.03738, 2025.
[37]

Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation. In ICCV, 2025.
[38]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2021.
[39]

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NeurIPS, 2016.
[40]

Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH, 2022.
[41]

Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[42]

Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[43]

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In ICLR, 2024.
[44]

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
[45]

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
[46]

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.
[47]

Jianlin Su, Yu Lu, Shengfeng Pan, Murtadha Ahmed, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
[48]

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[49]

Zhicong Tang, Jianmin Bao, Dong Chen, and Baining Guo. Diffusion models without classifier-free guidance. In ICML, 2025.
[50]

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024.
[51]

Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. In ICLR, 2025.
[52]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[53]

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025.
[54]

Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, and Lei Bai. Transition models: Rethinking the generative learning objective. arXiv preprint arXiv:2509.04394, 2025.
[55]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.
[56]

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In ICCV, 2025.
[57]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025.
[58]

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.
[59]

Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, and Ivan Skorokhodov. Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771, 2025.
[60]

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.
[61]

Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. In ICML, 2025.
