Recognition: no theorem link
Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
Pith reviewed 2026-05-15 12:12 UTC · model grok-4.3
The pith
FID scores can improve under guidance changes that actually degrade text-image alignment and human preference in one-step image generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In few-step regimes, increasing classifier-free guidance can improve (lower) FID while simultaneously degrading CLIP-based alignment and Pick Score, worsening visual quality; this pattern appears in both one-step flows and multi-step baselines once settings are matched. The authors establish the pattern through controlled class-conditional generation on ImageNet validation, ImageNetV2, and reLAIONet, then introduce csFID, psFID, csIS, and psIS as diagnostics that penalize misalignment between generated images and conditioning signals.
What carries the argument
CLIP-scaled and PickScore-scaled variants of FID (csFID, psFID) and Inception Score (csIS, psIS), which multiply the base metric by an alignment term derived from CLIP Score or Pick Score so that semantically inconsistent generations are penalized.
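The paper does not spell out the formulas (a point the referee raises below), so the following is only a minimal sketch of how such a scaled diagnostic could look, assuming the simplest multiplicative form in which perfect alignment leaves the base metric unchanged; psFID and psIS would swap the CLIP scores for Pick Scores.

```python
import numpy as np

def alignment_penalty(scores: np.ndarray, max_score: float) -> float:
    """Misalignment factor: 1.0 at perfect alignment, larger as mean alignment drops."""
    return max_score / max(float(np.mean(scores)), 1e-8)

def cs_fid(fid: float, clip_scores: np.ndarray, clip_max: float = 100.0) -> float:
    # FID is lower-is-better, so weak CLIP alignment inflates it.
    return fid * alignment_penalty(clip_scores, clip_max)

def cs_is(inception_score: float, clip_scores: np.ndarray, clip_max: float = 100.0) -> float:
    # IS is higher-is-better, so weak CLIP alignment shrinks it.
    return inception_score / alignment_penalty(clip_scores, clip_max)
```

Under this form, a CFG setting that lowers raw FID but drags the mean CLIP score down would see its csFID rise, which is exactly the trade-off the diagnostic is meant to surface.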
If this is right
- Model selection and hyperparameter search in few-step generators must jointly optimize FID and alignment scores rather than FID alone.
- One-step models remain competitive with multi-step systems only when guidance is chosen to preserve text-image consistency rather than to minimize FID.
- Out-of-distribution sets aligned to the same label space as ImageNet expose generalization gaps that standard validation FID misses.
- The same guidance trade-off appears in both native one-step flows and multi-step baselines once step count is controlled.
Where Pith is reading between the lines
- Training loops that early-stop or select checkpoints solely on FID may systematically favor models that over-smooth or ignore conditioning.
- The scaled metrics could be computed at negligible extra cost during evaluation and used to re-rank existing model releases (a minimal sketch follows this list).
- If alignment penalties prove robust, future one-step architectures may need explicit conditioning losses rather than relying on post-hoc guidance tuning.
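Picking up the second bullet above: a toy illustration of re-ranking released checkpoints by csFID instead of raw FID, with entirely hypothetical names and scores and the same assumed multiplicative scaling as in the earlier sketch.

```python
# Hypothetical per-release evaluation results: (name, FID, mean CLIP score).
releases = [
    ("ckpt-a", 11.2, 24.1),
    ("ckpt-b", 12.0, 29.8),
    ("ckpt-c", 10.9, 21.5),
]

def cs_fid(fid, mean_clip, clip_max=100.0):
    return fid * clip_max / mean_clip

# Raw-FID ranking favors ckpt-c; the scaled ranking penalizes its weak alignment.
by_fid = sorted(releases, key=lambda r: r[1])
by_csfid = sorted(releases, key=lambda r: cs_fid(r[1], r[2]))
print([r[0] for r in by_fid])    # ['ckpt-c', 'ckpt-a', 'ckpt-b']
print([r[0] for r in by_csfid])  # ['ckpt-b', 'ckpt-a', 'ckpt-c']
```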
Load-bearing premise
That CLIP Score and Pick Score more reliably track true visual quality and human preference than raw FID when sampling steps are few.
What would settle it
A blinded human preference study in which raters consistently choose images selected by lowest raw FID over those selected by lowest csFID or psFID would falsify the claim that the scaled metrics better reflect quality.
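For concreteness, a sketch of how such a blinded study could be scored; the sample size and counts below are invented, and the paper specifies no study design.

```python
from scipy.stats import binomtest

# Hypothetical blinded study: for each prompt, a rater sees one image chosen by
# lowest raw FID and one chosen by lowest csFID, in randomized order.
n_pairs = 500            # prompt/rater pairs (assumed)
prefer_raw_fid = 212     # times the raw-FID pick was preferred (assumed)

# Two-sided test against the 50/50 null; a significant preference for the
# raw-FID picks would falsify the claim that the scaled metrics track quality better.
result = binomtest(prefer_raw_fid, n_pairs, p=0.5)
print(f"preference rate = {prefer_raw_fid / n_pairs:.2f}, p = {result.pvalue:.4f}")
```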
Original abstract
State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals, worsening visual quality. To make these tradeoffs explicit, we introduce CLIP-scaled and PickScore-scaled variants of FID (csFID, psFID) and Inception Score (csIS, psIS) to serve as a diagnostic for semantically aligned image generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that FID-focused model development and CFG selection can be misleading in few-step regimes for text-to-image generation, as guidance changes may improve FID while degrading text-image alignment (via CLIP Score) and human preference signals (via PickScore), thereby worsening visual quality. To support this, the authors perform setting-matched benchmarks of one-step flow models (MeanFlow, Improved MeanFlow, SoFlow) against multi-step baselines (RAE, Scale-RAE, SiT, Stable Diffusion 3.5, FLUX.1) on ImageNet validation, ImageNetV2, and a new reLAIONet out-of-distribution dataset using FID, Inception Score, CLIP Score, and Pick Score. They introduce CLIP-scaled and PickScore-scaled variants (csFID, psFID, csIS, psIS) as diagnostics for semantically aligned generation.
Significance. If the central empirical findings hold after validation, this work would be significant for generative modeling research by cautioning against over-reliance on unscaled FID in low-step regimes and offering scaled metrics to better track alignment and preference. The standardized matched-setting protocol and reLAIONet dataset represent useful resources for future one-step vs. multi-step comparisons. The contribution is primarily empirical and diagnostic rather than theoretical.
Major comments (3)
- [Abstract and Metrics] The central claim that FID improvements can coincide with degraded visual quality rests on csFID and psFID being superior proxies for human preference. The paper provides no direct human preference study, correlation analysis with external benchmarks, or validation showing that the scaled metrics correlate more strongly with human judgments than raw FID on the few-step samples.
- [Experimental Setup] Details on the exact procedures for matching sampling steps, CFG scales, and other hyperparameters across one-step and multi-step models are insufficient. This is load-bearing for the fairness of the comparisons and the claim that FID-focused CFG selection misleads.
- [Dataset] The construction of reLAIONet (proofreading process, alignment to ImageNet label IDs, and handling of out-of-distribution aspects) lacks sufficient detail to assess potential confounds or biases, which undermines the OOD evaluation claims.
Minor comments (2)
- [Results] Include statistical significance testing (e.g., error bars, p-values) for reported metric differences to strengthen the empirical claims (a generic sketch follows this list).
- [Metrics] Provide explicit formulas for csFID, psFID, csIS, and psIS, including how the scaling factors are computed and applied.
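A generic sketch of the kind of uncertainty estimate the first comment asks for, here a percentile bootstrap over hypothetical per-image CLIP scores; FID itself would require resampled feature sets, which is costlier.

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-image scores."""
    rng = np.random.default_rng(seed)
    n = len(values)
    means = np.array([rng.choice(values, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Usage with hypothetical per-image CLIP scores from two CFG settings:
#   bootstrap_ci(clip_scores_cfg_low), bootstrap_ci(clip_scores_cfg_high)
# Non-overlapping intervals would support a real alignment difference.
```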
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
Point-by-point responses
- Referee: The central claim that FID improvements can coincide with degraded visual quality rests on csFID and psFID being superior proxies for human preference. The paper provides no direct human preference study, correlation analysis with external benchmarks, or validation showing that the scaled metrics correlate more strongly with human judgments than raw FID on the few-step samples.
  Authors: We acknowledge that our manuscript does not include a new human preference study or explicit correlation analysis comparing the scaled metrics to raw FID against human judgments. The scaled metrics (csFID, psFID, csIS, psIS) are introduced as diagnostic tools to highlight the trade-offs observed between FID and semantic alignment metrics like CLIP Score and PickScore in few-step regimes. PickScore itself is derived from human preference data, providing an indirect link. We will revise the abstract and metrics section to clarify that these are proposed as complementary diagnostics rather than validated superior proxies, and add a discussion of their rationale based on the empirical observations in the paper. We believe this addresses the core concern without overclaiming. Revision: partial.
- Referee: Details on the exact procedures for matching sampling steps, CFG scales, and other hyperparameters across one-step and multi-step models are insufficient. This is load-bearing for the fairness of the comparisons and the claim that FID-focused CFG selection misleads.
  Authors: We agree that more detailed descriptions are necessary for reproducibility and to substantiate the fairness of the comparisons. In the revised manuscript, we will expand the Experimental Setup section to include precise information on how sampling steps were matched (e.g., one-step models use 1 step, multi-step models use their standard or equivalent effective steps), the specific CFG scales tested for each model, hyperparameter selection procedures, and any normalization or conditioning adjustments made to ensure setting-matched evaluation (a schematic sketch follows these responses). This will directly support the claims regarding misleading FID-focused CFG selection. Revision: yes.
- Referee: The construction of reLAIONet (proofreading process, alignment to ImageNet label IDs, and handling of out-of-distribution aspects) lacks sufficient detail to assess potential confounds or biases, which undermines the OOD evaluation claims.
  Authors: We thank the referee for pointing this out. The Dataset section will be revised to provide comprehensive details on the construction of reLAIONet, including the proofreading process (e.g., manual review steps and criteria), how captions were aligned to ImageNet label IDs, the selection of out-of-distribution images, and any steps taken to mitigate biases or confounds. This will strengthen the credibility of the OOD evaluation results. Revision: yes.
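To make the promised expansion in the second response concrete, here is one way such a setting-matched protocol could be encoded; every name and value below is illustrative, not the paper's actual configuration.

```python
# Illustrative setting-matched grid (values hypothetical, not the paper's):
# every model family is evaluated at every CFG scale with a fixed step budget,
# so FID/CLIP differences cannot be attributed to mismatched sampling settings.
EVAL_GRID = {
    "steps": {"one_step": 1, "multi_step": 25},    # fixed per model family
    "cfg_scales": [1.0, 1.5, 2.0, 3.0, 4.0],       # shared sweep for all models
    "n_samples": 50_000,                           # same sample count per cell
    "datasets": ["imagenet-val", "imagenet-v2", "relaionet"],
}

def eval_cells(grid):
    """Enumerate (family, steps, cfg, dataset) evaluation cells."""
    for family, steps in grid["steps"].items():
        for cfg in grid["cfg_scales"]:
            for ds in grid["datasets"]:
                yield {"family": family, "steps": steps, "cfg": cfg, "dataset": ds}
```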
Circularity Check
No circularity: empirical comparisons rest on explicitly defined metrics without reduction to inputs
Full rationale
The paper conducts setting-matched benchmarking across one-step and multi-step models using standard metrics (FID, IS, CLIP Score, Pick Score) on ImageNet and OOD sets. It introduces csFID/psFID/csIS/psIS explicitly as scaled diagnostics to surface tradeoffs between FID and alignment scores. The central observation—that CFG can improve raw FID while degrading alignment—is shown via direct computation on model outputs rather than any fitted parameter, self-definition, or self-citation chain. No equations reduce a claimed prediction to its own inputs by construction, and the scaled metrics are presented as new tools rather than smuggled ansatzes or renamed known results. The derivation chain is therefore self-contained against the reported empirical data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard assumptions underlying FID, Inception Score, CLIP Score, and Pick Score calculations remain valid for the evaluated one-step and multi-step models.
Invented entities (1)
- reLAIONet (no independent evidence)
Reference graph
Works this paper leans on
- [1] Black Forest Labs: FLUX.1 [schnell] model card. Hugging Face (2024). https://huggingface.co/black-forest-labs/FLUX.1-schnell
- [2] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022). https://arxiv.org/abs/2211.09800
- [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
- [4] Deng, M., Li, H., Li, T., Du, Y., He, K.: Generative modeling via drifting (2026). https://arxiv.org/abs/2602.04770
- [5] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)
- [6] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling (2025). https://arxiv.org/abs/2505.13447
- [7] Geng, Z., Lu, Y., Wu, Z., Shechtman, E., Kolter, J.Z., He, K.: Improved mean flows: On the challenges of fastforward generative models (2025). https://arxiv.org/abs/2512.02012
- [8] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (NeurIPS) (2017). https://arxiv.org/abs/1706.08500
- [9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851 (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
- [10] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022). https://arxiv.org/abs/2207.12598
- [11] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-Pic: An open dataset of user preferences for text-to-image generation (2023). https://arxiv.org/abs/2305.01569
- [12] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=PqvMRDCJT9t
- [13] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=XVjTT1nw5z
- [14] Liu, X., Zhang, X., Ma, J., Peng, J., Liu, Q.: InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=1k4yZbbDqX
- [15] Luo, T., Yuan, H., Liu, Z.: SoFlow: Solution flow models for one-step generative modeling (2025). https://arxiv.org/abs/2512.15657
- [16] Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers (2024). https://arxiv.org/abs/2401.08740
- [17] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021). https://arxiv.org/abs/2103.00020
- [18] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811 (2019). https://arxiv.org/abs/1902.10811
- [19] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695 (2022). https://arxiv.org/abs/2112.10752
- [20] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems (NeurIPS) (2016). https://arxiv.org/abs/1606.03498
- [21] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: LAION-5B: An open large-scale dataset for training next generation image-text models (2022). https://arxiv.org/abs/2210.08402
- [22] Shirali, A., Hardt, M.: What makes ImageNet look unlike LAION (2023). https://arxiv.org/abs/2306.15769 (last revised Oct 29, 2024)
- [23] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021). https://arxiv.org/abs/2010.02502
- [24] Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun, Y., Xie, S.: Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208 (2026). https://arxiv.org/abs/2601.16208
- [25] Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025). https://arxiv.org/abs/2510.11690

[Appendix A figure: additional across-family qualitative samples — RAE, 25 steps, FID 11.60, MMHM 0.89; Flux.1-dev, 25 steps, FID 25.67, MMHM 0.69; SoFlow, 2…]