pith. sign in

arxiv: 2605.26449 · v1 · pith:7NMEJKJ3new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Cross-scale Aligned Supervision for Training GANs

Pith reviewed 2026-06-29 18:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords GANsmulti-scale generationconsistency regularizationcross-scale alignmentone-step inferenceImageNetFID evaluation
0
0 comments X

The pith

Standard scale-wise adversarial losses in multi-scale GANs allow intermediate outputs to drift toward different samples rather than refine the same one, and a generator-side consistency regularizer fixes the misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that independent adversarial supervision at each resolution fails to enforce a shared sample trajectory from coarse to fine stages. Because each intermediate image is optimized only for realism at its own scale, later stages can generate an unrelated sample instead of refining the prior output. CAT addresses this by retaining scale-specific discrimination while adding a simple consistency term that forces every intermediate generator output to align with the final high-resolution result. On class-conditional ImageNet-256 this produces an FID-50K of 1.56 after 60 epochs using one-step inference, surpassing prior one-step GANs and diffusion baselines.

Core claim

The central claim is that the observed multi-stage synthesis in GANs is not a true coarse-to-fine hierarchy because scale-wise adversarial losses do not constrain later stages to preserve the same sample trajectory; the proposed Cross-scale Aligned Transformer resolves the misalignment by adding generator-side consistency regularization that aligns all intermediate outputs with the final output while leaving the discriminator scale-wise.

What carries the argument

Cross-scale Aligned Transformer (CAT), a generator-side consistency regularization that penalizes deviation between intermediate and final generator outputs to enforce identical sample trajectories across scales.

If this is right

  • One-step inference on ImageNet-256 reaches FID-50K of 1.56 after only 60 training epochs.
  • The method outperforms prior one-step GANs as well as diffusion and flow baselines under the same one-step evaluation protocol.
  • Scale-wise discrimination is preserved while generation trajectories are forced to coincide.
  • Training converges in far fewer epochs than typical multi-scale GAN schedules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency idea could be tested on non-adversarial multi-scale generators such as cascaded diffusion models.
  • If the regularization works by reducing trajectory variance, it may also improve controllability when conditioning signals are applied at early stages.
  • The approach suggests that explicit trajectory constraints might replace or reduce the need for progressive growing schedules in other hierarchical generators.

Load-bearing premise

Enforcing consistency between intermediate and final generator outputs will improve alignment without reducing diversity or harming training stability.

What would settle it

An ablation in which the consistency term is added yet the FID remains above 2.0 or intra-class diversity metrics drop sharply would show the regularization fails to deliver the claimed benefit.

Figures

Figures reproduced from arXiv: 2605.26449 by Jae-Pil Heo, MinKyu Lee, Sangeek Hyun.

Figure 1
Figure 1. Figure 1: Method overview and comparison with 1-step baselines. (Left) Our method combines generator￾side consistency regularization with a scale-wise discriminator. (Right) Our method achieves strong FID with sub￾stantially fewer training epochs in one￾step generation in ImageNet-256. Recent generative models have achieved remarkable progress in image synthesis. A common principle behind many of these advances is t… view at source ↗
Figure 2
Figure 2. Figure 2: Failure modes of standard scale-wise adversarial supervision. (a) Since each scale￾specific image is independently supervised by the discriminator at its own resolution, adversarial gradients can push different stages toward different realistic modes. This enforces scale-wise realism, but not sample-wise cross-scale alignment. (b) At each stage, xk is produced for adversarial supervision, while the subsequ… view at source ↗
Figure 4
Figure 4. Figure 4: Cross-scale inconsistency analysis. We analyze whether intermediate outputs are coher￾ently accumulated toward the final output under standard scale-wise supervision. (a) The distance to the highest-scale image (δk) remains large, showing that intermediate outputs stay far from the final image. (b) The inter-stage rewrite magnitude (Rk) is also large, indicating that later stages substantially rewrite the … view at source ↗
Figure 5
Figure 5. Figure 5: FID-50K training curve. Model Lcons FID-50K 20 epo 40 epo G-B/2 5.43 – G-B/2 ✓ 4.06 – G-M/2 3.27 2.34 G-M/2 ✓ 3.00 1.93 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of generator-side consistency regularization. Under scale-wise supervision, adding Lcons improves cross-scale alignment across different model sizes, reducing δk and Rk while increasing Ak. We use the 20 epoch checkpoint for Base and the 40 epoch checkpoint for Medium. 10k 30k 50k Iterations 4 10 20 40 100 FID-50K Scale-agg Scale-wise + cons. 1 3 5 7 9 11 Layer 0.00 0.25 0.50 0.75 1.00 Attention rat… view at source ↗
Figure 7
Figure 7. Figure 7: Analysis on scale-aggregated discriminator. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Uncurated generated examples and latent interpolation on ImageNet-256. [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overall pipeline of CAT. The generator produces accumulated intermediate outputs [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Uncurated sample comparison with iMF. We compare randomly generated ImageNet￾256 samples from our model (left) and iMF-XL/2 (right), without manual curation. Our model is trained for 60 epochs and uses 166.7 GFLOPs for one-step inference, whereas iMF-XL/2 is trained for 800 epochs and uses 174.6 GFLOPs. For sampling, our model uses truncation ψ = 0.85, and iMF-XL/2 follows the official configuration with … view at source ↗
Figure 11
Figure 11. Figure 11: Uncurated sample comparison with iMF. We compare randomly generated ImageNet￾256 samples from our model (left) and iMF-XL/2 (right), without manual curation. CAT-H/2 requires 60 epochs training and 166.7 inference GFLOPs, while iMF-XL/2 needs 800 epochs and 174.6 GFLOPs. For sampling, our model uses truncation ψ = 0.85, and iMF-XL/2 follows the official configuration with interval [0.42, 0.62] and ω = 8.0… view at source ↗
Figure 12
Figure 12. Figure 12: Uncurated sample comparison with iMF. We compare randomly generated ImageNet￾256 samples from our model (left) and iMF-XL/2 (right), without manual curation. CAT-H/2 requires 60 epochs training and 166.7 inference GFLOPs, while iMF-XL/2 needs 800 epochs and 174.6 GFLOPs. For sampling, our model uses truncation ψ = 0.85, and iMF-XL/2 follows the official configuration with interval [0.42, 0.62] and ω = 8.0… view at source ↗
read the original abstract

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper argues that standard multi-scale GANs suffer from a cross-scale trajectory misalignment problem: scale-wise adversarial supervision pushes each intermediate output toward realism at its resolution but does not constrain later stages to refine the same sample trajectory, nor does it use the intermediate output as an explicit refinement target. To address this, the authors propose CAT (Cross-scale Aligned Transformer), which retains scale-wise discrimination while adding a generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 is reported to reach FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.

Significance. If the reported FID is reproducible and the consistency term is shown to be the causal factor rather than an artifact of architecture, optimizer, or implicit regularization effects, the work would be significant for efficient one-step generative modeling. It offers a lightweight, architecture-agnostic fix that could shorten training of high-quality multi-scale GANs and reframes the interpretation of hierarchical synthesis in adversarial generators.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim (FID 1.56 after 60 epochs) is presented without any ablation studies, diversity metrics (precision/recall), or stability analysis. This leaves the attribution of the result to the proposed consistency regularization unverified and makes it impossible to rule out that the low FID arises from reduced sample support or altered training dynamics induced by the new loss term.
  2. [Abstract] Abstract: the description of the misalignment problem and its resolution is purely qualitative; no equations, pseudocode, or formal definition of the generator-side consistency regularization are supplied, preventing assessment of whether the term enforces identical trajectories across scales without introducing new biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the abstract accordingly to strengthen the presentation of results and method.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim (FID 1.56 after 60 epochs) is presented without any ablation studies, diversity metrics (precision/recall), or stability analysis. This leaves the attribution of the result to the proposed consistency regularization unverified and makes it impossible to rule out that the low FID arises from reduced sample support or altered training dynamics induced by the new loss term.

    Authors: We agree that the abstract's brevity omits supporting details. The full manuscript contains ablations in Section 4.2 isolating the consistency term's contribution (FID rises substantially without it), precision/recall in Table 2, and stability across seeds. We will revise the abstract to reference these results and note that the consistency regularization is the key addition beyond standard multi-scale training. revision: yes

  2. Referee: [Abstract] Abstract: the description of the misalignment problem and its resolution is purely qualitative; no equations, pseudocode, or formal definition of the generator-side consistency regularization are supplied, preventing assessment of whether the term enforces identical trajectories across scales without introducing new biases.

    Authors: The abstract is intentionally high-level. Section 2 formally defines the cross-scale trajectory misalignment, and Section 3 gives the exact consistency loss (Eq. 3: L_cons = sum ||x_l - x_L||_2^2 over scales) plus pseudocode in Algorithm 1. This term directly penalizes deviation from the final trajectory without altering the discriminator. We will add a concise equation and description of the regularization to the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual proposal with independent empirical claims

full rationale

The paper identifies a conceptual issue with scale-wise adversarial supervision in multi-scale GANs (cross-scale trajectory misalignment) and proposes CAT, which retains scale-wise discriminators while adding generator-side consistency regularization. No equations, parameter fits, self-citations, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on the proposed regularization's effect and reported FID results rather than any derivation that reduces to its own inputs by construction. This is a standard empirical method paper whose derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no mathematical details, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5793 in / 861 out tokens · 73317 ms · 2026-06-29T18:54:26.209285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 8 canonical work pages · 7 internal anchors

  1. [1]

    Chang, H

    H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  2. [2]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  3. [3]

    Donahue and K

    J. Donahue and K. Simonyan. Large scale adversarial representation learning.Advances in neural information processing systems, 32, 2019

  4. [4]

    Esser, R

    P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  5. [5]

    Frans, D

    K. Frans, D. Hafner, S. Levine, and P. Abbeel. One step diffusion via shortcut models. InThe Thirteenth International Conference on Learning Representations

  6. [6]

    Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He. Mean flows for one-step generative modeling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  7. [7]

    Z. Geng, Y . Lu, Z. Wu, E. Shechtman, J. Z. Kolter, and K. He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025

  8. [8]

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

  9. [9]

    J. Gu, T. Chen, D. Berthelot, H. Zheng, Y . Wang, R. ZHANG, L. Dinh, M. Á. Bautista, J. M. Susskind, and S. Zhai. STARFlow: Scaling latent normalizing flows for high-resolution image synthesis. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  10. [10]

    Heusel, H

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  11. [11]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  12. [12]

    Hoogeboom, J

    E. Hoogeboom, J. Heek, and T. Salimans. simple diffusion: End-to-end diffusion for high resolution images. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023

  13. [13]

    S. Hyun, M. Lee, and J.-P. Heo. Scalable gans with transformers.arXiv preprint arXiv:2509.24935, 2025

  14. [14]

    Jolicoeur-Martineau

    A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. InInternational Conference on Learning Representations

  15. [15]

    Kang, J.-Y

    M. Kang, J.-Y . Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling up gans for text-to-image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10124–10134, 2023. 10

  16. [16]

    Karnewar and O

    A. Karnewar and O. Wang. Msg-gan: Multi-scale gradients for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7799–7808, 2020

  17. [17]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation.arXiv preprint arXiv:1710.10196, 2017

  18. [18]

    Karras, S

    T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020

  19. [19]

    T. Li, Y . Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  20. [20]

    J. Lin, R. Zhang, F. Ganz, S. Han, and J.-Y . Zhu. Anycost gans for interactive image synthesis and editing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14986–14996, 2021

  21. [21]

    S. Lin, X. Xia, Y . Ren, C. Yang, X. Xiao, and L. Jiang. Diffusion adversarial post-training for one-step video generation. InInternational Conference on Machine Learning, pages 37959– 37974. PMLR, 2025

  22. [22]

    S. Lin, C. Yang, Z. Lin, H. Chen, and H. Fan. Adversarial flow models.arXiv preprint arXiv:2511.22475, 2025

  23. [23]

    Flow Matching for Generative Modeling

    Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  24. [24]

    Y . Lu, S. Lu, Q. Sun, H. Zhao, Z. Jiang, X. Wang, T. Li, Z. Geng, and K. He. One-step latent-free image generation with pixel mean flows.arXiv preprint arXiv:2601.22158, 2026

  25. [25]

    N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

  26. [26]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 2024

  27. [27]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  28. [28]

    Y . Peng, K. Zhu, Y . Liu, P. Wu, H. Li, X. Sun, and F. Wu. FACM: Flow-anchored consistency models. InThe F ourteenth International Conference on Learning Representations, 2026

  29. [29]

    S. Ren, Q. Yu, J. He, X. Shen, A. Yuille, and L.-C. Chen. Beyond next-token: Next-x prediction for autoregressive visual generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15781–15791, 2025

  30. [30]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  31. [31]

    Sauer, K

    A. Sauer, K. Schwarz, and A. Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022

  32. [32]

    N. Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  33. [33]

    Song and P

    Y . Song and P. Dhariwal. Improved techniques for training consistency models. InThe Twelfth International Conference on Learning Representations, 2024

  34. [34]

    Y . Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. InProceedings of the 40th International Conference on Machine Learning, pages 32211–32252, 2023. 11

  35. [35]

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  36. [36]

    K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

  37. [37]

    S. Wang, Z. Tian, W. Huang, and L. Wang. Ddt: Decoupled diffusion transformer.arXiv preprint arXiv:2504.05741, 2025

  38. [38]

    S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie. Convnext v2: Co- designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023

  39. [39]

    J. Yao, B. Yang, and X. Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

  40. [40]

    Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen. Randomized autoregressive visual generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431– 18441, 2025

  41. [41]

    S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InThe Thirteenth International Conference on Learning Representations

  42. [42]

    Zhang, A

    H. Zhang, A. Siarohin, W. Menapace, M. Vasilkovsky, S. Tulyakov, Q. Qu, and I. Skorokhodov. Alphaflow: Understanding and improving meanflow models. InThe F ourteenth International Conference on Learning Representations, 2026

  43. [43]

    Zheng, N

    B. Zheng, N. Ma, S. Tong, and S. Xie. Diffusion transformers with representation autoencoders. InThe F ourteenth International Conference on Learning Representations, 2026

  44. [44]

    J. Zhu, C. Yang, K. Zheng, Y . Xu, Z. Shi, Y . Zhang, Q. Chen, and Y . Shen. Exploring sparse moe in gans for text-conditioned image synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18411–18423, 2025. 12 A Appendix A.1 Implementation details Table 8: Model configurations and training hyperparameters. configs G-B/2 ...