pith. sign in

arxiv: 2606.26016 · v1 · pith:G3SKPAV6new · submitted 2026-06-24 · 💻 cs.CV

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

Pith reviewed 2026-06-25 19:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords normalizing flowsmasked image modelingimage generationVAE encodersemantic manifoldImageNetFIDend-to-end framework
0
0 comments X

The pith

MIMFlow uses a VAE encoder on masked images to let normalizing flows model only a low-frequency semantic manifold while a decoder handles details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Normalizing flows often exhaust capacity on pixel-level details due to invertibility requirements, limiting their ability to capture high-level semantics. The paper integrates masked image modeling by training a VAE encoder on masked images to produce semantic latents. This setup decouples the generative process so the flow models a simplified manifold and the decoder manages high-frequency synthesis. The result is reported as an FID of 2.50 and 71.3 percent linear probing accuracy on ImageNet 256 by 256, with gains over baseline flows despite using half the tokens.

Core claim

By employing a VAE encoder to infer semantic latent from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis.

What carries the argument

VAE encoder on masked images that infers semantic latent for the normalizing flow, enabling decoupling of low-frequency manifold modeling from high-frequency pixel synthesis.

Load-bearing premise

A VAE encoder applied to masked images will extract a semantic latent low-frequency and simplified enough that the normalizing flow can model it without wasting capacity.

What would settle it

Training a standard normalizing flow directly on latents extracted from unmasked images and obtaining equal or better FID scores on ImageNet 256x256 would show the masking step adds no benefit to the claimed decoupling.

Figures

Figures reproduced from arXiv: 2606.26016 by Limin Wang, Qiushi Guo, Shuai Wang, Tiezheng Ge, Xiaowei Xu, Xinwen Zhang, Yang Chen.

Figure 1
Figure 1. Figure 1: MIM in Different Paradigms. (a) Self-Supervised Learning: Employs high￾ratio masking as a self-supervised proxy task for representation learning. (b) Generative Tokenizers: A two-stage approach where the latent space is pre-trained with MIM before training a separate generative model. (c) MIMFlow (Ours): A unified framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow … view at source ↗
Figure 2
Figure 2. Figure 2: Structure of MIMFlow. N is the number of image patches, K is the number of learnable latent query tokens, m is the binary mask, and e denotes learnable decoder embeddings. MAE [17] and SimMIM [43], present inherent difficulties for density estimation. MAE only processes visible patches, resulting in a latent sequence whose length and positional context vary with the random mask pattern, which imposes an in… view at source ↗
Figure 3
Figure 3. Figure 3: Selected Samples on ImageNet 256 × 256 from MIMFlow-L. We use classifier￾free guidance equal to 2.0. flow model to learn a more efficient and structured semantic manifold, extracting higher generative value from the same parameter budget. Efficiency via Token Compression. A key highlight of MIMFlow is its token efficiency. While most latent models (e.g., DiT, LDM, SimFlow) operate on a 16 × 16 = 256 token … view at source ↗
Figure 4
Figure 4. Figure 4: UMAP visualization on ImageNet of the learned latent space from (a) SD￾VAE; (b) MIMFlow. Colors indicate different classes. MIMFlow presents a more dis￾criminative latent space. flow model, thereby violating the principled decoupling and hindering the NF’s ability to model global structure. Synergy of Auxiliary Semantic Priors. We investigate various auxiliary su￾pervision signals (DINO, CLIP, HOG) in Tab.… view at source ↗
Figure 5
Figure 5. Figure 5: Jacobian Spectral Analysis of STARFlow and MIMFlow. The three pan￾els report, from left to right, the empirical distributions of the largest singular value σmax(J), the smallest singular value σmin(J), and the log-condition number log10 κ(J) (with κ(J) = σmax(J)/σmin(J)). These results confirm that the masking bottleneck effectively forces the latent manifold to prioritize high-level semantic coherence ove… view at source ↗
Figure 6
Figure 6. Figure 6: Linear Probe Accuracy vs Depth under Different Mask Ratios. C.3 Efficiency Analysis A key advantage of our MIMFlow is its high efficiency, achieved through a sig￾nificantly reduced token budget. While existing methods typically rely on 256 or even 1024 tokens to represent sequences, our approach operates effectively with only 128 tokens. As demonstrated in [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
read the original abstract

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latent from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256$\times$256 show that MIMFlow-L reaches 71.3\% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50\% fewer than standard models), it yields a 32.8\% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MIMFlow, a unified end-to-end framework integrating Masked Image Modeling with Normalizing Flows. A VAE encoder infers semantic latents from masked images, allowing the NF to model a simplified low-frequency semantic manifold while a specialized decoder handles high-frequency synthesis. This is claimed to resolve NF capacity bottlenecks on pixel details. On ImageNet 256×256, MIMFlow-L reports FID 2.50 and 71.3% linear probing accuracy using 128 tokens (50% fewer than standard models), with a 32.8% gain over similar-scale NF baselines. Code is released.

Significance. If the decoupling mechanism is substantiated, the approach could meaningfully improve the applicability of exact-likelihood NF models to high-resolution image synthesis by freeing capacity for semantic structure. Releasing code supports reproducibility and follow-up work.

major comments (2)
  1. [Abstract] Abstract: The central claim that the VAE encoder on masked images produces a 'simplified, low-frequency semantic manifold' (allowing the NF to avoid high-frequency modeling) is load-bearing for attributing the FID and accuracy gains to the proposed decoupling, yet the manuscript provides no frequency-spectrum analysis, power-spectrum comparisons, or latent visualizations confirming reduced high-frequency energy in the inferred latent relative to pixels or unmasked inputs.
  2. [Abstract] Abstract: The reported metrics (FID 2.50, 71.3% probing accuracy, 32.8% gain) and the claim of principled decoupling are presented without ablations, baseline details, or controls that isolate the contribution of the MIM-VAE-NF integration versus other design choices (e.g., decoder architecture or token count), leaving the mechanism unverified.
minor comments (1)
  1. [Abstract] The abstract refers to 'MIMFlow-L' without defining the variant (model scale, depth, or other hyperparameters).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger empirical support of the decoupling claim. We address each point below and will revise the manuscript to incorporate the requested analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the VAE encoder on masked images produces a 'simplified, low-frequency semantic manifold' (allowing the NF to avoid high-frequency modeling) is load-bearing for attributing the FID and accuracy gains to the proposed decoupling, yet the manuscript provides no frequency-spectrum analysis, power-spectrum comparisons, or latent visualizations confirming reduced high-frequency energy in the inferred latent relative to pixels or unmasked inputs.

    Authors: We agree that the manuscript does not currently contain frequency-spectrum analysis, power-spectrum comparisons, or supporting latent visualizations. In the revision we will add these elements, including power-spectrum plots of the inferred latents versus raw pixels and unmasked inputs, together with qualitative latent visualizations, to directly substantiate the reduced high-frequency content. revision: yes

  2. Referee: [Abstract] Abstract: The reported metrics (FID 2.50, 71.3% probing accuracy, 32.8% gain) and the claim of principled decoupling are presented without ablations, baseline details, or controls that isolate the contribution of the MIM-VAE-NF integration versus other design choices (e.g., decoder architecture or token count), leaving the mechanism unverified.

    Authors: We concur that additional controls are required to isolate the contribution of the MIM-VAE-NF integration. The revised manuscript will include ablations that disable the masked VAE encoder, vary token count while holding other components fixed, and compare against decoder-only variants, thereby clarifying the source of the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity: decoupling presented as architectural design, not reduced by construction

full rationale

The abstract asserts that applying a VAE encoder to masked images produces a low-frequency semantic latent allowing NF to model only that manifold, but supplies no equations, fitted parameters, or self-citations that make this decoupling equivalent to its inputs by definition. The performance numbers (FID 2.50, 71.3% probing) are reported as empirical outcomes rather than predictions forced by the modeling choice itself. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the provided text; the central mechanism is a stated design assumption whose validity is left to external verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework implicitly relies on the existence of a semantic manifold separable by masking and VAE encoding.

pith-pipeline@v0.9.1-grok · 5783 in / 1176 out tokens · 24359 ms · 2026-06-25T19:32:16.064638+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 14 linked inside Pith

  1. [1]

    Chen, H., Han, Y., Chen, F., Li, X., Wang, Y., Wang, J., Wang, Z., Liu, Z., Zou, D., Raj, B.: Masked autoencoders are effective tokenizers for diffusion models (2025), https://arxiv.org/abs/2502.03444

  2. [2]

    Chen, R.T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.: Neural ordinary dif- ferential equations (2019),https://arxiv.org/abs/1806.07366

  3. [3]

    arXiv preprint arXiv:2504.07963 (2025)

    Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)

  4. [4]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Chen, Y., Xu, X., Wang, S., Zhu, C., Wen, R., Li, X., Ge, T., Wang, L.: Flowing backwards: Improving normalizing flows via reverse representation alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 3074– 3082 (2026)

  5. [5]

    IEEE Conference on Computer Vision and Pattern Recognition pp

    Deng,J.,Dong,W.,Socher,R.,Li,L.J.,Li,K.,Fei-Fei,L.:ImageNet:ALarge-scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition pp. 248–255 (2009)

  6. [6]

    Advances in Neural Information Processing Systems34, 8780–8794 (2021)

    Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems34, 8780–8794 (2021)

  7. [7]

    arXiv preprint arXiv:1410.8516 (2014)

    Dinh, L., Krueger, D., Bengio, Y.: Nice: Non-linear independent components esti- mation. arXiv preprint arXiv:1410.8516 (2014)

  8. [8]

    arXiv preprint arXiv:1605.08803 (2016)

    Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. arXiv preprint arXiv:1605.08803 (2016)

  9. [9]

    In: International Conference on Artificial Intelligence and Statistics

    Draxler, F., Sorrenson, P., Zimmermann, L., Rousselot, A., Köthe, U.: Free-form flows: Make any architecture a normalizing flow. In: International Conference on Artificial Intelligence and Statistics. pp. 2197–2205. PMLR (2024)

  10. [10]

    arXiv preprint arXiv:2402.06578 (2024)

    Draxler, F., Wahl, S., Schnörr, C., Köthe, U.: On the universality of volume- preserving and coupling-based normalizing flows. arXiv preprint arXiv:2402.06578 (2024)

  11. [11]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Gao, S., Zhou, P., Cheng, M.M., Yan, S.: Masked diffusion transformer is a strong image synthesizer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23164–23173 (2023)

  12. [12]

    Gao, Y., Chen, C., Chen, T., Gu, J.: One layer is enough: Adapting pretrained visual encoders for image generation (2025),https://arxiv.org/abs/2512.07829

  13. [13]

    Advances in Neu- ral Information Processing Systems33, 22104–22117 (2020) MIMFlow 21

    Giaquinto, R., Banerjee, A.: Gradient boosted normalizing flows. Advances in Neu- ral Information Processing Systems33, 22104–22117 (2020) MIMFlow 21

  14. [14]

    arXiv preprint arXiv:2506.06276 (2025)

    Gu,J.,Chen,T.,Berthelot,D.,Zheng,H.,Wang,Y.,Zhang,R.,Dinh,L.,Bautista, M.A., Susskind, J., Zhai, S.: Starflow: Scaling latent normalizing flows for high- resolution image synthesis. arXiv preprint arXiv:2506.06276 (2025)

  15. [15]

    Gu, J., Chen, T., Shen, Y., Berthelot, D., Zhai, S., Susskind, J.: Normalizing trajectory models (2026),https://arxiv.org/abs/2605.08078

  16. [16]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Gu, J., Shen, Y., Chen, T., Dinh, L., Wang, Y., Bautista, M.A., Berthelot, D., Susskind, J., Zhai, S.: Starflow-v: End-to-end video generative modeling with au- toregressive normalizing flows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9084–9094 (2026)

  17. [17]

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners (2021),https://arxiv.org/abs/2111.06377

  18. [18]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  19. [19]

    arXiv preprint arXiv:2410.19324 (2024)

    Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., Salimans, T.: Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324 (2024)

  20. [20]

    arXiv preprint arXiv:2212.11972 (2022)

    Jabri, A., Fleet, D., Chen, T.: Scalable adaptive computation for iterative genera- tion. arXiv preprint arXiv:2212.11972 (2022)

  21. [21]

    Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolutions (2018),https://arxiv.org/abs/1807.03039

  22. [22]

    IEEE transactions on pattern analysis and machine intelligence43(11), 3964–3979 (2020)

    Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence43(11), 3964–3979 (2020)

  23. [23]

    Advances in Neural Information Processing Systems32(2019)

    Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems32(2019)

  24. [24]

    In: arXiv preprint arXiv:2405.18373 (2024)

    Lee, S.H., Park, S., Kim, G.M.: REPA-E: End-to-end training of latent-diffusion models via representation alignment. In: arXiv preprint arXiv:2405.18373 (2024)

  25. [25]

    Advances in Neural Information Processing Systems37, 56424–56445 (2024)

    Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation with- out vector quantization. Advances in Neural Information Processing Systems37, 56424–56445 (2024)

  26. [26]

    arXiv preprint arXiv:2401.08740 (2024)

    Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740 (2024)

  27. [27]

    arXiv preprint arXiv:2304.07193 (2023)

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  28. [28]

    Journal of Machine Learning Research22(57), 1–64 (2021)

    Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research22(57), 1–64 (2021)

  29. [29]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)

  30. [30]

    org/abs/2505.06708

    Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., Liu, D., Zhou, J., Lin, J.: Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free (2025),https://arxiv. org/abs/2505.06708

  31. [31]

    arXiv preprint arXiv:2103.00020 (2021) 22 Y

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021) 22 Y. Chen et al

  32. [32]

    arXiv preprint arXiv:2502.20388 (2025)

    Ren, S., Yu, Q., He, J., Shen, X., Yuille, A., Chen, L.C.: Beyond next-token: Next- x prediction for autoregressive visual generation. arXiv preprint arXiv:2502.20388 (2025)

  33. [33]

    In: Bach, F., Blei, D

    Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, pp. 1530–1538. PMLR, Lille, France (07–09 Jul 2015),https://proceedings.mlr.press/v37/ rezende15.html

  34. [34]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  35. [35]

    Advances in neural information processing systems29(2016)

    Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Advances in neural information processing systems29(2016)

  36. [36]

    Shen, Y., Chen, T., Gao, Y., Zhang, Y., Wang, Y., Ángel Bautista, M., Zhai, S., Susskind, J.M., Gu, J.: Starflow2: Bridging language models and normalizing flows for unified multimodal generation (2026),https://arxiv.org/abs/2605.08029

  37. [37]

    Singh, J., Zheng, B., Wu, Z., Zhang, R., Shechtman, E., Xie, S.: Improved baselines with representation autoencoders (2026),https://arxiv.org/abs/2605.18324

  38. [38]

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024)

  39. [39]

    arXiv preprint arXiv:2411.19722 (2024)

    Tschannen, M., Pinto, A.S., Kolesnikov, A.: Jetformer: An autoregressive genera- tive model of raw images and text. arXiv preprint arXiv:2411.19722 (2024)

  40. [40]

    Tu, G., Fu, X., Yu, S., Tang, Y., Kang, H., Qin, L., Zhang, Y., Gu, J.: Latent reasoning with normalizing flows (2026),https://arxiv.org/abs/2606.06447

  41. [41]

    Wang, S., Gao, Z., Zhu, C., Huang, W., Wang, L.: Pixnerd: Pixel neural field diffusion (2025),https://arxiv.org/abs/2507.23268

  42. [42]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Wang, S., Tian, Z., Huang, W., Wang, L.: Ddt: Decoupled diffusion transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 40633–40642 (June 2026)

  43. [43]

    Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: A simple framework for masked image modeling (2022),https://arxiv.org/abs/ 2111.09886

  44. [44]

    Yang, J., Li, T., Fan, L., Tian, Y., Wang, Y.: Latent denoising makes good tok- enizers (2026),https://arxiv.org/abs/2507.15856

  45. [45]

    Yao, J., Song, Y., Zhou, Y., Wang, X.: Towards scalable pre-training of visual tokenizers for generation (2025),https://arxiv.org/abs/2512.13687

  46. [46]

    generation: Taming optimization dilemma in latent diffusion models (2025),https://arxiv.org/abs/2501.01423

    Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models (2025),https://arxiv.org/abs/2501.01423

  47. [47]

    arXiv preprint arXiv:2410.06940 (2024)

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)

  48. [48]

    arXiv preprint arXiv:2412.06329 (2024)

    Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M.A., Jaitly, N., Susskind, J.: Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329 (2024)

  49. [49]

    Zhao, Q., Zheng, G., Yang, T., Zhu, R., Leng, X., Gould, S., Zheng, L.: Simflow: Simplified and end-to-end training of latent normalizing flows (2025),https:// arxiv.org/abs/2512.04084

  50. [50]

    Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders (2025),https://arxiv.org/abs/2510.11690 MIMFlow 23

  51. [51]

    org/abs/2510.23588

    Zheng, G., Zhao, Q., Yang, T., Xiao, F., Lin, Z., Wu, J., Deng, J., Zhang, Y., Zhu, R.: Farmer: Flow autoregressive transformer over pixels (2025),https://arxiv. org/abs/2510.23588

  52. [52]

    In: Transactions on Machine Learning Research (TMLR) (2024)

    Zheng, H., Nie, W., Vahdat, A., Anandkumar, A.: Fast training of diffusion mod- els with masked transformers. In: Transactions on Machine Learning Research (TMLR) (2024)

  53. [53]

    In: arXiv preprint arXiv:2405.15438 (2024)

    Zheng, Y., Tian, Y., Li, S., Wu, Z., Liu, B., Li, J., Ye, B., Zhou, J.R.: LightningDiT: A vision-foundation-model-aligned VAE for fast and high-quality generation. In: arXiv preprint arXiv:2405.15438 (2024)