Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

Duoduo Xue; Junhui Hou; Zhiyu Zhu

arxiv: 2606.00094 · v3 · pith:3RZGA262new · submitted 2026-05-25 · 💻 cs.CV · cs.AI

Pith reviewed 2026-07-02 23:08 UTC · model grok-4.3

keywords diffusion models · image generation · data manifold · patch tokenization · FID evaluation · ImageNet-256 · transformer backbone · soft top-k aggregation

The pith

Integrating discrete patch tokenization into a continuous diffusion score function explicitly models data manifold geometry and improves image generation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MIND, a framework that inserts discrete patch tokenization directly into the score function of a continuous diffusion model to capture the underlying geometry of the data manifold. This hybrid approach aims to combine the structural precision of discrete tokens with the parallel sampling flexibility of diffusion while maintaining end-to-end differentiability through a soft top-k aggregation step. Dual-branch high-frequency embeddings address transformer spectral bias, and a multi-stage sampling scheme adjusts behavior across timesteps. On ImageNet-256x256, the resulting models report substantially lower FID scores than DiT and SiT baselines, with a 130M-parameter variant outperforming a 3.1B-parameter competitor when guidance is used.
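To make the spectral-bias remark concrete, here is a minimal Python sketch of one way a two-branch embedding for low-dimensional inputs could look. It is an illustration under stated assumptions, not the paper's layer: the summary does not describe the actual design, so the Fourier-feature branch, the function names `fourier_features` and `dual_branch_embed`, and the `num_bands` parameter are all hypothetical.

```python
import math


def fourier_features(x, num_bands=4):
    """Map each coordinate to sin/cos pairs at geometrically spaced frequencies."""
    feats = []
    for value in x:
        for band in range(num_bands):
            freq = (2.0 ** band) * math.pi
            feats.append(math.sin(freq * value))
            feats.append(math.cos(freq * value))
    return feats


def dual_branch_embed(x, num_bands=4):
    """Hypothetical two-branch embedding (an assumption, not MIND's layer)."""
    low_branch = list(x)                           # raw coordinates, unchanged
    high_branch = fourier_features(x, num_bands)   # high-frequency encoding
    return low_branch + high_branch


# A 2-dimensional input with 3 bands yields 2 + 2 * 3 * 2 = 14 features.
embedding = dual_branch_embed([0.25, -0.5], num_bands=3)
```

The raw branch keeps the original coordinates available to the network, while the sinusoidal branch exposes fine-grained variation that a transformer fed only a few input dimensions might otherwise learn slowly.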

Core claim

By embedding discrete patch tokenization inside the continuous diffusion score function and training it differentiably with soft top-k aggregation, the model learns a parameterization that respects manifold geometry. The reported results are an FID of 22.73 after 80 epochs without guidance (versus 43.47 for vanilla DiT-B/2) and an FID of 2.06 with guidance for the 130M-parameter MIND-B, which the abstract says surpasses the 3.1B-parameter LlamaGen-3B.

What carries the argument

Integration of discrete patch tokenization into the continuous diffusion score function via soft top-k aggregation, which quantifies manifold structure while preserving differentiability and parallel generation.
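As a rough illustration of what a soft top-k aggregation can look like, the Python sketch below takes scores over a codebook, keeps the k highest, and returns their softmax-weighted average. The exact formulation used in MIND is not given in this summary, so the function name `soft_top_k_aggregate`, the `temperature` parameter, and the toy codebook are assumptions made for the example.

```python
import math


def soft_top_k_aggregate(logits, codebook, k=2, temperature=1.0):
    """Softmax-weighted average of the k highest-scoring codebook vectors."""
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected entries, shifted for numerical stability.
    peak = max(logits[i] for i in top)
    weights = [math.exp((logits[i] - peak) / temperature) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(codebook[0])
    return [sum(w * codebook[i][d] for w, i in zip(weights, top)) for d in range(dim)]


# Hypothetical 3-entry codebook with 2-dimensional embeddings.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mixed = soft_top_k_aggregate([2.0, 2.0, -5.0], codebook, k=2)
hard = soft_top_k_aggregate([0.1, 3.0, 0.2], codebook, k=1)
```

With k=1 the result collapses to a hard codebook lookup; with k>1 the output varies smoothly with the scores of the selected entries, which is the property an autodiff framework would need for gradients to reach the scoring network.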

If this is right

The base model nearly halves FID relative to the DiT-B/2 baseline (22.73 versus 43.47) after identical 80-epoch training without guidance.
MIND-B with 130M parameters reaches FID 2.06 with guidance, surpassing the 3.1B-parameter LlamaGen-3B.
MIND-XL with 715M parameters further lowers FID to 1.95.
Average FID drops by 15.95 relative to DiT and by 9.06 relative to SiT across the reported settings.
Multi-stage transition sampling adjusts the sampling scheme according to the timestep during inference.
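The size of these gaps can be checked with a few lines of arithmetic. The Python snippet below uses only the figures quoted from the abstract; the variable names are illustrative.

```python
fid_dit_b2 = 43.47   # vanilla DiT-B/2, 80 epochs, no guidance (from the abstract)
fid_mind_b = 22.73   # MIND base model, same setting (from the abstract)

absolute_drop = fid_dit_b2 - fid_mind_b
relative_drop = absolute_drop / fid_dit_b2
param_ratio = 3.1e9 / 130e6   # LlamaGen-3B parameters over MIND-B parameters

print(f"absolute FID drop: {absolute_drop:.2f}")
print(f"relative FID drop: {relative_drop:.1%}")
print(f"parameter ratio:   {param_ratio:.1f}x")
```

The relative drop comes out just under 48%, which is why the abstract's wording is "nearly halving" rather than halving, and the guided comparison involves a model roughly 24 times smaller than LlamaGen-3B.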

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenization-plus-score-function pattern could be tested on non-image domains such as audio or 3D shapes to check whether manifold geometry benefits transfer.
If the soft top-k mechanism proves stable, it might allow hybrid discrete-continuous architectures in other score-based or flow-based generative models.
Lower parameter counts achieving competitive FID suggest that explicit manifold modeling may reduce the need for ever-larger transformer backbones.

Load-bearing premise

That discrete patch tokenization can be stably inserted into the continuous score function without causing instability or mode collapse that would erase the reported FID gains.

What would settle it

Training the proposed integration on ImageNet-256x256 and measuring whether FID remains at or above the DiT baseline or whether training diverges.

Figures

Figures reproduced from arXiv: 2606.00094 by Duoduo Xue, Junhui Hou, Zhiyu Zhu.

**Figure 1.** Illustration of the proposed MIND, an image generation framework with explicit modeling of data manifold geometry. 3 Proposed Method The low-dimensional manifold hypothesis is widely adopted in the theoretical analysis of image generation with diffusion models [2,10,42,51,63,65], but has not been explicitly utilized in modern generative models. In this paper, we leverage the image tokenization method as th…

**Figure 1.** Illustration of the proposed MIND, an image generation framework with explicit modeling of data manifold geometry. tage, we empirically contrast two distinct methodologies: a continuous projection that maps patchified VAE latents directly onto a 6-dimensional hypersphere (where the exceptionally low dimensionality enforces a high-density space), and a discrete approach that projects VQ token indices into…

**Figure 2.** Continuous (top) and discrete projection (bottom). The plot shows the LPIPS distance between the reconstructed and original images during optimization. Solid lines represent the models with manifold constraints, whereas dashed lines indicate the unconstrained reconstruction. To mitigate these challenges, we employ discrete tokenization to quantize the manifold structure, an approach that significantly …

**Figure 3.** Visual comparison between our MIND-B and SiT-B/2 with cfg=2.0. classifier-free guidance scales. Under the small-scale setting (∼35M parameters), the proposed MIND-S significantly outperforms the baseline models. Specifically, at cfg = 1.5, MIND-S improves the FID by 22.09 and 13.29 compared with DiT-S/2 and SiT-S/2, respectively. MIND-S surpasses eMIGM-S with only one-third of the parameters. Scaling up to…

**Figure 4.** Examples of visual results from our MIND-B with cfg=4.0. Furthermore, our method maintains a distinct advantage over continuous architectures. Specifically, compared with its baseline DiT-XL/2 [49], MIND improves the generation ability from FID=2.27 to 1.95. The proposed MIND, using a stronger tokenizer named GigaTok [30], outperforms the 2B-parameter SimDiff [27] and SiT-XL/2 [44]. Compared with recent mas…

**Figure 5.** Examples of visual results from our MIND-B with FID=2.21.

**Figure 6.** Examples of visual results from our one-step MIND-B-G.

**Figure 7.** Convergence speed.

**Figure 9.** Examples of visual results from our MIND-B-G for text-to-image generation.

Original abstract

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256$\times$256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet-256$\times$256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, superpassing the LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MIND mixes discrete patch tokens into a continuous diffusion score function via soft top-k and dual-branch layers, delivering clear FID cuts on ImageNet with smaller models, though attribution to the manifold geometry part still needs ablations.

The paper's main move is to insert discrete patch tokenization straight into the diffusion score function, using soft top-k aggregation to keep training end-to-end differentiable and a dual-branch high-frequency embedding to counter transformer spectral bias. They add a multi-stage sampling schedule at inference. This is a specific hybrid construction that goes beyond plain DiT or SiT extensions.

The results are the strongest part. After 80 epochs their base model reaches 22.73 FID without guidance, against 43.47 for DiT-B/2. With guidance the 130M-parameter MIND-B hits 2.06, beating LlamaGen-3B at 3.1B parameters, and the 715M XL version reaches 1.95. Those are concrete efficiency gains on the standard ImageNet-256 benchmark.

The soft spots sit in the attribution. The abstract ties the gains to explicit manifold geometry modeling through the discrete-continuous integration, but does not isolate that component from the multi-stage sampler or the embedding changes. Without those ablations it is hard to know how much the claimed geometry modeling actually drives the numbers versus other engineering choices. The 80-epoch schedule is also short by current standards, so it would help to see whether the gap holds under longer or more standardized training.

The work is aimed at researchers building diffusion models who want to explore discrete-continuous hybrids. The empirical claims are sharp enough and the method is described clearly enough that it deserves a serious referee, even if the paper will need revisions for clearer contribution breakdowns and more training transparency.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Data Manifold-aware Image diffusioN moDel (MIND), which integrates discrete patch tokenization into the score function of a continuous diffusion model using a soft top-k aggregation mechanism to explicitly model data manifold geometry. Additional components include dual-branch high-frequency feature embeddings to address spectral bias and a multi-stage transition sampling scheme. On ImageNet-256×256, the base model reports FID 22.73 without guidance (vs. 43.47 for DiT-B/2) after 80 epochs, average FID reductions of 15.95 and 9.06 versus DiT and SiT baselines, and with guidance MIND-B (130M params) reaches FID 2.06, outperforming LlamaGen-3B (3.1B params); MIND-XL further reaches 1.95.

Significance. If the empirical results hold under scrutiny, the work provides a concrete mechanism for blending discrete structural tokens with continuous diffusion, potentially improving geometric fidelity and parameter efficiency in generative models. The reported ability of a 130M model to surpass a 3.1B baseline is noteworthy and could motivate further hybrid discrete-continuous designs.

major comments (2)

[Abstract and experimental results section] The headline FID claims (22.73 vs. 43.47 without guidance; 2.06 with guidance) are presented as direct evidence that the discrete-continuous integration captures manifold geometry, yet no ablation isolates the soft top-k aggregation or dual-branch embedding from the multi-stage sampling or other training choices. Without such controls it is impossible to confirm that the reported gains are attributable to the proposed manifold modeling rather than ancillary implementation details.
[Method section describing framework integration] The soft top-k aggregation is introduced to enable end-to-end differentiability when inserting discrete patch tokens into the continuous score function, but the manuscript provides no analysis (e.g., stability bounds, mode-coverage diagnostics, or gradient-norm monitoring) addressing the risk that this hybrid construction could induce instability or mode collapse, which would undermine both the geometric claim and the FID attribution.
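For readers unfamiliar with the gradient-norm monitoring mentioned in the second comment, a minimal Python sketch follows. It is framework-agnostic and purely illustrative: gradients are represented as plain lists of floats, and the helper names `global_grad_norm` and `flag_spikes`, the window size, and the spike factor are assumptions rather than anything taken from the manuscript.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient entries, given one flat list per parameter."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def flag_spikes(norm_history, window=3, factor=5.0):
    """Return step indices whose norm exceeds `factor` times the trailing mean."""
    spikes = []
    for step in range(window, len(norm_history)):
        trailing = sum(norm_history[step - window:step]) / window
        if norm_history[step] > factor * trailing:
            spikes.append(step)
    return spikes


# Toy history of per-step gradient norms with one sudden jump at step 4.
history = [1.0, 1.1, 0.9, 1.0, 12.0, 1.2]
spike_steps = flag_spikes(history)
```

Logging a curve like `history` across training, and reporting how often steps are flagged, is one inexpensive way to document whether a hybrid discrete-continuous objective trains stably.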

minor comments (2)

[Abstract] 'superpassing' is a typographical error and should read 'surpassing'.
[Abstract] The acronym expansion 'Data Manifold-aware Image diffusioN moDel' contains inconsistent capitalization that should be standardized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We provide point-by-point responses to the major comments and indicate the revisions we will make to address them.

Referee: [Abstract and experimental results section] The headline FID claims (22.73 vs. 43.47 without guidance; 2.06 with guidance) are presented as direct evidence that the discrete-continuous integration captures manifold geometry, yet no ablation isolates the soft top-k aggregation or dual-branch embedding from the multi-stage sampling or other training choices. Without such controls it is impossible to confirm that the reported gains are attributable to the proposed manifold modeling rather than ancillary implementation details.

Authors: We agree that the current presentation would benefit from explicit ablations to isolate the effects of the soft top-k aggregation and dual-branch embeddings. In the revised manuscript, we will include additional ablation experiments that compare the full MIND model against variants without these components, while keeping the multi-stage sampling and other training settings consistent. This will provide clearer attribution of the performance gains to the manifold modeling aspects.

Revision: yes

Referee: [Method section describing framework integration] The soft top-k aggregation is introduced to enable end-to-end differentiability when inserting discrete patch tokens into the continuous score function, but the manuscript provides no analysis (e.g., stability bounds, mode-coverage diagnostics, or gradient-norm monitoring) addressing the risk that this hybrid construction could induce instability or mode collapse, which would undermine both the geometric claim and the FID attribution.

Authors: The manuscript does not currently include dedicated analysis on stability or mode collapse risks. We will add gradient norm monitoring results and mode coverage discussions based on our training logs to the revised version. However, we note that the strong empirical performance, including high sample diversity in the reported FID scores, provides indirect evidence against significant mode collapse. Formal stability bounds are beyond the scope of this work but could be explored in future research.

Revision: partial
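One simple diagnostic that could accompany a mode-coverage discussion is the perplexity of token usage in generated samples. The Python sketch below is an assumption about what such a check might look like, not something the manuscript or rebuttal specifies; `usage_perplexity` is a hypothetical helper operating on a flat list of token indices.

```python
import math
from collections import Counter


def usage_perplexity(token_ids):
    """Exponentiated entropy of the empirical token distribution."""
    counts = Counter(token_ids)
    total = len(token_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


uniform = usage_perplexity([0, 1, 2, 3])      # every token used equally often
collapsed = usage_perplexity([7, 7, 7, 7])    # a single token dominates
```

A value close to the number of distinct tokens indicates broad usage, while a value near 1 indicates that generation has concentrated on very few tokens. It is only a proxy: low token diversity suggests collapse, but high token diversity does not by itself prove good coverage of image modes.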

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with external baseline comparisons

The paper introduces the MIND framework via architectural choices (discrete patch tokenization integrated into continuous diffusion score function, soft top-k aggregation, dual-branch embeddings, multi-stage sampling) and validates via direct FID measurements on ImageNet-256x256 against independent external baselines (DiT-B/2, SiT, LlamaGen). No derivation chain, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. No self-citations are load-bearing for any uniqueness claim. The central results are empirical performance deltas, not algebraic identities or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard diffusion assumptions plus the new modeling choice that discrete tokens can quantify manifold geometry when injected into the score function.

axioms (1)

domain assumption: Continuous diffusion score functions can be augmented with discrete token information to better capture data manifold geometry. Invoked in the description of the MIND framework integration.

Reference graph

Works this paper leans on

71 extracted references · 6 canonical work pages · 1 internal anchor

[1] Austin, J., Johnson, D.D., Ho, J., Tarlow, D., van den Berg, R.: Structured denoising diffusion models in discrete state-spaces. In: NeurIPS. pp. 17981–17993 (2021)

[2] Azangulov, I., Deligiannidis, G., Rousseau, J.: Convergence of diffusion models under the manifold hypothesis in high-dimensions. arXiv preprint arXiv:2409.18804 (2024)

[3] Bao, F., Li, C., Zhu, J., Zhang, B.: Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In: ICLR (2022)

[4] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: CVPR. pp. 22563–22575 (2023)

[5] Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR (2019)

[6] Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: CVPR. pp. 11305–11315 (2022)

[7] Chen, M., Huang, K., Zhao, T., Wang, M.: Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In: ICML. pp. 4672–4712 (2023)

[8] Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)

[9] Cui, H., Pehlevan, C., Lu, Y.M.: A solvable model of learning generative diffusion: theory and insights. In: NeurIPS. pp. 5253–5296 (2025)

[10] De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y.W., Doucet, A.: Riemannian score-based generative modelling. In: NeurIPS. pp. 2406–2422 (2022)

[11] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)

[12] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. In: NeurIPS. pp. 8780–8794 (2021)

[13] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR. pp. 12873–12883 (2021)

[14] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. In: NeurIPS. pp. 75460–75482 (2025)

[15] Geng, Z., Lu, Y., Wu, Z., Shechtman, E., Kolter, J.Z., He, K.: Improved mean flows: On the challenges of fastforward generative models. In: CVPR. pp. 30467–30476 (2026)

[16] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)

[17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS. pp. 6840–6851 (2020)

[18] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021)

[19] Hoogeboom, E., Heek, J., Salimans, T.: Simple diffusion: End-to-end diffusion for high resolution images. In: ICML. pp. 13213–13232 (2023)

[20] Huang, Y., Wang, S.H., Bertozzi, A.L., Wang, B.: RMFlow: Refined mean flow by a noise-injection step for multimodal generation. In: ICLR (2026)

[21] Jo, J., Hwang, S.J.: Continuous diffusion model for language modeling. In: NeurIPS. pp. 97394–97430 (2025)

[22] Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling up gans for text-to-image synthesis. In: CVPR. pp. 10124–10134 (2023)

[23] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. In: NeurIPS. pp. 26565–26577 (2022)

[24] Karras, T., Aittala, M., Kynkäänniemi, T., Lehtinen, J., Aila, T., Laine, S.: Guiding a diffusion model with a bad version of itself. In: NeurIPS. pp. 52996–53021 (2024)

[25] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR. pp. 4396–4405 (2019)

[26] Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(11), 3964–3979 (2021)

[27] Kumar, A., Patel, V.M.: Learning on the manifold: Unlocking standard diffusion transformers with representation encoders. arXiv preprint arXiv:2602.10099 (2026)

[28] Kynkäänniemi, T., Aittala, M., Karras, T., Laine, S., Aila, T., Lehtinen, J.: Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In: NeurIPS. pp. 122458–122483 (2024)

[29] Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image generation using residual quantization. In: CVPR. pp. 11523–11532 (2022)

[30] Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483 (2025)

[31] Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS. vol. 37, pp. 56424–56445 (2024)

[32] Liu, Y., Qu, L., Zhang, H., Wang, X., Jiang, Y., Gao, Y., Ye, H., Li, X., Wang, S., Du, D.K., et al.: Detailflow: 1d coarse-to-fine autoregressive image generation via next-detail prediction. arXiv preprint arXiv:2505.21473 (2025)

[33] Loaiza-Ganem, G., Ross, B.L., Hosseinzadeh, R., Caterini, A.L., Cresswell, J.C.: Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research (2024)

[34] Lou, A., Meng, C., Ermon, S.: Discrete diffusion modeling by estimating the ratios of the data distribution. In: ICML. pp. 32819–32848 (2024)

[35] Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: ECCV. pp. 23–40 (2024)

[36] Ma, X., Zhao, F., Ling, P., Qiu, H., Wei, Z., Yu, H., Huang, J., Zeng, Z., Ma, L.: Towards better & faster autoregressive image generation: From the perspective of entropy. In: NeurIPS. pp. 31466–31497 (2025)

[37] Ma, Z., Wei, L., Wang, S., Zhang, S., Tian, Q.: Deco: Frequency-decoupled pixel diffusion for end-to-end image generation. In: CVPR. pp. 43600–43610 (2026)

[38] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: ICML. pp. 8162–8171 (2021)

[39] van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: NeurIPS. pp. 6309–6318 (2017)

[40] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV. pp. 4195–4205 (2023)

[41] Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., Goldstein, T.: The intrinsic dimension of images and its impact on learning. In: ICLR (2021)

[42] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)

[43] Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., Kuleshov, V.: Simple and effective masked diffusion language models. In: NeurIPS. pp. 130136–130184 (2024)

[44] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)

[45] Sauer, A., Schwarz, K., Geiger, A.: Stylegan-xl: Scaling stylegan to large diverse datasets. In: ACM SIGGRAPH 2022 Conference Proceedings (2022)

[46] Shi, F., Luo, Z., Ge, Y., Yang, Y., Shan, Y., Wang, L.: Scalable image tokenization with index backpropagation quantization. In: ICCV. pp. 16037–16046 (2025)

[47] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M.K.: Simplified and generalized masked diffusion for discrete data. In: NeurIPS. pp. 103131–103167 (2024)

[48] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)

[49] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: ICML. pp. 32211–32252 (2023)

[50] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: NeurIPS. pp. 11895–11907 (2019)

[51] Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024)

[52] Tang, R., Yang, Y.: Adaptivity of diffusion models to manifold structures. In: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics. pp. 1648–1656 (2024)

[53] Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: scalable image generation via next-scale prediction. In: NeurIPS. pp. 84839–84865 (2024)

[54] Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., Qu, Q.: Diffusion models learn low-dimensional distributions via subspace clustering. In: ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy (2025)

[55] Wang, S., Tian, Z., Huang, W., Wang, L.: Ddt: Decoupled diffusion transformer. In: CVPR. pp. 40633–40642 (2026)

[56] Xiong, J., Liu, G., Huang, L., Wu, C., Wu, T., Mu, Y., Yao, Y., Shen, H., Wan, Z., Huang, J., et al.: Autoregressive models in vision: A survey. Transactions on Machine Learning Research (2025)

[57] Xiong, T., Liew, J.H., Huang, Z., Feng, J., Liu, X.: Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. In: ICCV. pp. 18770–18780 (2025)

[58] You, Z., Ou, J., Zhang, X., Hu, J., ZHOU, J., Li, C.: Effective and efficient masked image generation models. In: ICML. pp. 72730–72746 (2025)

[59] Yu, J., Li, X., Koh, J.Y., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y., Baldridge, J., Wu, Y.: Vector-quantized image modeling with improved VQGAN. In: ICLR (2022)

[60] Yu, L., Lezama, J., Gundavarapu, N.B., Versari, L., Sohn, K., Minnen, D., Cheng, Y., Gupta, A., Gu, X., Hauptmann, A.G., Gong, B., Yang, M.H., Essa, I., Ross, D.A., Jiang, L.: Language model beats diffusion - tokenizer is key to visual generation. In: ICLR (2024)

[61] Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: ICLR (2025)

[62] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)

[63] Zheng, A., Wen, X., Zhang, X., Ma, C., Wang, T., YU, G., Zhang, X., QI, X.: Vision foundation models as effective visual tokenizers for autoregressive generation. In: NeurIPS. pp. 62656–62675 (2025)

[64] Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. In: ICLR (2026)

[65] Zhu, L., Wei, F., Lu, Y., Chen, D.: Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%. In: NeurIPS. pp. 12612–12635 (2024)

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry (Supplementary Material)

Duoduo Xue¹, Zhiyu Zhu¹,², Junhui Hou¹. ¹ Department of Computer Science, City University of Hong Kong, Chi...

Feature" level inherently retains all continuous architectural benefits (including our dual-branch high-frequency embeddings), the strict superiority of “Logits

The tokenizer in MIND-B does not need extra training latency using pre-encoded tokens, with decoding costs on par with VAEs in DiT/SiT architectures. MIND-B, implemented with CFG intervals like eMIGM, incurs a sampling cost that matches DiT and is significantly lower than eMIGM. Moreover, we test the convergence speed using the experiment settings in Tab...

Diffusion & Noise Settings
Noise Scale Factor c1: 0.6 / 0.8
Signal Scale Factor c2: 1.0 / 1.0
Training Timestep Range (t): t ∈ [0.2, 0.95] / t ∈ [0.2, 0.95]

Embedding & Vocabulary
Vocabulary Size V: 16,384 / 8192
Embedding Dimension L: 16 / 8
Embedding Subspace: 4 / 2

Network Architecture
Total Parameters (M): 130.48 / 35.21
Number of Blocks: 14 / 14
Hidden Size: 768 / 384
Attention Heads: 12 / 6
Condition Embedding Dimension: 128 / 128

Training & Optimization
Optimizer: DeepSpeed AdamW
Base Learning Rate: 1.0×10⁻³ / 1.0×10⁻³
Batch Size: 1024 / 2048
Training Epochs: 80 / 80

Inference
Timestep schedule: Linear / Linear
Temperature τ: 0.99 / 0.99
Sampling steps: 250 / 250
η: 0.99 / 0.99
(Top-p, Top-k, ρ1, ρ2) (cfg = 1.0): (200, 0.3, 0, 0.9) / (100, 0.8, 0.3, 0.9)
(Top-p, Top-k, ρ1, ρ2) (cfg = 1.5): (100, 0.8, 0.1, 0.9) / (100, 0.8, 0.2, 0.99)
(Top-p, Top-k, ρ1, ρ2) (cfg = 2.0): (100, 0.8, 0.1, 0.9) / (300, 0.8, 0.3, 0.9)

Table S-9: Training and Inference Paramete...
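The inference settings above list a temperature together with top-k and top-p values. How MIND applies them is not visible in this excerpt, so the Python sketch below only shows the generic technique: a temperature-scaled softmax followed by top-k and nucleus (top-p) truncation. The function name `filter_probs` and its argument order are assumptions made for the example.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature softmax followed by top-k and nucleus (top-p) truncation."""
    peak = max(logits)
    exps = [math.exp((value - peak) / temperature) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidate indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    # Renormalize over the surviving candidates.
    keep_mass = sum(probs[i] for i in order)
    return {i: probs[i] / keep_mass for i in order}


top2 = filter_probs([2.0, 1.0, 0.0, -1.0], top_k=2)
nucleus = filter_probs([2.0, 1.0, 0.0, -1.0], top_p=0.5)
```

Here `top2` keeps only the two most likely indices, and `nucleus` keeps the smallest prefix of candidates whose cumulative probability reaches 0.5. Sampling then proceeds from the renormalized dictionary.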
