pith. sign in

arxiv: 2606.29814 · v1 · pith:3XXAF4BOnew · submitted 2026-06-29 · 💻 cs.CV

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Pith reviewed 2026-06-30 06:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords masked discrete diffusiontext-to-image synthesistoken editinggrouped cross-entropyhigh-resolution image generationself-correctionvocabulary sparsity
0
0 comments X

The pith

A token-editing step at inference and a grouped loss fix self-correction and sparsity problems in masked discrete diffusion for text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Nemotron-Labs-Diffusion-Image as a masked discrete diffusion model that generates high-resolution images from text. It tackles the fact that once a discrete token is unmasked it cannot be changed, removing any chance for later correction, and the fact that large vocabularies spread the training signal too thin. The fixes are a mechanism that lets the model rewrite already-unmasked tokens during sampling and a Grouped Cross-Entropy loss that treats embedding neighbors of the true token as positive examples. A custom fused operator keeps memory use low. If these changes work, discrete models become competitive with continuous diffusion on standard quality benchmarks without needing full-image latent refinement at every step.

Core claim

By adding a token-editing mechanism that allows dynamic revision of unmasked tokens during inference and a Grouped Cross-Entropy objective that supplies positive learning signals to tokens neighboring the ground truth in embedding space, together with a fused operator that reduces VRAM consumption, masked discrete diffusion models overcome their lack of self-correction and training-signal sparsity, yielding improved efficiency and higher image fidelity on high-resolution text-to-image tasks.

What carries the argument

The token-editing mechanism, which revises already-unmasked discrete tokens at inference time, paired with the Grouped Cross-Entropy loss that rewards embedding-space neighbors of the correct token.

If this is right

  • Discrete models can now iteratively refine an image in the same way continuous models progressively denoise the full latent.
  • Larger token vocabularies become practical for generation because the loss no longer starves most tokens of gradient.
  • Training runs require less memory in high-vocabulary regimes thanks to the fused operator.
  • The resulting generators reach 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The editing step might reduce the number of sampling iterations needed to reach a given quality level.
  • The same grouped-loss idea could be tested on other discrete generative tasks such as audio or video token sequences.
  • If token editing works reliably, future work could explore learned policies for when and which tokens to rewrite.

Load-bearing premise

The token-editing mechanism can be applied at inference without introducing new inconsistencies or artifacts that cancel out the self-correction benefit.

What would settle it

Generate the same prompts with and without the token-editing step and find that human preference or GenEval scores do not rise or that visible artifacts increase when editing is enabled.

read the original abstract

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG and 10.76 of HPSv3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. It identifies two challenges in prior MDMs: lack of self-correction because unmasked discrete tokens cannot be revised, and optimization difficulties from sparse per-token signals with large vocabularies. The work introduces a token-editing mechanism for dynamic revision of unmasked tokens at inference and a Grouped Cross-Entropy (GCE) objective that assigns positive signals to embedding-space neighbors of the ground-truth token, plus a fused operator to reduce VRAM usage. It reports scores of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3 as evidence of improved efficiency and fidelity.

Significance. If the claims are substantiated, the token-editing mechanism and GCE objective would represent targeted advances for MDMs, addressing self-correction and sparsity issues that currently limit discrete models relative to continuous diffusion approaches. The fused operator for GCE could offer a practical efficiency gain in large-vocabulary regimes. These elements, if validated with proper controls, would be of interest to the image synthesis community.

major comments (3)
  1. [Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.
  2. [Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.
  3. [Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.
minor comments (1)
  1. [Abstract] Abstract: The abstract asserts 'state-of-the-art' performance without naming the specific prior MDM baselines or metric definitions used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract is overly concise and will revise it to provide more context on experimental details, the mechanisms, and their validation while preserving brevity. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.

    Authors: The abstract is intentionally brief. The full manuscript contains a dedicated Experiments section (Section 4) that reports all requested information: training configurations, datasets (including LAION and internal high-res data), baselines (e.g., comparisons to prior MDMs and continuous diffusion models), ablations isolating token-editing and GCE, and error analysis via qualitative examples and metric breakdowns. We will revise the abstract to include a short clause referencing these controls and noting that the gains are measured against strong baselines. revision: yes

  2. Referee: [Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.

    Authors: The token-editing procedure is an inference-only technique that iteratively re-masks and re-predicts selected tokens using the same trained MDM; no auxiliary losses or edit-specific supervision are required because the model was already trained to denoise arbitrary partial masks. This is analogous to iterative refinement in continuous diffusion. We acknowledge the referee's concern about potential distribution shift and will add a clarifying sentence in the revised abstract plus a short discussion in Section 3.1 on why the mechanism stays in-distribution. We will also include an ablation measuring artifact rates with and without editing. revision: partial

  3. Referee: [Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.

    Authors: The abstract summarizes GCE at a high level. The full formulation (including the mathematical definition, grouping strategy in embedding space, gradient analysis, and the fused operator) appears in Section 3.2, with pseudocode in Algorithm 1 and ablations in Section 4.3 quantifying its impact on convergence speed and final metrics. We will update the abstract to briefly state the core idea and direct readers to the detailed treatment in the methods section. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation

full rationale

The paper introduces a token-editing mechanism and Grouped Cross-Entropy objective as innovations for masked discrete diffusion, then reports empirical scores on GenEval, DPG, and HPSv3. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to the inputs by construction. The derivation chain consists of standard MDM training augmented by the proposed components, with results presented as experimental outcomes rather than tautological predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; ledger reflects only the two challenges and two proposed mechanisms stated in the text. No numerical free parameters are mentioned.

axioms (2)
  • domain assumption Standard MDMs lack self-correcting capability because discrete tokens cannot be modified once unmasked.
    Directly stated as first key challenge.
  • domain assumption Larger vocabulary sizes introduce optimization difficulties due to increasingly sparse per-token training signals.
    Directly stated as second key challenge.
invented entities (2)
  • token-editing mechanism no independent evidence
    purpose: Enable dynamic revision of already-unmasked tokens during inference
    Proposed solution to first challenge; no independent evidence supplied.
  • Grouped Cross-Entropy (GCE) objective no independent evidence
    purpose: Assign positive learning signals to tokens neighboring the ground truth in embedding space
    Proposed solution to second challenge; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5819 in / 1265 out tokens · 26351 ms · 2026-06-30T06:05:48.228120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 38 canonical work pages · 23 internal anchors

  1. [1]

    GPT-4o System Card

    OpenAI. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  2. [2]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  3. [3]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  4. [4]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  5. [5]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 11 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

  6. [6]

    Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

  7. [7]

    Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis.arXiv preprint arXiv:2410.08261, 2024

    Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Xiangtai Li, Zhen Dong, Lei Zhu, and Shuicheng Yan. Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis.arXiv preprint arXiv:2410.08261, 2024

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  9. [9]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  10. [10]

    Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025

    Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025

  11. [11]

    Lavida-r1: Advancing reasoning for unified multimodal diffusion language models.arXiv preprint arXiv:2602.14147, 2026

    Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, and Jason Kuen. Lavida-r1: Advancing reasoning for unified multimodal diffusion language models.arXiv preprint arXiv:2602.14147, 2026

  12. [12]

    MMaDA: Multimodal Large Diffusion Language Models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

  13. [13]

    Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

    Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model.arXiv preprint arXiv:2505.23606, 2025

  14. [14]

    Efficient sequence packing with- out cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

    Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing with- out cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

  15. [15]

    Sparse-lavida: Sparse multimodal discrete diffusion language models.arXiv preprint arXiv:2512.14008, 2025

    Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Sparse-lavida: Sparse multimodal discrete diffusion language models.arXiv preprint arXiv:2512.14008, 2025

  16. [16]

    dkv-cache: The cache for diffusion language models

    Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025

  17. [17]

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jin- sheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

  18. [18]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  19. [19]

    Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024

    Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024

  20. [20]

    Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140, 2025

    Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, and Xingang Wang. Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140, 2025

  21. [21]

    Scalable image tokenization with index backpropagation quantization

    Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025

  22. [22]

    Snce: Geometry-aware supervision for scalable discrete image generation.arXiv preprint arXiv:2603.15150, 2026

    Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, and Jason Kuen. Snce: Geometry-aware supervision for scalable discrete image generation.arXiv preprint arXiv:2603.15150, 2026

  23. [23]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  24. [24]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu Ella. Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 5(7):16, 2024. 12 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

  25. [25]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

  26. [26]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023

  27. [27]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  28. [28]

    Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  29. [29]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  30. [30]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022

  31. [31]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

  32. [32]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  33. [33]

    Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022

    Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, and Ponnuthurai N Suganthan. Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022

  34. [34]

    Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025

  35. [35]

    Edit flows: Flow matching with edit operations,

    Marton Havasi, Brian Karrer, Itai Gat, and Ricky TQ Chen. Edit flows: Flow matching with edit operations. arXiv preprint arXiv:2506.09018, 2025

  36. [36]

    Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, et al. Llada2. 1: Speeding up text diffusion via token editing.arXiv preprint arXiv:2602.08676, 2026

  37. [37]

    Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

    Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

  38. [38]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  39. [39]

    Nemotron-labs-diffusion: A tri-mode language model unifying autoregressive, diffusion, and self-speculation decoding.preprint, May 2026

    Yonggan Fu, Lexington Whalen, Abhinav Garg, Chengyue Wu, Maksim Khadkevich, Nicolai Oswald, Enze Xie, Daniel Egert, Sharath Turuvekere Sreenivas, Shizhe Diao, Chenhan Yu, Ye Yu, Weijia Chen, Sajad Norouzi, Shiyi Lan, Ligeng Zhu, Jin Wang, Jindong Jiang, Morteza Mardani, Mehran Maghoumi, Song Han, Ante Jukic, Nima Tajbakhsh, Jan Kautz, and Pavlo Molchanov....

  40. [40]

    Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes

    Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, et al. Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes. arXiv preprint arXiv:2603.23507, 2026

  41. [41]

    arXiv preprint arXiv:2512.15596 , year =

    Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, and Grigorios G Chrysos. Corrective diffusion language models.arXiv preprint arXiv:2512.15596, 2025

  42. [42]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 13 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

  43. [43]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

  44. [44]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  45. [45]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  46. [46]

    Dall·e 3.https://openai.com/index/dall-e-3/, 2023

    OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023

  47. [47]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  48. [48]

    Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

  49. [49]

    Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

  50. [50]

    Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

    Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  51. [51]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023

  52. [52]

    LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025

  53. [53]

    Lavida: A large diffusion language model for multimodal understanding

    Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025

  54. [54]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

  55. [55]

    Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

  56. [56]

    Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

    Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  58. [58]

    Laion-aesthetics

    Christoph Schuhmann. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022. Accessed: 2024 - 03 - 06

  59. [59]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 14 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis A. Additional Technical Details A.1. Formulation of M...

  60. [60]

    (Nvidia Open Model License) Datasets:LAION [ 54] (MIT), COYO [55] (CC-BY-4.0), MJHQ [25] (CC-BY-4.0), BLIP3o-60k [43] (Apache-2.0), and ShareGPT4o-Image [56] (CC-BY-4.0). 20 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis a beautiful sunset, bright and colourful, ultra realistic, UHD, 8k fluffy white ...