Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Aditya Grover; Greg Heinrich; Hanrong Ye; Jan Kautz; Pavlo Molchanov; Shufan Li; Yonggan Fu

arxiv: 2606.29814 · v1 · pith:3XXAF4BOnew · submitted 2026-06-29 · 💻 cs.CV

Shufan Li, Greg Heinrich, Hanrong Ye, Yonggan Fu, Aditya Grover, Jan Kautz, Pavlo Molchanov

Pith reviewed 2026-06-30 06:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords: masked discrete diffusion, text-to-image synthesis, token editing, grouped cross-entropy, high-resolution image generation, self-correction, vocabulary sparsity


The pith

A token-editing step at inference and a grouped loss fix self-correction and sparsity problems in masked discrete diffusion for text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Nemotron-Labs-Diffusion-Image as a masked discrete diffusion model that generates high-resolution images from text. It tackles the fact that once a discrete token is unmasked it cannot be changed, removing any chance for later correction, and the fact that large vocabularies spread the training signal too thin. The fixes are a mechanism that lets the model rewrite already-unmasked tokens during sampling and a Grouped Cross-Entropy loss that treats embedding neighbors of the true token as positive examples. A custom fused operator keeps memory use low. If these changes work, discrete models become competitive with continuous diffusion on standard quality benchmarks without needing full-image latent refinement at every step.

Core claim

By adding a token-editing mechanism that allows dynamic revision of unmasked tokens during inference and a Grouped Cross-Entropy objective that supplies positive learning signals to tokens neighboring the ground truth in embedding space, together with a fused operator that reduces VRAM consumption, masked discrete diffusion models overcome their lack of self-correction and training-signal sparsity, yielding improved efficiency and higher image fidelity on high-resolution text-to-image tasks.

What carries the argument

The token-editing mechanism, which revises already-unmasked discrete tokens at inference time, paired with the Grouped Cross-Entropy loss that rewards embedding-space neighbors of the correct token.
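The abstract gives no formula for the Grouped Cross-Entropy loss, so the following is only a minimal sketch of one plausible reading: the loss is the negative log of the total probability the model assigns to the ground-truth token and its nearest neighbors in embedding space. The function names, the Euclidean nearest-neighbor grouping, and the group size `k` are all assumptions made for illustration, not the authors' definition.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def nearest_neighbors(embeddings, target, k):
    """Indices of the k codebook entries closest to the target (target included)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    order = sorted(range(len(embeddings)),
                   key=lambda i: sq_dist(embeddings[i], embeddings[target]))
    return order[:k]

def grouped_cross_entropy(logits, embeddings, target, k):
    """Assumed form: -log of the probability mass on the target's neighbor group."""
    logp = log_softmax(logits)
    group = nearest_neighbors(embeddings, target, k)
    m = max(logp[i] for i in group)
    return -(m + math.log(sum(math.exp(logp[i] - m) for i in group)))

# Toy codebook (hypothetical): tokens 0 and 1 are close, tokens 2 and 3 are far away.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [-5.0, 4.0]]
logits = [1.0, 2.0, 0.5, -1.0]  # the model prefers token 1, a neighbor of the target

plain_ce = grouped_cross_entropy(logits, embeddings, target=0, k=1)  # ordinary CE
gce = grouped_cross_entropy(logits, embeddings, target=0, k=2)       # group of two
```

With `k=1` the group contains only the target and the loss reduces to ordinary cross-entropy. With `k=2` the prediction of the neighboring token 1 is no longer penalized as a plain miss, which is the kind of denser signal the paper attributes to GCE.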

If this is right

Discrete models can now iteratively refine an image in the same way continuous models progressively denoise the full latent.
Larger token vocabularies become practical for generation because the loss no longer starves most tokens of gradient.
Training runs require less memory in high-vocabulary regimes thanks to the fused operator.
The resulting generators reach 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
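The source does not describe how the fused operator works. One common way fused cross-entropy kernels save memory is to compute logits one vocabulary chunk at a time and keep only running log-sum-exp statistics, so the full logit vector is never stored. The sketch below applies that idea to an assumed grouped loss of the form -log(sum of softmax probabilities over a group); the function names, the loss form, and the chunking scheme are illustrative assumptions, not the paper's implementation.

```python
import math

def _accumulate(running_max, running_sum, values):
    """Fold new logits into a running (max, rescaled sum of exponentials) pair."""
    new_max = max(running_max, max(values))
    # Rescale the old sum to the new maximum before adding the new terms.
    running_sum = running_sum * math.exp(running_max - new_max) + sum(
        math.exp(v - new_max) for v in values)
    return new_max, running_sum

def chunked_grouped_loss(hidden, weight, group, chunk_size):
    """-log sum_{j in group} softmax(W h)_j, one vocabulary chunk at a time.

    Assumes `group` is non-empty. Only one chunk of logits exists at any moment.
    """
    group = set(group)
    all_max, all_sum = -math.inf, 0.0
    grp_max, grp_sum = -math.inf, 0.0
    for start in range(0, len(weight), chunk_size):
        rows = weight[start:start + chunk_size]
        chunk = [sum(w * h for w, h in zip(row, hidden)) for row in rows]
        all_max, all_sum = _accumulate(all_max, all_sum, chunk)
        in_group = [x for j, x in enumerate(chunk, start) if j in group]
        if in_group:
            grp_max, grp_sum = _accumulate(grp_max, grp_sum, in_group)
    return (all_max + math.log(all_sum)) - (grp_max + math.log(grp_sum))

# Hypothetical 4-token vocabulary with 2-dimensional hidden state.
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]
hidden = [1.0, 2.0]
loss_small = chunked_grouped_loss(hidden, weight, [0, 1], chunk_size=1)
loss_full = chunked_grouped_loss(hidden, weight, [0, 1], chunk_size=4)
```

The result is independent of `chunk_size`, so the chunk can be sized to fit available memory. A real GPU kernel would also fuse the backward pass, which this sketch does not attempt.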

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The editing step might reduce the number of sampling iterations needed to reach a given quality level.
The same grouped-loss idea could be tested on other discrete generative tasks such as audio or video token sequences.
If token editing works reliably, future work could explore learned policies for when and which tokens to rewrite.

Load-bearing premise

The token-editing mechanism can be applied at inference without introducing new inconsistencies or artifacts that cancel out the self-correction benefit.

What would settle it

Generate the same prompts with and without the token-editing step and find that human preference or GenEval scores do not rise or that visible artifacts increase when editing is enabled.
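Such a with/without comparison is naturally paired, since each prompt is scored twice. A minimal sketch of how it could be summarized follows, using the mean per-prompt gain and a two-sided sign test. The function name and the scores are made-up placeholders for illustration only; they are not results from the paper.

```python
import math

def paired_comparison(scores_without, scores_with):
    """Mean per-prompt gain and a two-sided sign-test p-value (ties dropped)."""
    diffs = [b - a for a, b in zip(scores_without, scores_with)]
    mean_gain = sum(diffs) / len(diffs)
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return mean_gain, 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return mean_gain, min(1.0, 2 * tail)

# Placeholder per-prompt scores (hypothetical, NOT from the paper).
without_editing = [0.50, 0.60, 0.70, 0.40, 0.80, 0.55]
with_editing = [0.60, 0.70, 0.75, 0.50, 0.85, 0.60]
gain, p_value = paired_comparison(without_editing, with_editing)
```

A gain near zero or a large p-value on real data would count against the self-correction claim. Artifact rates would need a separate check, since a score can rise while visible artifacts also increase.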

Original abstract

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Token editing and grouped cross-entropy target real MDM gaps but rest on unreported experiments.


The paper introduces two mechanisms: inference-time token editing, which lets the model revise already-unmasked tokens, and a Grouped Cross-Entropy loss, which spreads the training signal across embedding neighbors when the vocabulary is large. It also adds a fused operator that lowers VRAM use during training.

These address two stated problems with standard masked discrete diffusion: once a token is revealed it cannot be changed, and big token sets make the per-token loss too sparse. The mechanisms are specific responses rather than generic scaling.

The paper does a reasonable job naming the limitations and sketching fixes that could plausibly help. The reported scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) would be competitive if the experiments hold.

The soft spots are the lack of any experimental details, baselines, ablations, or error analysis in the abstract. The stress-test concern about train-inference mismatch is on point: the training is described as standard masked diffusion with no mention of edit supervision, so it is unclear whether the model ever sees the edited states it is supposed to correct at inference. If the edits are applied heuristically they could create out-of-distribution inputs and add artifacts. The full paper would need to show how the editing is trained or matched to close that gap.

This is for readers already working on discrete diffusion models who want to test these particular tweaks. A serious referee should see it if the full manuscript supplies the missing methods, ablations, and reproducible results, because the targeted problems are genuine even if the evidence so far is thin.

Referee Report

3 major / 1 minor

Summary. The paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. It identifies two challenges in prior MDMs: lack of self-correction because unmasked discrete tokens cannot be revised, and optimization difficulties from sparse per-token signals with large vocabularies. The work introduces a token-editing mechanism for dynamic revision of unmasked tokens at inference and a Grouped Cross-Entropy (GCE) objective that assigns positive signals to embedding-space neighbors of the ground-truth token, plus a fused operator to reduce VRAM usage. It reports scores of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3 as evidence of improved efficiency and fidelity.

Significance. If the claims are substantiated, the token-editing mechanism and GCE objective would represent targeted advances for MDMs, addressing self-correction and sparsity issues that currently limit discrete models relative to continuous diffusion approaches. The fused operator for GCE could offer a practical efficiency gain in large-vocabulary regimes. These elements, if validated with proper controls, would be of interest to the image synthesis community.

major comments (3)

[Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.
[Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.
[Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.

minor comments (1)

[Abstract] Abstract: The abstract asserts 'state-of-the-art' performance without naming the specific prior MDM baselines or metric definitions used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract is overly concise and will revise it to provide more context on experimental details, the mechanisms, and their validation while preserving brevity. Below we respond point-by-point to the major comments.


Referee: [Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.

Authors: The abstract is intentionally brief. The full manuscript contains a dedicated Experiments section (Section 4) that reports all requested information: training configurations, datasets (including LAION and internal high-res data), baselines (e.g., comparisons to prior MDMs and continuous diffusion models), ablations isolating token-editing and GCE, and error analysis via qualitative examples and metric breakdowns. We will revise the abstract to include a short clause referencing these controls and noting that the gains are measured against strong baselines.
revision: yes

Referee: [Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.

Authors: The token-editing procedure is an inference-only technique that iteratively re-masks and re-predicts selected tokens using the same trained MDM; no auxiliary losses or edit-specific supervision are required because the model was already trained to denoise arbitrary partial masks. This is analogous to iterative refinement in continuous diffusion. We acknowledge the referee's concern about potential distribution shift and will add a clarifying sentence in the revised abstract plus a short discussion in Section 3.1 on why the mechanism stays in-distribution. We will also include an ablation measuring artifact rates with and without editing.
revision: partial

Referee: [Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.

Authors: The abstract summarizes GCE at a high level. The full formulation (including the mathematical definition, grouping strategy in embedding space, gradient analysis, and the fused operator) appears in Section 3.2, with pseudocode in Algorithm 1 and ablations in Section 4.3 quantifying its impact on convergence speed and final metrics. We will update the abstract to briefly state the core idea and direct readers to the detailed treatment in the methods section.
revision: yes
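
The re-mask-and-re-predict procedure described in the second simulated response can be illustrated with a toy sampler. This is a hedged sketch only: the `MASK` sentinel, the confidence threshold, the rule for selecting tokens to re-mask, and the fixed `toy_predict` stand-in for a trained model are all hypothetical, and neither the abstract nor the rebuttal specifies how tokens are actually selected.

```python
MASK = -1  # hypothetical sentinel for a masked position

def unmask_step(tokens, predict, n):
    """Commit the n most confident predictions among masked positions."""
    preds = predict(tokens)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    masked.sort(key=lambda i: preds[i][1], reverse=True)
    out = list(tokens)
    for i in masked[:n]:
        out[i] = preds[i][0]
    return out

def edit_step(tokens, predict, threshold):
    """Re-mask committed tokens that the model now confidently disagrees with."""
    preds = predict(tokens)
    out = list(tokens)
    for i, t in enumerate(tokens):
        if t != MASK and preds[i][0] != t and preds[i][1] >= threshold:
            out[i] = MASK
    return out

# Toy stand-in for a trained model: always predicts a fixed sequence with
# confidence 0.9. A real model would condition on the visible tokens and text.
TARGET = [3, 1, 4, 1]

def toy_predict(tokens):
    return [(t, 0.9) for t in TARGET]

state = [3, 7, MASK, MASK]                  # position 1 holds an earlier mistake
after_edit = edit_step(state, toy_predict, threshold=0.8)
final = unmask_step(after_edit, toy_predict, n=3)
```

In this toy run the wrong token at position 1 is re-masked and then re-predicted together with the remaining masked positions. Whether a real model's predictions on such partially committed states are reliable is exactly the train-inference question the referee raises.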

Circularity Check

0 steps flagged

No circularity in claimed derivation


The paper introduces a token-editing mechanism and Grouped Cross-Entropy objective as innovations for masked discrete diffusion, then reports empirical scores on GenEval, DPG, and HPSv3. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to the inputs by construction. The derivation chain consists of standard MDM training augmented by the proposed components, with results presented as experimental outcomes rather than tautological predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; ledger reflects only the two challenges and two proposed mechanisms stated in the text. No numerical free parameters are mentioned.

axioms (2)

domain assumption: Standard MDMs lack self-correcting capability because discrete tokens cannot be modified once unmasked.
Directly stated as first key challenge.
domain assumption: Larger vocabulary sizes introduce optimization difficulties due to increasingly sparse per-token training signals.
Directly stated as second key challenge.

invented entities (2)

token-editing mechanism (no independent evidence)
purpose: Enable dynamic revision of already-unmasked tokens during inference
Proposed solution to first challenge; no independent evidence supplied.
Grouped Cross-Entropy (GCE) objective (no independent evidence)
purpose: Assign positive learning signals to tokens neighboring the ground truth in embedding space
Proposed solution to second challenge; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5819 in / 1265 out tokens · 26351 ms · 2026-06-30T06:05:48.228120+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 38 canonical work pages · 23 internal anchors

[1]

GPT-4o System Card

OpenAI. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[3]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 11 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

2022
[6]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

2020
[7]

Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis.arXiv preprint arXiv:2410.08261, 2024

Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Xiangtai Li, Zhen Dong, Lei Zhu, and Shuicheng Yan. Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis.arXiv preprint arXiv:2410.08261, 2024

work page arXiv 2024
[8]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

2022
[9]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025

work page arXiv 2025
[11]

Lavida-r1: Advancing reasoning for unified multimodal diffusion language models.arXiv preprint arXiv:2602.14147, 2026

Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, and Jason Kuen. Lavida-r1: Advancing reasoning for unified multimodal diffusion language models.arXiv preprint arXiv:2602.14147, 2026

work page arXiv 2026
[12]

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model.arXiv preprint arXiv:2505.23606, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Efficient sequence packing with- out cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing with- out cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

work page arXiv 2021
[15]

Sparse-lavida: Sparse multimodal discrete diffusion language models.arXiv preprint arXiv:2512.14008, 2025

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Sparse-lavida: Sparse multimodal discrete diffusion language models.arXiv preprint arXiv:2512.14008, 2025

work page arXiv 2025
[16]

dkv-cache: The cache for diffusion language models

Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025

work page arXiv 2025
[17]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jin- sheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024

Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024

2024
[20]

Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140, 2025

Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, and Xingang Wang. Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140, 2025

work page arXiv 2025
[21]

Scalable image tokenization with index backpropagation quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025

2025
[22]

Snce: Geometry-aware supervision for scalable discrete image generation.arXiv preprint arXiv:2603.15150, 2026

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, and Jason Kuen. Snce: Geometry-aware supervision for scalable discrete image generation.arXiv preprint arXiv:2603.15150, 2026

work page arXiv 2026
[23]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[24]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu Ella. Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 5(7):16, 2024. 12 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

2024
[26]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017
[29]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022

Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, and Ponnuthurai N Suganthan. Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022

2022
[34]

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Edit flows: Flow matching with edit operations,

Marton Havasi, Brian Karrer, Itai Gat, and Ricky TQ Chen. Edit flows: Flow matching with edit operations. arXiv preprint arXiv:2506.09018, 2025

work page arXiv 2025
[36]

Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, et al. Llada2. 1: Speeding up text diffusion via token editing.arXiv preprint arXiv:2602.08676, 2026

work page arXiv 2026
[37]

Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

work page arXiv 2026
[38]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Nemotron-labs-diffusion: A tri-mode language model unifying autoregressive, diffusion, and self-speculation decoding.preprint, May 2026

Yonggan Fu, Lexington Whalen, Abhinav Garg, Chengyue Wu, Maksim Khadkevich, Nicolai Oswald, Enze Xie, Daniel Egert, Sharath Turuvekere Sreenivas, Shizhe Diao, Chenhan Yu, Ye Yu, Weijia Chen, Sajad Norouzi, Shiyi Lan, Ligeng Zhu, Jin Wang, Jindong Jiang, Morteza Mardani, Mehran Maghoumi, Song Han, Ante Jukic, Nima Tajbakhsh, Jan Kautz, and Pavlo Molchanov....

2026
[40]

Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes

Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, et al. Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes. arXiv preprint arXiv:2603.23507, 2026

work page arXiv 2026
[41]

arXiv preprint arXiv:2512.15596 , year =

Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, and Grigorios G Chrysos. Corrective diffusion language models.arXiv preprint arXiv:2512.15596, 2025

work page arXiv 2025
[42]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 13 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

2025
[45]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[46]

Dall·e 3.https://openai.com/index/dall-e-3/, 2023

OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023

2023
[47]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

2023
[49]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

2015
[50]

Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

2024
[51]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023

[52] Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

[53] Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025

[54] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

[55] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

[56] Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025

[57] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[58] Christoph Schuhmann. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022. Accessed: 2024-03-06

[59] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

[60] (Nvidia Open Model License) Datasets: LAION [54] (MIT), COYO [55] (CC-BY-4.0), MJHQ [25] (CC-BY-4.0), BLIP3o-60k [43] (Apache-2.0), and ShareGPT4o-Image [56] (CC-BY-4.0).

