pith. machine review for the scientific record. sign in

arxiv: 2505.15809 · v2 · submitted 2025-05-21 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

MMaDA: Multimodal Large Diffusion Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal diffusion modelsunified architecturechain-of-thought fine-tuningreinforcement learningtext-to-image generationmultimodal understandingfoundation modelpolicy gradient
0
0 comments X

The pith

A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MMaDA as a multimodal diffusion foundation model built on one shared probabilistic framework that treats text and images the same way. It adds a mixed chain-of-thought fine-tuning step to align reasoning steps across modalities and a custom reinforcement learning method called UniGRPO to improve both understanding and generation after pretraining. The resulting 8B model beats LLaMA-3-7B and Qwen2-7B on textual reasoning, Show-o and SEED-X on multimodal tasks, and SDXL and Janus on text-to-image creation. These results come from experiments that test generalization across the three domains in a single system. The approach aims to close the gap between pretraining and post-training inside unified diffusion models.

Core claim

MMaDA uses a unified diffusion architecture with a shared probabilistic formulation and modality-agnostic design to process textual reasoning, multimodal understanding, and text-to-image generation in one model. A mixed long chain-of-thought fine-tuning strategy creates a common reasoning format across modalities to support later reinforcement learning. UniGRPO, a policy-gradient RL algorithm with diversified rewards, then unifies post-training for both reasoning and generation tasks, producing consistent gains. The 8B version outperforms listed baselines on each task type.

What carries the argument

Unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, plus mixed long CoT fine-tuning and the UniGRPO policy-gradient RL algorithm

If this is right

  • Text and image tasks can share the same core components and training pipeline.
  • Unified CoT formats enable effective cold-start reinforcement learning across modalities.
  • One RL method can drive improvements in both reasoning and generation at the same time.
  • The 8B model already exceeds several specialized systems on standard benchmarks.
  • The framework supplies a complete pretraining-to-post-training pipeline for future diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could lower the cost of running multiple separate models by consolidating them into one.
  • Similar shared formulations might apply to additional modalities such as audio without major redesign.
  • Scaling the same architecture to larger sizes could test whether the unification benefits hold at higher capacity.
  • The RL stage might be adapted to other generative objectives beyond the three evaluated here.

Load-bearing premise

A single probabilistic formulation and modality-agnostic design can integrate and process different data types without any modality-specific components.

What would settle it

An experiment that adds modality-specific components to an otherwise identical model and measures clear gains on any of the three tasks would falsify the claim that the shared design is sufficient.

read the original abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MMaDA, a class of multimodal diffusion foundation models that employs a unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, a mixed long chain-of-thought (CoT) fine-tuning strategy to align reasoning across modalities, and the UniGRPO unified policy-gradient RL algorithm for post-training. The central claim is that the resulting 8B model exhibits strong generalization, outperforming LLaMA-3-7B and Qwen2-7B on textual reasoning, Show-o and SEED-X on multimodal understanding, and SDXL and Janus on text-to-image generation.

Significance. If the experimental results and ablations hold, the work would constitute a meaningful advance in unified multimodal modeling by demonstrating that a single diffusion-based architecture can jointly handle reasoning and generation tasks without modality-specific components, while also providing a concrete RL post-training method (UniGRPO) tailored to diffusion models.

major comments (3)
  1. [Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.
  2. [Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.
  3. [Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.
minor comments (2)
  1. The GitHub link for code and models is provided, which supports reproducibility; ensure the released artifacts include the exact training configurations and evaluation scripts used for the reported numbers.
  2. [Method] Notation for the shared probabilistic formulation could be clarified with an explicit equation early in the method section to make the modality-agnostic property easier to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and rigor of our presentation. We address each major comment below and have revised the manuscript accordingly to provide greater specificity in the abstract and experimental sections, as well as to include new ablations that better isolate the contributions of the unified architecture.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.

    Authors: We agree that the abstract and experimental reporting would benefit from greater specificity. In the revised manuscript, we have updated the abstract to explicitly name the benchmarks (GSM8K and MATH for textual reasoning, MMMU and MMBench for multimodal understanding, and COCO FID for text-to-image generation), the evaluation metrics, protocols, and standard deviations. The experimental section has been expanded with tables that include these details and error bars for all comparisons. revision: yes

  2. Referee: [Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.

    Authors: We acknowledge that the original manuscript did not sufficiently isolate the base architecture. We have added a dedicated ablation subsection (Section 5.4) in the revised version that evaluates the pre-trained unified diffusion model without mixed CoT fine-tuning or UniGRPO. These results are compared directly against the baselines to demonstrate that the modality-agnostic design provides a competitive foundation for cross-modal integration. revision: yes

  3. Referee: [Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.

    Authors: We have addressed this by adding experiments in the revision that remove the mixed CoT stage (replacing it with standard supervised fine-tuning) and omit UniGRPO. The results indicate that while our proposed stages provide additional gains, the unified diffusion architecture remains the primary enabler of the cross-modal capabilities, as performance drops notably without it relative to specialized baselines. revision: yes

Circularity Check

0 steps flagged

No derivation chain reduces to inputs by construction; empirical claims rest on joint innovations

full rationale

The manuscript introduces MMaDA via three explicit design choices (unified diffusion architecture, mixed long CoT fine-tuning, UniGRPO) and reports empirical outperformance. No equations, uniqueness theorems, or fitted-parameter predictions appear in the abstract or described sections that would collapse a claimed result back onto its own inputs. The modality-agnostic property is asserted as an architectural feature rather than derived from prior results. A low score is assigned only because the three contributions are presented jointly, leaving the isolated sufficiency of the base architecture untested; this is a limitation of experimental isolation, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the architecture and algorithms.

pith-pipeline@v0.9.0 · 5611 in / 1023 out tokens · 28649 ms · 2026-05-15T14:47:07.216120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    cs.LG 2026-03 unverdicted novelty 8.0

    Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

  2. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  3. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  4. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  5. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  6. Discrete Langevin-Inspired Posterior Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.

  7. BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

  8. Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

    cs.CL 2026-05 unverdicted novelty 6.0

    Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.

  9. Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.

  10. NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

    cs.LG 2026-05 unverdicted novelty 6.0

    NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.

  11. Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

    cs.AI 2026-04 conditional novelty 6.0

    Auditing five frontier VLMs reveals severe grounding failures (max 0.23 IoU, 19.1% Acc@0.5) and format collapse (up to 99% parse failure) in medical VQA; fine-tuning yields 85.5% SLAKE recall but perception remains th...

  12. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  13. Stability-Weighted Decoding for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.

  14. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  15. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  16. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  17. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

  18. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  19. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  20. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 19 Pith papers · 24 internal anchors

  1. [1]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  3. [3]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  4. [5]

    VL-GPT: A generative pre-trained transformer for vision and language understanding and generation.CoRR, abs/2312.09251, 2023

    Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. VL-GPT: A generative pre-trained transformer for vision and language understanding and generation.CoRR, abs/2312.09251, 2023

  5. [6]

    Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023

  6. [7]

    Generative multimodal models are in-context learners.arXiv preprint arXiv:2312.13286, 2023a

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023

  7. [8]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  8. [9]

    World model on million-length video and language with ringattention.arXiv preprint, 2024

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint, 2024

  9. [10]

    Vila-u: a unified foundation model integrating visual understanding and generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

  10. [11]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InICLR, 2023

  11. [13]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

  12. [14]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  13. [15]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. 2023

  14. [16]

    DreamLLM: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. InICLR, 2024

  15. [17]

    Generating images with multimodal language models

    Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advancesin Neural Information Processing Systems, 36:21487–21506, 2023

  16. [18]

    Llava-plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean Conference on Computer Vision, pages 126–142. Springer, 2024

  17. [19]

    Seed-x: Multimodal models with unified multi-granularity comprehension and generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  18. [20]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  19. [21]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

  20. [22]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

  21. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020. 18

  22. [24]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  23. [25]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  24. [26]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  25. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  26. [28]

    d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025

  27. [30]

    Xing, and Liang Lin

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022

  28. [31]

    Lawrence Zitnick, and Ross Girshick

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, 2016

  29. [32]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024

  30. [33]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  31. [34]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966, 2023

  32. [35]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InCVPR, pages 13040–13051, 2024

  33. [36]

    Llava-phi: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. Llava-phi: Efficient multi-modal assistant with small language model. InProceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22, 2024

  34. [37]

    The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only. InNeurIPS, 2023

  35. [38]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

  36. [39]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, pages 3558–3568, 2021

  37. [40]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023

  38. [41]

    laion-aesthetics-12m-umap

    David McClure. laion-aesthetics-12m-umap. https://huggingface.co/datasets/dclure/ laion-aesthetics-12m-umap, 2024. 19

  39. [42]

    Journeydb: A benchmark for generative image understanding

    Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. InNeurIPS, 2023

  40. [43]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. GitHub, 2023

  41. [44]

    Reasonflux: Hierarchical llm reasoning via scaling thought templates

    Ling Yang, Zhaochen Yu, Bin Cui, and Mengdi Wang. Reasonflux: Hierarchical llm reasoning via scaling thought templates. arXiv preprint arXiv:2502.06772, 2025

  42. [45]

    Limo: Less is more for reasoning, 2025

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025

  43. [46]

    s1: Simple test-time scaling, 2025

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025

  44. [47]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  45. [48]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

  46. [49]

    Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025

  47. [50]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  48. [51]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021

  49. [52]

    Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36, 2024

    JiazhengXu, XiaoLiu, YuchenWu, YuxuanTong, QinkaiLi, MingDing, JieTang, andYuxiaoDong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36, 2024

  50. [53]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023

  51. [54]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

    Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  52. [55]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  53. [57]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  54. [58]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

  55. [59]

    Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model.arXiv preprint arXiv:2501.12327, 2025

    Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, and Yuexian Zou. Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model.arXiv preprint arXiv:2501.12327, 2025

  56. [60]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  57. [61]

    Openai o1 system card.preprint, 2024

    OpenAI. Openai o1 system card.preprint, 2024. 20

  58. [62]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  59. [63]

    Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

  60. [64]

    A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

  61. [65]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucina- tion of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

  62. [66]

    Visual instruction tuning.NeurIPS, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36, 2024

  63. [67]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.CoRR, abs/2304.10592, 2023

  64. [68]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  65. [69]

    Mm1: Methods, analysis & insights from multimodal llm pre-training

    Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024

  66. [70]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  67. [71]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  68. [72]

    Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InICLR. OpenReview.net, 2024

  69. [73]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  70. [74]

    Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024

  71. [75]

    Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

    Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. InForty-firstInternational Conference on Machine Learning, 2024

  72. [76]

    Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation

    Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation. In ICLR, 2025

  73. [77]

    An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

    Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

  74. [78]

    Diffusion-sharpening: Fine-tuning diffusion models with denoising trajectory sharpening.arXiv preprint arXiv:2502.12146, 2025

    Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, and Bin Cui. Diffusion-sharpening: Fine-tuning diffusion models with denoising trajectory sharpening.arXiv preprint arXiv:2502.12146, 2025

  75. [79]

    Reward-directed conditional diffusion: Provable distribution estimation and reward improvement.Advancesin Neural Information Processing Systems, 36:60599–60635, 2023

    Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, and Mengdi Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement.Advancesin Neural Information Processing Systems, 36:60599–60635, 2023

  76. [80]

    Gradient guidance for diffusion models: An optimization perspective.arXiv preprint arXiv:2404.14743, 2024

    Yingqing Guo, Hui Yuan, Yukang Yang, Minshuo Chen, and Mengdi Wang. Gradient guidance for diffusion models: An optimization perspective.arXiv preprint arXiv:2404.14743, 2024. 21

  77. [81]

    Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data

    Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. InInternational Conference on Machine Learning, pages 4672–4712. PMLR, 2023

  78. [82]

    Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

    Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303, 2024

  79. [83]

    Improving diffusion-based image synthesis with context prediction.Advances in Neural Information Processing Systems, 36:37636–37656, 2023

    Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, and Bin Cui. Improving diffusion-based image synthesis with context prediction.Advances in Neural Information Processing Systems, 36:37636–37656, 2023

  80. [84]

    Structure-guided adversarial training of diffusion models

    Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, and Bin Cui. Structure-guided adversarial training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7256–7266, 2024

Showing first 80 references.