arxiv: 2505.15809 · v2 · submitted 2025-05-21 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang , Ye Tian , Bowen Li , Xinchen Zhang , Ke Shen , Yunhai Tong , Mengdi Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal diffusion modelsunified architecturechain-of-thought fine-tuningreinforcement learningtext-to-image generationmultimodal understandingfoundation modelpolicy gradient

0 comments

The pith

A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MMaDA as a multimodal diffusion foundation model built on one shared probabilistic framework that treats text and images the same way. It adds a mixed chain-of-thought fine-tuning step to align reasoning steps across modalities and a custom reinforcement learning method called UniGRPO to improve both understanding and generation after pretraining. The resulting 8B model beats LLaMA-3-7B and Qwen2-7B on textual reasoning, Show-o and SEED-X on multimodal tasks, and SDXL and Janus on text-to-image creation. These results come from experiments that test generalization across the three domains in a single system. The approach aims to close the gap between pretraining and post-training inside unified diffusion models.

Core claim

MMaDA uses a unified diffusion architecture with a shared probabilistic formulation and modality-agnostic design to process textual reasoning, multimodal understanding, and text-to-image generation in one model. A mixed long chain-of-thought fine-tuning strategy creates a common reasoning format across modalities to support later reinforcement learning. UniGRPO, a policy-gradient RL algorithm with diversified rewards, then unifies post-training for both reasoning and generation tasks, producing consistent gains. The 8B version outperforms listed baselines on each task type.

What carries the argument

Unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, plus mixed long CoT fine-tuning and the UniGRPO policy-gradient RL algorithm

If this is right

Text and image tasks can share the same core components and training pipeline.
Unified CoT formats enable effective cold-start reinforcement learning across modalities.
One RL method can drive improvements in both reasoning and generation at the same time.
The 8B model already exceeds several specialized systems on standard benchmarks.
The framework supplies a complete pretraining-to-post-training pipeline for future diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design could lower the cost of running multiple separate models by consolidating them into one.
Similar shared formulations might apply to additional modalities such as audio without major redesign.
Scaling the same architecture to larger sizes could test whether the unification benefits hold at higher capacity.
The RL stage might be adapted to other generative objectives beyond the three evaluated here.

Load-bearing premise

A single probabilistic formulation and modality-agnostic design can integrate and process different data types without any modality-specific components.

What would settle it

An experiment that adds modality-specific components to an otherwise identical model and measures clear gains on any of the three tasks would falsify the claim that the shared design is sufficient.

read the original abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MMaDA bundles a modality-agnostic diffusion backbone with mixed CoT fine-tuning and UniGRPO into one package, but the performance edges over LLaMA-3, Show-o, and SDXL are not isolated to the architecture.

read the letter

The core pitch is a single diffusion model that does textual reasoning, multimodal understanding, and text-to-image generation without separate modality heads, trained via a shared probabilistic setup, a unified long CoT stage, and a new policy-gradient RL method called UniGRPO. They open-source the 8B model and code, which is the most immediately useful part for anyone who wants to test the recipe themselves. The mixed CoT curation that aligns reasoning formats across text and vision is a concrete step that could help cold-start the RL phase, and UniGRPO's diversified reward modeling is presented as a way to handle both reasoning and generation under one post-training loop. That combination is new relative to the cited prior diffusion multimodal work. The claims of beating LLaMA-3-7B on text tasks, Show-o on understanding, and SDXL on generation are the headline results, but they come from the full stack rather than controlled comparisons. The abstract gives no ablations that turn off the CoT or RL pieces to show what the base architecture alone delivers, so it is still open whether the modality-agnostic design is load-bearing or whether the gains are mostly from the training additions. If the full paper has those breakdowns and error analysis, the central claim strengthens; without them the evidence stays joint. This is worth a serious referee for groups working on unified multimodal training pipelines, because the open resources let others run the necessary checks. I would bring it to a reading group for the method details but would not cite it yet until the contribution split is clearer.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MMaDA, a class of multimodal diffusion foundation models that employs a unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, a mixed long chain-of-thought (CoT) fine-tuning strategy to align reasoning across modalities, and the UniGRPO unified policy-gradient RL algorithm for post-training. The central claim is that the resulting 8B model exhibits strong generalization, outperforming LLaMA-3-7B and Qwen2-7B on textual reasoning, Show-o and SEED-X on multimodal understanding, and SDXL and Janus on text-to-image generation.

Significance. If the experimental results and ablations hold, the work would constitute a meaningful advance in unified multimodal modeling by demonstrating that a single diffusion-based architecture can jointly handle reasoning and generation tasks without modality-specific components, while also providing a concrete RL post-training method (UniGRPO) tailored to diffusion models.

major comments (3)

[Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.
[Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.
[Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.

minor comments (2)

The GitHub link for code and models is provided, which supports reproducibility; ensure the released artifacts include the exact training configurations and evaluation scripts used for the reported numbers.
[Method] Notation for the shared probabilistic formulation could be clarified with an explicit equation early in the method section to make the modality-agnostic property easier to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and rigor of our presentation. We address each major comment below and have revised the manuscript accordingly to provide greater specificity in the abstract and experimental sections, as well as to include new ablations that better isolate the contributions of the unified architecture.

read point-by-point responses

Referee: [Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.

Authors: We agree that the abstract and experimental reporting would benefit from greater specificity. In the revised manuscript, we have updated the abstract to explicitly name the benchmarks (GSM8K and MATH for textual reasoning, MMMU and MMBench for multimodal understanding, and COCO FID for text-to-image generation), the evaluation metrics, protocols, and standard deviations. The experimental section has been expanded with tables that include these details and error bars for all comparisons. revision: yes
Referee: [Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.

Authors: We acknowledge that the original manuscript did not sufficiently isolate the base architecture. We have added a dedicated ablation subsection (Section 5.4) in the revised version that evaluates the pre-trained unified diffusion model without mixed CoT fine-tuning or UniGRPO. These results are compared directly against the baselines to demonstrate that the modality-agnostic design provides a competitive foundation for cross-modal integration. revision: yes
Referee: [Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.

Authors: We have addressed this by adding experiments in the revision that remove the mixed CoT stage (replacing it with standard supervised fine-tuning) and omit UniGRPO. The results indicate that while our proposed stages provide additional gains, the unified diffusion architecture remains the primary enabler of the cross-modal capabilities, as performance drops notably without it relative to specialized baselines. revision: yes

Circularity Check

0 steps flagged

No derivation chain reduces to inputs by construction; empirical claims rest on joint innovations

full rationale

The manuscript introduces MMaDA via three explicit design choices (unified diffusion architecture, mixed long CoT fine-tuning, UniGRPO) and reports empirical outperformance. No equations, uniqueness theorems, or fitted-parameter predictions appear in the abstract or described sections that would collapse a claimed result back onto its own inputs. The modality-agnostic property is asserted as an architectural feature rather than derived from prior results. A low score is assigned only because the three contributions are presented jointly, leaving the isolated sufficiency of the base architecture untested; this is a limitation of experimental isolation, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the architecture and algorithms.

pith-pipeline@v0.9.0 · 5611 in / 1023 out tokens · 28649 ms · 2026-05-15T14:47:07.216120+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design... Lunify(θ) =−E t,x0,xt [1/t ∑ I[xi_t=[MASK]] log pθ(xi_0|xt)]
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Mixed Long-CoT fine-tuning... UniGRPO... unified policy-gradient-based RL

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
cs.LG 2026-03 unverdicted novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM 2026-05 unverdicted novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
Relative Score Policy Optimization for Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
Discrete Langevin-Inspired Posterior Sampling
cs.LG 2026-05 unverdicted novelty 7.0

ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
cs.CV 2026-04 unverdicted novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.
NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
cs.LG 2026-05 unverdicted novelty 6.0

NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
cs.AI 2026-04 conditional novelty 6.0

Auditing five frontier VLMs reveals severe grounding failures (max 0.23 IoU, 19.1% Acc@0.5) and format collapse (up to 99% parse failure) in medical VQA; fine-tuning yields 85.5% SLAKE recall but perception remains th...
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
Stability-Weighted Decoding for Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
cs.CV 2026-04 unverdicted novelty 5.0

UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 unverdicted novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
Motus: A Unified Latent Action World Model
cs.CV 2025-12 unverdicted novelty 5.0

Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 19 Pith papers · 24 internal anchors

[1]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

work page 2018
[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020
[3]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

VL-GPT: A generative pre-trained transformer for vision and language understanding and generation.CoRR, abs/2312.09251, 2023

Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. VL-GPT: A generative pre-trained transformer for vision and language understanding and generation.CoRR, abs/2312.09251, 2023

work page arXiv 2023
[6]

Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023

work page arXiv 2023
[7]

Generative multimodal models are in-context learners.arXiv preprint arXiv:2312.13286, 2023a

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023

work page arXiv 2023
[8]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

World model on million-length video and language with ringattention.arXiv preprint, 2024

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint, 2024

work page 2024
[10]

Vila-u: a unified foundation model integrating visual understanding and generation

Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

work page arXiv 2024
[11]

Emu: Generative pretraining in multimodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InICLR, 2023

work page 2023
[13]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

work page arXiv 2024
[14]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Gpt-4 technical report

OpenAI. Gpt-4 technical report. 2023

work page 2023
[16]

DreamLLM: Synergistic multimodal comprehension and creation

Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. InICLR, 2024

work page 2024
[17]

Generating images with multimodal language models

Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advancesin Neural Information Processing Systems, 36:21487–21506, 2023

work page 2023
[18]

Llava-plus: Learning to use tools for creating multimodal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean Conference on Computer Vision, pages 126–142. Springer, 2024

work page 2024
[19]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page arXiv 2024
[20]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020. 18

work page 2020
[24]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[27]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025

Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025

work page 2025
[30]

Xing, and Liang Lin

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022

work page 2022
[31]

Lawrence Zitnick, and Ross Girshick

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, 2016

work page 2016
[32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024

work page 2024
[33]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023
[34]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InCVPR, pages 13040–13051, 2024

work page 2024
[36]

Llava-phi: Efficient multi-modal assistant with small language model

Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. Llava-phi: Efficient multi-modal assistant with small language model. InProceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22, 2024

work page 2024
[37]

The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only. InNeurIPS, 2023

work page 2023
[38]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

work page 2009
[39]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, pages 3558–3568, 2021

work page 2021
[40]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023

work page 2023
[41]

laion-aesthetics-12m-umap

David McClure. laion-aesthetics-12m-umap. https://huggingface.co/datasets/dclure/ laion-aesthetics-12m-umap, 2024. 19

work page 2024
[42]

Journeydb: A benchmark for generative image understanding

Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. InNeurIPS, 2023

work page 2023
[43]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. GitHub, 2023

work page 2023
[44]

Reasonflux: Hierarchical llm reasoning via scaling thought templates

Ling Yang, Zhaochen Yu, Bin Cui, and Mengdi Wang. Reasonflux: Hierarchical llm reasoning via scaling thought templates. arXiv preprint arXiv:2502.06772, 2025

work page arXiv 2025
[45]

Limo: Less is more for reasoning, 2025

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025

work page 2025
[46]

s1: Simple test-time scaling, 2025

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025

work page 2025
[47]

Open Thoughts

OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

work page 2025
[48]

Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

work page 2024
[49]

Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025

work page 2025
[50]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[51]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021

work page 2021
[52]

Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36, 2024

JiazhengXu, XiaoLiu, YuchenWu, YuxuanTong, QinkaiLi, MingDing, JieTang, andYuxiaoDong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[53]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023

work page 2023
[54]

Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

work page arXiv 2025
[55]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

work page 2022
[57]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model.arXiv preprint arXiv:2501.12327, 2025

Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, and Yuexian Zou. Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model.arXiv preprint arXiv:2501.12327, 2025

work page arXiv 2025
[60]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Openai o1 system card.preprint, 2024

OpenAI. Openai o1 system card.preprint, 2024. 20

work page 2024
[62]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

work page 2024
[64]

A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

work page arXiv 2023
[65]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucina- tion of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Visual instruction tuning.NeurIPS, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36, 2024

work page 2024
[67]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.CoRR, abs/2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[69]

Mm1: Methods, analysis & insights from multimodal llm pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024

work page arXiv 2024
[70]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[72]

Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InICLR. OpenReview.net, 2024

work page 2024
[73]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[74]

Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024

work page 2024
[75]

Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. InForty-firstInternational Conference on Machine Learning, 2024

work page 2024
[76]

Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation

Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation. In ICLR, 2025

work page 2025
[77]

An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

work page arXiv 2024
[78]

Diffusion-sharpening: Fine-tuning diffusion models with denoising trajectory sharpening.arXiv preprint arXiv:2502.12146, 2025

Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, and Bin Cui. Diffusion-sharpening: Fine-tuning diffusion models with denoising trajectory sharpening.arXiv preprint arXiv:2502.12146, 2025

work page arXiv 2025
[79]

Reward-directed conditional diffusion: Provable distribution estimation and reward improvement.Advancesin Neural Information Processing Systems, 36:60599–60635, 2023

Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, and Mengdi Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement.Advancesin Neural Information Processing Systems, 36:60599–60635, 2023

work page 2023
[80]

Gradient guidance for diffusion models: An optimization perspective.arXiv preprint arXiv:2404.14743, 2024

Yingqing Guo, Hui Yuan, Yukang Yang, Minshuo Chen, and Mengdi Wang. Gradient guidance for diffusion models: An optimization perspective.arXiv preprint arXiv:2404.14743, 2024. 21

work page arXiv 2024
[81]

Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data

Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. InInternational Conference on Machine Learning, pages 4672–4712. PMLR, 2023

work page 2023
[82]

Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303, 2024

work page arXiv 2024
[83]

Improving diffusion-based image synthesis with context prediction.Advances in Neural Information Processing Systems, 36:37636–37656, 2023

Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, and Bin Cui. Improving diffusion-based image synthesis with context prediction.Advances in Neural Information Processing Systems, 36:37636–37656, 2023

work page 2023
[84]

Structure-guided adversarial training of diffusion models

Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, and Bin Cui. Structure-guided adversarial training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7256–7266, 2024

work page 2024

Showing first 80 references.