{"paper":{"title":"MMaDA: Multimodal Large Diffusion Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bowen Li, Ke Shen, Ling Yang, Mengdi Wang, Xinchen Zhang, Ye Tian, Yunhai Tong","submitted_at":"2025-05-21T17:59:05Z","abstract_excerpt":"We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT)"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The shared probabilistic formulation and modality-agnostic design in the unified diffusion architecture is sufficient to seamlessly integrate and process different data types without modality-specific components.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a88e133fea8857fa0f30815db5e92bd269a2c871019fca0e4649f449dc7b88df"},"source":{"id":"2505.15809","kind":"arxiv","version":2},"verdict":{"id":"9895232e-4f05-421d-b7c2-9de72efa4bbb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T14:47:07.216120Z","strongest_claim":"MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation.","one_line_summary":"MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The shared probabilistic formulation and modality-agnostic design in the unified diffusion architecture is sufficient to seamlessly integrate and process different data types without modality-specific components.","pith_extraction_headline":"A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts."},"references":{"count":126,"sample":[{"doi":"","year":2018,"title":"Improving language understanding by generative pre-training","work_id":"72bdd905-4f91-46af-883a-4c2849c99ffd","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"97345cea-ff46-4103-a836-60d34717536c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","ref_index":3,"cited_arxiv_id":"2412.16720","is_internal_anchor":true},{"doi":"","year":2023,"title":"Vl-gpt: A generative pre-trained transformer for vision and language understanding and generation","work_id":"2891128c-bc6e-4651-be44-a7b8f6f3ad91","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Emu: Generative Pretraining in Multimodality","work_id":"d20ff802-68ef-4783-943f-7bf17acb6a68","ref_index":6,"cited_arxiv_id":"2307.05222","is_internal_anchor":true}],"resolved_work":126,"snapshot_sha256":"5250db472d273842a04ea9d80139cfb7ea5056430e1c598a99862e7d69363eed","internal_anchors":32},"formal_canon":{"evidence_count":2,"snapshot_sha256":"fb2a8088bfdccd40909723b92584d7d46db4a9731f0e15ca3a443baeb5f5803a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}