{"work":{"id":"9d626cf3-094e-4960-9e71-a00a47158639","openalex_id":null,"doi":null,"arxiv_id":"2505.15809","raw_key":null,"title":"MMaDA: Multimodal Large Diffusion Language Models","authors":null,"authors_text":"Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong","year":2025,"venue":"cs.CV","abstract":"We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA","external_url":"https://arxiv.org/abs/2505.15809","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:35:29.595059+00:00","pith_arxiv_id":"2505.15809","created_at":"2026-05-09T06:15:39.676805+00:00","updated_at":"2026-05-25T07:35:29.595059+00:00","title_quality_ok":true,"display_title":"MMaDA: Multimodal Large Diffusion Language Models","render_title":"MMaDA: Multimodal Large Diffusion Language Models"},"hub":{"state":{"work_id":"9d626cf3-094e-4960-9e71-a00a47158639","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":26,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2025-06-18T15:39:15+00:00","last_pith_cited_at":"2026-05-22T02:31:32+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T11:32:21.842260+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7},{"context_role":"baseline","n":2},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":7},{"context_polarity":"baseline","n":2},{"context_polarity":"use_method","n":2}],"runs":{},"summary":{},"graph":{},"authors":[]}}