MMaDA: Multimodal Large Diffusion Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3
The pith
A single diffusion architecture unifies text reasoning, multimodal understanding, and image generation without modality-specific parts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMaDA uses a unified diffusion architecture with a shared probabilistic formulation and modality-agnostic design to process textual reasoning, multimodal understanding, and text-to-image generation in one model. A mixed long chain-of-thought fine-tuning strategy creates a common reasoning format across modalities to support later reinforcement learning. UniGRPO, a policy-gradient RL algorithm with diversified rewards, then unifies post-training for both reasoning and generation tasks, producing consistent gains. The 8B version outperforms listed baselines on each task type.
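For concreteness, a minimal sketch of what such a shared training objective could look like, assuming the masked-token diffusion loss quoted later on this page in the theorem-link section; the tensor shapes, the [MASK] id, and the vocabulary layout are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id; the paper's vocabulary layout is not given here


def unified_diffusion_loss(model, x0, t_frac):
    """Masked-token diffusion loss over one unified token sequence.

    x0     : (B, L) clean token ids; text tokens and image-codebook tokens share
             a single vocabulary, so the same loss covers both modalities.
    t_frac : (B,) masking ratio t, sampled from (0, 1] for each example.
    """
    # Corrupt x0 by independently replacing each position with [MASK] with probability t.
    mask = torch.rand(x0.shape, device=x0.device) < t_frac[:, None]
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    # One forward pass of the shared, modality-agnostic transformer.
    logits = model(xt)  # (B, L, V)

    # Cross-entropy on masked positions only, weighted by 1/t as in L_unify.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x0.reshape(-1), reduction="none"
    ).view(x0.shape)
    per_example = (ce * mask).sum(dim=1) / t_frac.clamp_min(1e-6)
    return per_example.mean()
```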
What carries the argument
Unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, plus mixed long CoT fine-tuning and the UniGRPO policy-gradient RL algorithm
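The exact UniGRPO update is not spelled out in this summary. As a rough sketch, assuming it keeps the group-relative baseline of standard GRPO and mixes task-specific reward signals ("diversified rewards"), the advantage computation might look like the following; the reward functions and weights are placeholders, not the paper's.

```python
from typing import Callable, Sequence

import torch


def diversified_reward(sample, reward_fns: Sequence[Callable], weights: Sequence[float]) -> float:
    """Placeholder for 'diversified reward modeling': a weighted sum of task-specific
    signals (e.g. answer correctness for reasoning, an image-text alignment score for
    generation). The actual reward mix used by UniGRPO is not specified here."""
    return sum(w * fn(sample) for fn, w in zip(reward_fns, weights))


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style baseline: each sample's advantage is its reward standardized against
    the other samples drawn for the same prompt, so no learned value critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```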
If this is right
- Text and image tasks can share the same core components and training pipeline.
- Unified CoT formats enable effective cold-start reinforcement learning across modalities.
- One RL method can drive improvements in both reasoning and generation at the same time.
- The 8B model already exceeds several specialized systems on standard benchmarks.
- The framework supplies a complete pretraining-to-post-training pipeline for future diffusion models.
Where Pith is reading between the lines
- The design could lower the cost of running multiple separate models by consolidating them into one.
- Similar shared formulations might apply to additional modalities such as audio without major redesign.
- Scaling the same architecture to larger sizes could test whether the unification benefits hold at higher capacity.
- The RL stage might be adapted to other generative objectives beyond the three evaluated here.
Load-bearing premise
A single probabilistic formulation and modality-agnostic design can integrate and process different data types without any modality-specific components.
What would settle it
An experiment that adds modality-specific components to an otherwise identical model and measures clear gains on any of the three tasks would falsify the claim that the shared design is sufficient.
Original abstract
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MMaDA, a class of multimodal diffusion foundation models that employs a unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, a mixed long chain-of-thought (CoT) fine-tuning strategy to align reasoning across modalities, and the UniGRPO unified policy-gradient RL algorithm for post-training. The central claim is that the resulting 8B model exhibits strong generalization, outperforming LLaMA-3-7B and Qwen2-7B on textual reasoning, Show-o and SEED-X on multimodal understanding, and SDXL and Janus on text-to-image generation.
Significance. If the experimental results and ablations hold, the work would constitute a meaningful advance in unified multimodal modeling by demonstrating that a single diffusion-based architecture can jointly handle reasoning and generation tasks without modality-specific components, while also providing a concrete RL post-training method (UniGRPO) tailored to diffusion models.
major comments (3)
- [Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.
- [Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.
- [Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.
minor comments (2)
- The GitHub link for code and models is provided, which supports reproducibility; ensure the released artifacts include the exact training configurations and evaluation scripts used for the reported numbers.
- [Method] Notation for the shared probabilistic formulation could be clarified with an explicit equation early in the method section to make the modality-agnostic property easier to follow.
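For illustration, a plausible explicit form of that shared formulation, reconstructed from the masked-token objective quoted in the theorem-link section below (the paper's exact notation may differ):

```latex
\mathcal{L}_{\text{unify}}(\theta)
  = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
      \frac{1}{t}\sum_{i=1}^{L}
      \mathbb{I}\!\left[x_t^{\,i} = \texttt{[MASK]}\right]\,
      \log p_\theta\!\left(x_0^{\,i} \mid x_t\right)
    \right]
```

Here $x_0$ is the clean unified token sequence (text or image tokens), $x_t$ is its corruption with each position independently masked at noise level $t$, and the indicator restricts the sum to masked positions.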
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the clarity and rigor of our presentation. We address each major comment below and have revised the manuscript accordingly to provide greater specificity in the abstract and experimental sections, as well as to include new ablations that better isolate the contributions of the unified architecture.
Point-by-point responses
-
Referee: [Abstract] Abstract and experimental sections: performance claims (outperformance over LLaMA-3-7B, Show-o, SDXL, etc.) are stated without reference to specific benchmarks, metrics, evaluation protocols, or error bars; this prevents verification of the generalization assertions.
Authors: We agree that the abstract and experimental reporting would benefit from greater specificity. In the revised manuscript, we have updated the abstract to explicitly name the benchmarks (GSM8K and MATH for textual reasoning, MMMU and MMBench for multimodal understanding, and COCO FID for text-to-image generation) along with the evaluation metrics, protocols, and standard deviations used. The experimental section has been expanded with tables that include these details and error bars for all comparisons. revision: yes
-
Referee: [Method] Method and experiments: the claim that the modality-agnostic unified diffusion architecture is sufficient for seamless multimodal integration is not isolated from the mixed long CoT fine-tuning and UniGRPO contributions; no ablation studies are described that evaluate the base architecture alone against the baselines.
Authors: We acknowledge that the original manuscript did not sufficiently isolate the base architecture. We have added a dedicated ablation subsection (Section 5.4) in the revised version that evaluates the pre-trained unified diffusion model without mixed CoT fine-tuning or UniGRPO. These results are compared directly against the baselines to demonstrate that the modality-agnostic design provides a competitive foundation for cross-modal integration. revision: yes
-
Referee: [Experiments] Experiments: the paper does not report whether the reported gains persist when the CoT and UniGRPO stages are removed or replaced with standard fine-tuning, leaving open whether the architecture itself is load-bearing for the cross-modal results.
Authors: We have addressed this by adding experiments in the revision that remove the mixed CoT stage (replacing it with standard supervised fine-tuning) and omit UniGRPO. The results indicate that while our proposed stages provide additional gains, the unified diffusion architecture remains the primary enabler of the cross-modal capabilities, as performance drops notably without it relative to specialized baselines. revision: yes
Circularity Check
No derivation chain reduces to inputs by construction; empirical claims rest on joint innovations
full rationale
The manuscript introduces MMaDA via three explicit design choices (unified diffusion architecture, mixed long CoT fine-tuning, UniGRPO) and reports empirical outperformance. No equations, uniqueness theorems, or fitted-parameter predictions appear in the abstract or described sections that would collapse a claimed result back onto its own inputs. The modality-agnostic property is asserted as an architectural feature rather than derived from prior results. A low score is assigned only because the three contributions are presented jointly, leaving the isolated sufficiency of the base architecture untested; this is a limitation of experimental isolation, not a circular reduction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design..." $\mathcal{L}_{\text{unify}}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\big[\tfrac{1}{t}\sum_i \mathbb{I}[x_t^i = \texttt{[MASK]}]\,\log p_\theta(x_0^i \mid x_t)\big]$
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Mixed Long-CoT fine-tuning... UniGRPO... unified policy-gradient-based RL
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
Discrete Langevin-Inspired Posterior Sampling
ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
-
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.
-
NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.
-
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
Auditing five frontier VLMs reveals severe grounding failures (max 0.23 IoU, 19.1% Acc@0.5) and format collapse (up to 99% parse failure) in medical VQA; fine-tuning yields 85.5% SLAKE recall but perception remains th...
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Reference graph
Works this paper leans on
-
[1]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018
work page 2018
-
[2]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...
work page 2020
-
[3]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. VL-GPT: A generative pre-trained transformer for vision and language understanding and generation.CoRR, abs/2312.09251, 2023
-
[6]
Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality.CoRR, abs/2307.05222, 2023
-
[7]
Generative multimodal models are in-context learners.arXiv preprint arXiv:2312.13286, 2023a
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023
-
[8]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
World model on million-length video and language with ringattention.arXiv preprint, 2024
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint, 2024
work page 2024
-
[10]
Vila-u: a unified foundation model integrating visual understanding and generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024
-
[11]
Emu: Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InICLR, 2023
work page 2023
-
[13]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024
-
[14]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [15]
-
[16]
DreamLLM: Synergistic multimodal comprehension and creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. InICLR, 2024
work page 2024
-
[17]
Generating images with multimodal language models
Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advancesin Neural Information Processing Systems, 36:21487–21506, 2023
work page 2023
-
[18]
Llava-plus: Learning to use tools for creating multimodal agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean Conference on Computer Vision, pages 126–142. Springer, 2024
work page 2024
-
[19]
Seed-x: Multimodal models with unified multi-granularity comprehension and generation
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024
-
[20]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020
work page 2020
-
[24]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[27]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025
Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning, 2025
work page 2025
-
[30]
Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022
work page 2022
-
[31]
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, 2016
work page 2016
-
[32]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024
work page 2024
-
[33]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[34]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InCVPR, pages 13040–13051, 2024
work page 2024
-
[36]
Llava-phi: Efficient multi-modal assistant with small language model
Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. Llava-phi: Efficient multi-modal assistant with small language model. InProceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22, 2024
work page 2024
-
[37]
The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only. InNeurIPS, 2023
work page 2023
-
[38]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009
work page 2009
-
[39]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, pages 3558–3568, 2021
work page 2021
-
[40]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023
work page 2023
-
[41]
David McClure. laion-aesthetics-12m-umap. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap, 2024
work page 2024
-
[42]
Journeydb: A benchmark for generative image understanding
Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. InNeurIPS, 2023
work page 2023
- [43]
-
[44]
Reasonflux: Hierarchical llm reasoning via scaling thought templates
Ling Yang, Zhaochen Yu, Bin Cui, and Mengdi Wang. Reasonflux: Hierarchical llm reasoning via scaling thought templates. arXiv preprint arXiv:2502.06772, 2025
-
[45]
Limo: Less is more for reasoning, 2025
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025
work page 2025
-
[46]
s1: Simple test-time scaling, 2025
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025
work page 2025
-
[47]
OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025
work page 2025
-
[48]
Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024
work page 2024
-
[49]
Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025
Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, 2025
work page 2025
-
[50]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[51]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021
work page 2021
-
[52]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[53]
Geneval: An object-focused framework for evaluating text-to-image alignment
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023
work page 2023
-
[54]
Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025
-
[55]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022
work page 2022
-
[57]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, and Yuexian Zou. Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model.arXiv preprint arXiv:2501.12327, 2025
-
[60]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
Openai o1 system card.preprint, 2024
OpenAI. Openai o1 system card. Preprint, 2024
work page 2024
-
[62]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024
work page 2024
-
[64]
A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023
-
[65]
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
Visual instruction tuning.NeurIPS, 36, 2024
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36, 2024
work page 2024
-
[67]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.CoRR, abs/2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[69]
Mm1: Methods, analysis & insights from multimodal llm pre-training
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024
-
[70]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[72]
Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InICLR. OpenReview.net, 2024
work page 2024
-
[73]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[74]
Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to- image generation via large mixture of diffusion paths.NeurIPS, 36, 2024
work page 2024
-
[75]
Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms
Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. InForty-firstInternational Conference on Machine Learning, 2024
work page 2024
-
[76]
Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation. In ICLR, 2025
work page 2025
-
[77]
Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024
-
[78]
Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, and Bin Cui. Diffusion-sharpening: Fine-tuning diffusion models with denoising trajectory sharpening.arXiv preprint arXiv:2502.12146, 2025
-
[79]
Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, and Mengdi Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement.Advancesin Neural Information Processing Systems, 36:60599–60635, 2023
work page 2023
-
[80]
Yingqing Guo, Hui Yuan, Yukang Yang, Minshuo Chen, and Mengdi Wang. Gradient guidance for diffusion models: An optimization perspective. arXiv preprint arXiv:2404.14743, 2024
-
[81]
Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. InInternational Conference on Machine Learning, pages 4672–4712. PMLR, 2023
work page 2023
-
[82]
Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303, 2024
-
[83]
Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, and Bin Cui. Improving diffusion-based image synthesis with context prediction.Advances in Neural Information Processing Systems, 36:37636–37656, 2023
work page 2023
-
[84]
Structure-guided adversarial training of diffusion models
Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, and Bin Cui. Structure-guided adversarial training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7256–7266, 2024
work page 2024