LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Pith reviewed 2026-05-17 03:40 UTC · model grok-4.3
The pith
A purely diffusion-based multimodal model matches autoregressive leaders on visual instruction tasks by adding a vision encoder to a language diffusion backbone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA-V integrates a vision encoder and MLP connector into the LLaDA language diffusion model, projecting visual features into the language embedding space and enabling masked diffusion training on visual instruction data. Trained on the same data as LLaMA3-V, it is competitive across multimodal tasks, narrows the gap to Qwen2-VL, and outperforms other diffusion-based and hybrid MLLMs, showing that large language diffusion models can remain effective for multimodal understanding despite weaker standalone text performance.
What carries the argument
Masked diffusion process applied to language tokens combined with a vision encoder and MLP connector that maps visual features into the shared embedding space for joint denoising.
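The encoder-connector-diffusion wiring described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all dimensions, the two-layer MLP shape, and the module names are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a LLaDA-V-style front end: vision-encoder patch
# features pass through an MLP connector into the language embedding space,
# then get concatenated with (partially masked) text embeddings so the
# diffusion backbone can denoise the joint sequence.

class MLPConnector(nn.Module):
    def __init__(self, vision_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, embed_dim)
        return self.proj(patch_feats)

batch, num_patches, vision_dim = 2, 16, 512
embed_dim, seq_len = 256, 8

connector = MLPConnector(vision_dim, embed_dim)
patch_feats = torch.randn(batch, num_patches, vision_dim)  # from a vision encoder
text_embeds = torch.randn(batch, seq_len, embed_dim)       # masked-token embeddings

visual_embeds = connector(patch_feats)
# Joint sequence in the shared embedding space, ready for masked denoising.
joint_seq = torch.cat([visual_embeds, text_embeds], dim=1)
```

The key design point carried by the paper's claim is that only the connector bridges modalities; the diffusion backbone itself is unchanged.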
If this is right
- The architecture supports better data scalability on multimodal tasks than some autoregressive baselines under identical training conditions.
- Performance advantages appear concentrated in multimodal understanding rather than pure language modeling.
- Results encourage replacing or complementing autoregressive decoding with diffusion steps in future multimodal systems.
- The model narrows the gap to strong autoregressive systems such as Qwen2-VL on shared instruction data.
Where Pith is reading between the lines
- Parallel denoising in diffusion could enable faster or more flexible multimodal output generation than left-to-right token prediction.
- The approach may generalize to additional modalities if the same encoder-connector pattern is applied to audio or other inputs.
- Longer context or higher-resolution visual inputs could be tested to measure whether diffusion maintains coherence better than autoregressive models under increased sequence length.
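The parallel-denoising idea in the first bullet can be illustrated with a toy decoding loop: instead of committing to one token at a time left to right, each step unmasks the k most confident of the remaining masked positions. The model here is a random-logit stand-in, and the confidence-based schedule is one common choice in masked-diffusion samplers, not necessarily the paper's exact procedure.

```python
import torch

MASK = -1
torch.manual_seed(0)

def fake_model(tokens: torch.Tensor, vocab: int = 50) -> torch.Tensor:
    # Stand-in for the denoiser: random per-position logits.
    return torch.randn(tokens.shape[0], vocab)

def parallel_denoise(seq_len: int = 12, steps: int = 4) -> torch.Tensor:
    tokens = torch.full((seq_len,), MASK)
    per_step = seq_len // steps            # several positions resolved per step
    for _ in range(steps):
        logits = fake_model(tokens)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != MASK] = -1.0        # already-decoded positions stay fixed
        idx = conf.topk(per_step).indices  # unmask the most confident positions
        tokens[idx] = pred[idx]
    return tokens

out = parallel_denoise()  # all 12 positions filled after 4 parallel steps
```

With 12 positions and 4 steps this resolves 3 tokens per forward pass, which is where the potential latency advantage over strictly sequential decoding comes from.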
Load-bearing premise
That empirical gains on the selected instruction-tuning datasets and benchmarks will hold beyond this specific mixture and that the diffusion process can preserve coherent multimodal alignment without autoregressive sequential constraints.
What would settle it
A clear drop in multimodal benchmark scores when the same model is tested on instruction data drawn from distributions outside the training mixture or when the vision encoder and connector are ablated while keeping the diffusion backbone fixed.
read the original abstract
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLaDA-V, a purely diffusion-based Multimodal Large Language Model extending the LLaDA language diffusion model via a vision encoder and MLP connector for visual instruction tuning. It reports competitive multimodal performance to LLaMA3-V when using identical instruction data, better data scalability, narrowing of the gap to Qwen2-VL, and state-of-the-art results among existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs.
Significance. If the performance claims are shown to hold under controlled comparisons of data volume and vision encoders, the work would establish that masked diffusion models can achieve effective multimodal alignment and instruction following without autoregressive decoding. The reported competitiveness despite a weaker base language model on text-only tasks, together with scalability observations, would provide evidence that diffusion paradigms merit further study as alternatives to dominant autoregressive MLLMs.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).
- [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.
minor comments (2)
- [Conclusion] The manuscript could add a dedicated limitations paragraph discussing potential coherence issues in long multimodal sequences under the diffusion process.
- [§3] Notation for the masked diffusion objective and the MLP connector projection could be made more explicit with an equation reference in §3.
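For context, the objective the second minor comment points to is usually written, in standard masked-diffusion notation with visual conditioning added (a sketch of the common formulation, not necessarily the manuscript's own equation):

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[
  \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]
  \log p_\theta\!\left(x_0^{i} \mid x_t,\, v\right)
\right],
\qquad v = \mathrm{MLP}\!\left(f_{\mathrm{vision}}(\mathrm{image})\right),
```

where $t \sim \mathcal{U}(0,1]$, $x_t$ masks each token of $x_0$ independently with probability $t$, $\mathrm{M}$ is the mask token, and $v$ denotes the visual features projected by the connector.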
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).
Authors: We agree that an explicit comparison of training resources would better support the SOTA claim and help isolate architectural effects. In the revised manuscript we will add a table in §4 summarizing the number of training tokens, image-text pairs, and vision-encoder parameters for LLaDA-V and the compared hybrid and pure-diffusion MLLMs, citing the original publications for the baselines. This addition will clarify the scale of the comparisons. revision: yes
-
Referee: [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.
Authors: We acknowledge the value of statistical reporting. The LLaMA3-V comparison uses exactly the same instruction data, which already controls for data volume. All evaluations follow the standard benchmark protocols and splits published with each dataset. Because of the substantial compute required for large-model training, we report single-run results. In revision we will (i) explicitly state the evaluation splits used, (ii) note the consistency of gains across tasks, and (iii) add a limitations paragraph acknowledging the lack of multiple runs and statistical tests. revision: partial
Circularity Check
No circularity in empirical performance claims
full rationale
The paper introduces LLaDA-V by extending a prior diffusion language model with a vision encoder and reports benchmark results after instruction tuning. All central claims consist of observed performance numbers on public multimodal benchmarks rather than any derivation, prediction, or first-principles result that reduces to fitted parameters or self-citations by construction. No equations appear that would turn training objectives back into outputs, and comparisons to LLaMA3-V or Qwen2-VL are presented as empirical observations, not forced equivalences. Self-reference to the base LLaDA model is architectural background and does not load-bear any circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP connector dimensions and training schedule
axioms (1)
- domain assumption Masked diffusion language modeling can be extended to multimodal inputs via a vision encoder and linear projection without loss of coherence.
Forward citations
Cited by 19 Pith papers
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
-
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
ELF: Embedded Language Flows
ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
Reference graph
Works this paper leans on
-
[1]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023
work page 2023
-
[2]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306
work page 2024
-
[3]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24185–24198
work page 2024
-
[5]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv:2405.09818, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024
-
[10]
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Y. Wang, X. Li, Z. Yan, Y. He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao et al., “Internvideo2.5: Empowering video mllms with long and rich context modeling,” arXiv preprint arXiv:2501.12386, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Sharegpt4video: Improving video understanding and generation with better captions,
L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., “Sharegpt4video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19472–19495, 2024
work page 2024
-
[12]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Improving language understanding by generative pre-training,
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018
work page 2018
-
[14]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
work page 2019
-
[15]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020
work page 2020
-
[16]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Textbooks Are All You Need II: phi-1.5 technical report
Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv preprint arXiv:2309.05463, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al., “Deepseek llm: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Deep unsupervised learning using nonequilibrium thermodynamics,
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265
work page 2015
-
[23]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020
work page 2020
-
[24]
Score-Based Generative Modeling through Stochastic Differential Equations
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[25]
Argmax flows and multinomial diffusion: Learning categorical distributions,
E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,” NeurIPS, vol. 34, pp. 12454–12465, 2021
work page 2021
-
[26]
Structured denoising diffusion models in discrete state-spaces,
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” in Advances in Neural Information Processing Systems, 2021
work page 2021
-
[27]
One transformer fits all distributions in multi-modal diffusion at scale,
F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” in International Conference on Machine Learning. PMLR, 2023, pp. 1692–1717
work page 2023
-
[28]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” arXiv preprint arXiv:2408.12528, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy, “Transfusion: Predict the next token and diffuse images with one multi-modal model,” arXiv preprint arXiv:2408.11039, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, L. Zhao et al., “Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation,” arXiv preprint arXiv:2411.07975, 2024
-
[31]
Metamorph: Multimodal understanding and generation via instruction tuning,
S. Tong, D. Fan, J. Zhu, Y . Xiong, X. Chen, K. Sinha, M. Rabbat, Y . LeCun, S. Xie, and Z. Liu, “Metamorph: Multimodal understanding and generation via instruction tuning,” arXiv preprint arXiv:2412.14164, 2024
-
[32]
Orthus: Autoregressive interleaved image-text generation with modality-specific heads,
S. Kou, J. Jin, Z. Liu, C. Liu, Y. Ma, J. Jia, Q. Chen, P. Jiang, and Z. Deng, “Orthus: Autoregressive interleaved image-text generation with modality-specific heads,” arXiv preprint arXiv:2412.00127, 2024
-
[33]
Unified multimodal discrete diffusion,
A. Swerdlow, M. Prabhudesai, S. Gandhi, D. Pathak, and K. Fragkiadaki, “Unified multimodal discrete diffusion,” arXiv preprint arXiv:2503.20853, 2025
-
[34]
Dual diffusion for unified image generation and understanding,
Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P. Wang, “Dual diffusion for unified image generation and understanding,” arXiv preprint arXiv:2501.00289, 2024
-
[35]
A continuous time framework for discrete denoising models,
A. Campbell, J. Benton, V. D. Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet, “A continuous time framework for discrete denoising models,” in Advances in Neural Information Processing Systems, 2022
work page 2022
-
[36]
Diffusionbert: Improving generative masked language models with diffusion models,
Z. He, T. Sun, K. Wang, X. Huang, and X. Qiu, “Diffusionbert: Improving generative masked language models with diffusion models,” arXiv preprint arXiv:2211.15029, 2022
-
[37]
Score-based continuous-time discrete diffusion models,
H. Sun, L. Yu, B. Dai, D. Schuurmans, and H. Dai, “Score-based continuous-time discrete diffusion models,” in The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[38]
Discrete diffusion modeling by estimating the ratios of the data distribution,
A. Lou, C. Meng, and S. Ermon, “Discrete diffusion modeling by estimating the ratios of the data distribution,” in Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[39]
Simplified and generalized masked diffusion for discrete data,
J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias, “Simplified and generalized masked diffusion for discrete data,” arXiv preprint arXiv:2406.04329, 2024
-
[40]
Simple and effective masked diffusion language models,
S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov, “Simple and effective masked diffusion language models,” arXiv preprint arXiv:2406.07524, 2024
-
[41]
Your absorbing discrete diffusion secretly models the conditional distributions of clean data,
J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li, “Your absorbing discrete diffusion secretly models the conditional distributions of clean data,” arXiv preprint arXiv:2406.03736, 2024
-
[42]
Large Language Diffusion Models
S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Effective and efficient masked image generation models,
Z. You, J. Ou, X. Zhang, J. Hu, J. Zhou, and C. Li, “Effective and efficient masked image generation models,” arXiv preprint arXiv:2503.07197, 2025
-
[44]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Are We on the Right Way for Evaluating Large Vision-Language Models?
L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin et al., “Are we on the right way for evaluating large vision-language models?” arXiv preprint arXiv:2403.20330, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Scaling up masked diffusion models on text,
S. Nie, F. Zhu, C. Du, T. Pang, Q. Liu, G. Zeng, M. Lin, and C. Li, “Scaling up masked diffusion models on text,” arXiv preprint arXiv:2410.18514, 2024
-
[47]
A. Campbell, J. Yim, R. Barzilay, T. Rainforth, and T. Jaakkola, “Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design,” 2024
work page 2024
-
[48]
V. T. Hu and B. Ommer, “[mask] is all you need,” 2024. [Online]. Available: https://arxiv.org/abs/2412.06787
-
[49]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
work page 2021
-
[50]
Sigmoid loss for language image pre-training,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986
work page 2023
-
[51]
Wan: Open and Advanced Large-Scale Video Generative Models
A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng et al. , “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., “Hunyuanvideo: A systematic framework for large video generative models,” arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Llava-next: Improved reasoning, ocr, and world knowledge,
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next/
work page 2024
-
[55]
J. Guo, T. Zheng, Y. Bai, B. Li, Y. Wang, K. Zhu, Y. Li, G. Neubig, W. Chen, and X. Yue, “Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale,” arXiv preprint arXiv:2412.05237, 2024
-
[56]
Visualwebinstruct: Scaling up multimodal instruction data through web search,
Y. Jia, J. Li, X. Yue, B. Li, P. Nie, K. Zou, and W. Chen, “Visualwebinstruct: Scaling up multimodal instruction data through web search,” arXiv preprint arXiv:2503.10582, 2025
-
[57]
Qwen3: Think deeper, act faster,
Q. Team, “Qwen3: Think deeper, act faster,” 2025, https://qwenlm.github.io/blog/qwen3/. [Online]. Available: https://qwenlm.github.io/blog/qwen3/
work page 2025
-
[58]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[59]
Direct preference optimization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023
work page 2023
-
[60]
Simpo: Simple preference optimization with a reference-free reward,
Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” Advances in Neural Information Processing Systems, vol. 37, pp. 124198–124235, 2024
work page 2024
-
[61]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9556–9567
-
[63]
X. Yue, T. Zheng, Y. Ni, Y. Wang, K. Zhang, S. Tong, Y. Sun, B. Yu, G. Zhang, H. Sun et al., “Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark,” arXiv preprint arXiv:2409.02813, 2024
-
[64]
C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” arXiv preprint arXiv:2306.13394, 2023
-
[65]
B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” arXiv preprint arXiv:2307.16125, 2023
-
[66]
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu et al., “Mmbench: Is your multi-modal model an all-around player?” in European Conference on Computer Vision. Springer, 2024, pp. 216–233
-
[67]
R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y. Qiao et al., “Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?” in European Conference on Computer Vision. Springer, 2024, pp. 169–186
-
[68]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models,” CoRR, 2023
-
[69]
A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi, “A diagram is worth a dozen images,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 235–251
-
[70]
A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,” arXiv preprint arXiv:2203.10244, 2022
-
[71]
M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2200–2209
-
[72]
M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar, “Infographicvqa,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1697–1706
-
[73]
x.ai, “Grok-1.5 vision preview,” 2024. [Online]. Available: https://x.ai/news/grok-1.5v/
-
[74]
F. Wang, X. Fu, J. Y. Huang, Z. Li, Q. Liu, X. Liu, M. D. Ma, N. Xu, W. Zhou, K. Zhang et al., “Muirbench: A comprehensive benchmark for robust multi-image understanding,” arXiv preprint arXiv:2406.09411, 2024
-
[75]
J. Zhou, Y. Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu, “Mlvu: A comprehensive benchmark for multi-task long video understanding,” arXiv preprint arXiv:2406.04264, 2024
-
[76]
C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” arXiv preprint arXiv:2405.21075, 2024
-
[77]
L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin, “Sharegpt4v: Improving large multi-modal models with better captions,” in European Conference on Computer Vision. Springer, 2024, pp. 370–387
-
[78]
P. Tong, E. Brown, P. Wu, S. Woo, A. J. V. Iyer, S. C. Akula, S. Yang, J. Yang, M. Middepogu, Z. Wang et al., “Cambrian-1: A fully open, vision-centric exploration of multimodal llms,” Advances in Neural Information Processing Systems, vol. 37, pp. 87310–87356, 2024
-
[79]
H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang et al., “Deepseek-vl: towards real-world vision-language understanding,” arXiv preprint arXiv:2403.05525, 2024
-
[80]
Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang et al., “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv:2412.10302, 2024