LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Pith reviewed 2026-05-17 03:40 UTC · model grok-4.3
The pith
A purely diffusion-based multimodal model matches autoregressive leaders on visual instruction tasks by adding a vision encoder to a language diffusion backbone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA-V integrates a vision encoder and MLP connector into the LLaDA language diffusion model, projecting visual features into the language embedding space and enabling masked diffusion training on visual instruction data. Trained on the same data as LLaMA3-V, it is competitive across multimodal tasks, narrows the gap to Qwen2-VL, and outperforms other diffusion-based and hybrid MLLMs, showing that large language diffusion models can remain effective for multimodal understanding despite weaker standalone text performance.
What carries the argument
Masked diffusion process applied to language tokens combined with a vision encoder and MLP connector that maps visual features into the shared embedding space for joint denoising.
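The encoder-connector-diffusion wiring described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all dimensions, the two-layer MLP shape, and the module names are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a LLaDA-V-style front end: vision-encoder patch
# features pass through an MLP connector into the language embedding space,
# then get concatenated with (partially masked) text embeddings so the
# diffusion backbone can denoise the joint sequence.

class MLPConnector(nn.Module):
    def __init__(self, vision_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, embed_dim)
        return self.proj(patch_feats)

batch, num_patches, vision_dim = 2, 16, 512
embed_dim, seq_len = 256, 8

connector = MLPConnector(vision_dim, embed_dim)
patch_feats = torch.randn(batch, num_patches, vision_dim)  # from a vision encoder
text_embeds = torch.randn(batch, seq_len, embed_dim)       # masked-token embeddings

visual_embeds = connector(patch_feats)
# Joint sequence in the shared embedding space, ready for masked denoising.
joint_seq = torch.cat([visual_embeds, text_embeds], dim=1)
```

The key design point carried by the paper's claim is that only the connector bridges modalities; the diffusion backbone itself is unchanged.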
If this is right
- The architecture supports better data scalability on multimodal tasks than some autoregressive baselines under identical training conditions.
- Performance advantages appear concentrated in multimodal understanding rather than pure language modeling.
- Results encourage replacing or complementing autoregressive decoding with diffusion steps in future multimodal systems.
- The model narrows the gap to strong autoregressive systems such as Qwen2-VL on shared instruction data.
Where Pith is reading between the lines
- Parallel denoising in diffusion could enable faster or more flexible multimodal output generation than left-to-right token prediction.
- The approach may generalize to additional modalities if the same encoder-connector pattern is applied to audio or other inputs.
- Longer context or higher-resolution visual inputs could be tested to measure whether diffusion maintains coherence better than autoregressive models under increased sequence length.
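The parallel-denoising idea in the first bullet can be illustrated with a toy decoding loop: instead of committing to one token at a time left to right, each step unmasks the k most confident of the remaining masked positions. The model here is a random-logit stand-in, and the confidence-based schedule is one common choice in masked-diffusion samplers, not necessarily the paper's exact procedure.

```python
import torch

MASK = -1
torch.manual_seed(0)

def fake_model(tokens: torch.Tensor, vocab: int = 50) -> torch.Tensor:
    # Stand-in for the denoiser: random per-position logits.
    return torch.randn(tokens.shape[0], vocab)

def parallel_denoise(seq_len: int = 12, steps: int = 4) -> torch.Tensor:
    tokens = torch.full((seq_len,), MASK)
    per_step = seq_len // steps            # several positions resolved per step
    for _ in range(steps):
        logits = fake_model(tokens)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != MASK] = -1.0        # already-decoded positions stay fixed
        idx = conf.topk(per_step).indices  # unmask the most confident positions
        tokens[idx] = pred[idx]
    return tokens

out = parallel_denoise()  # all 12 positions filled after 4 parallel steps
```

With 12 positions and 4 steps this resolves 3 tokens per forward pass, which is where the potential latency advantage over strictly sequential decoding comes from.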
Load-bearing premise
That empirical gains on the selected instruction-tuning datasets and benchmarks will hold beyond this specific mixture and that the diffusion process can preserve coherent multimodal alignment without autoregressive sequential constraints.
What would settle it
A clear drop in multimodal benchmark scores when the same model is tested on instruction data drawn from distributions outside the training mixture or when the vision encoder and connector are ablated while keeping the diffusion backbone fixed.
read the original abstract
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLaDA-V, a purely diffusion-based Multimodal Large Language Model extending the LLaDA language diffusion model via a vision encoder and MLP connector for visual instruction tuning. It reports competitive multimodal performance to LLaMA3-V when using identical instruction data, better data scalability, narrowing of the gap to Qwen2-VL, and state-of-the-art results among existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs.
Significance. If the performance claims are shown to hold under controlled comparisons of data volume and vision encoders, the work would establish that masked diffusion models can achieve effective multimodal alignment and instruction following without autoregressive decoding. The reported competitiveness despite a weaker base language model on text-only tasks, together with scalability observations, would provide evidence that diffusion paradigms merit further study as alternatives to dominant autoregressive MLLMs.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).
- [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.
minor comments (2)
- [Conclusion] The manuscript could add a dedicated limitations paragraph discussing potential coherence issues in long multimodal sequences under the diffusion process.
- [§3] Notation for the masked diffusion objective and the MLP connector projection could be made more explicit with an equation reference in §3.
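For context, the objective the second minor comment points to is usually written, in standard masked-diffusion notation with visual conditioning added (a sketch of the common formulation, not necessarily the manuscript's own equation):

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[
  \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]
  \log p_\theta\!\left(x_0^{i} \mid x_t,\, v\right)
\right],
\qquad v = \mathrm{MLP}\!\left(f_{\mathrm{vision}}(\mathrm{image})\right),
```

where $t \sim \mathcal{U}(0,1]$, $x_t$ masks each token of $x_0$ independently with probability $t$, $\mathrm{M}$ is the mask token, and $v$ denotes the visual features projected by the connector.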
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim among hybrid AR-diffusion and pure diffusion MLLMs is load-bearing for the central contribution, yet the manuscript provides no explicit quantification of total training tokens, image-text pairs, or vision encoder parameters for the diffusion baselines. Without this, the reported gains cannot be isolated from potential differences in training mixture size or backbone strength (e.g., CLIP-style encoders).
Authors: We agree that an explicit comparison of training resources would better support the SOTA claim and help isolate architectural effects. In the revised manuscript we will add a table in §4 summarizing the number of training tokens, image-text pairs, and vision-encoder parameters for LLaDA-V and the compared hybrid and pure-diffusion MLLMs, citing the original publications for the baselines. This addition will clarify the scale of the comparisons. revision: yes
-
Referee: [§4.2] §4.2 (Main Results): The competitiveness to LLaMA3-V on identical instruction data and the narrowing gap to Qwen2-VL are presented without accompanying statistical significance tests, variance across runs, or details on evaluation data splits. This weakens assessment of whether the diffusion architecture itself drives the observed multimodal gains.
Authors: We acknowledge the value of statistical reporting. The LLaMA3-V comparison uses exactly the same instruction data, which already controls for data volume. All evaluations follow the standard benchmark protocols and splits published with each dataset. Because of the substantial compute required for large-model training, we report single-run results. In revision we will (i) explicitly state the evaluation splits used, (ii) note the consistency of gains across tasks, and (iii) add a limitations paragraph acknowledging the lack of multiple runs and statistical tests. revision: partial
Circularity Check
No circularity in empirical performance claims
full rationale
The paper introduces LLaDA-V by extending a prior diffusion language model with a vision encoder and reports benchmark results after instruction tuning. All central claims consist of observed performance numbers on public multimodal benchmarks rather than any derivation, prediction, or first-principles result that reduces to fitted parameters or self-citations by construction. No equations appear that would turn training objectives back into outputs, and comparisons to LLaMA3-V or Qwen2-VL are presented as empirical observations, not forced equivalences. Self-reference to the base LLaDA model is architectural background and does not load-bear any circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP connector dimensions and training schedule
axioms (1)
- domain assumption Masked diffusion language modeling can be extended to multimodal inputs via a vision encoder and linear projection without loss of coherence.
Forward citations
Cited by 19 Pith papers
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
-
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
ELF: Embedded Language Flows
ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
Reference graph
Works this paper leans on
-
[1]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023
work page 2023
-
[2]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306
work page 2024
-
[3]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24185–24198
work page 2024
-
[5]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv:2405.09818, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024
-
[10]
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Y. Wang, X. Li, Z. Yan, Y. He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao et al., “Internvideo2.5: Empowering video mllms with long and rich context modeling,” arXiv preprint arXiv:2501.12386, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Sharegpt4video: Improving video understanding and generation with better captions,
L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., “Sharegpt4video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19472–19495, 2024
work page 2024
-
[12]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Improving language understanding by generative pre-training,
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018
work page 2018
-
[14]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
work page 2019
-
[15]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020
work page 2020
-
[16]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Textbooks Are All You Need II: phi-1.5 technical report
Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv preprint arXiv:2309.05463, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al., “Deepseek llm: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Deep unsupervised learning using nonequilibrium thermodynamics,
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265
work page 2015
-
[23]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020
work page 2020
-
[24]
Score-Based Generative Modeling through Stochastic Differential Equations
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[25]
Argmax flows and multinomial diffusion: Learning categorical distributions,
E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,” NeurIPS, vol. 34, pp. 12454–12465, 2021
work page 2021
-
[26]
Structured denoising diffusion models in discrete state-spaces,
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” in Advances in Neural Information Processing Systems, 2021
work page 2021
-
[27]
One transformer fits all distributions in multi-modal diffusion at scale,
F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” in International Conference on Machine Learning. PMLR, 2023, pp. 1692–1717
work page 2023
-
[28]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” arXiv preprint arXiv:2408.12528, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy, “Transfusion: Predict the next token and diffuse images with one multi-modal model,” arXiv preprint arXiv:2408.11039, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, L. Zhao et al., “Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation,” arXiv preprint arXiv:2411.07975, 2024
-
[31]
Metamorph: Multimodal understanding and generation via instruction tuning,
S. Tong, D. Fan, J. Zhu, Y . Xiong, X. Chen, K. Sinha, M. Rabbat, Y . LeCun, S. Xie, and Z. Liu, “Metamorph: Multimodal understanding and generation via instruction tuning,” arXiv preprint arXiv:2412.14164, 2024
-
[32]
Orthus: Autoregressive interleaved image-text generation with modality-specific heads,
S. Kou, J. Jin, Z. Liu, C. Liu, Y. Ma, J. Jia, Q. Chen, P. Jiang, and Z. Deng, “Orthus: Autoregressive interleaved image-text generation with modality-specific heads,” arXiv preprint arXiv:2412.00127, 2024
-
[33]
Unified multimodal discrete diffusion,
A. Swerdlow, M. Prabhudesai, S. Gandhi, D. Pathak, and K. Fragkiadaki, “Unified multimodal discrete diffusion,” arXiv preprint arXiv:2503.20853, 2025
-
[34]
Dual diffusion for unified image generation and understanding,
Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P. Wang, “Dual diffusion for unified image generation and understanding,” arXiv preprint arXiv:2501.00289, 2024
-
[35]
A continuous time framework for discrete denoising models,
A. Campbell, J. Benton, V. D. Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet, “A continuous time framework for discrete denoising models,” in Advances in Neural Information Processing Systems, 2022
work page 2022
-
[36]
Diffusionbert: Improving generative masked language models with diffusion models,
Z. He, T. Sun, K. Wang, X. Huang, and X. Qiu, “Diffusionbert: Improving generative masked language models with diffusion models,” arXiv preprint arXiv:2211.15029, 2022
-
[37]
Score-based continuous-time discrete diffusion models,
H. Sun, L. Yu, B. Dai, D. Schuurmans, and H. Dai, “Score-based continuous-time discrete diffusion models,” in The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[38]
Discrete diffusion modeling by estimating the ratios of the data distribution,
A. Lou, C. Meng, and S. Ermon, “Discrete diffusion modeling by estimating the ratios of the data distribution,” in Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[39]
Simplified and generalized masked diffusion for discrete data,
J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias, “Simplified and generalized masked diffusion for discrete data,” arXiv preprint arXiv:2406.04329, 2024
-
[40]
Simple and effective masked diffusion language models,
S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov, “Simple and effective masked diffusion language models,” arXiv preprint arXiv:2406.07524, 2024
-
[41]
Your absorbing discrete diffusion secretly models the conditional distributions of clean data,
J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li, “Your absorbing discrete diffusion secretly models the conditional distributions of clean data,” arXiv preprint arXiv:2406.03736, 2024
-
[42]
Large Language Diffusion Models
S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Effective and efficient masked image generation models,
Z. You, J. Ou, X. Zhang, J. Hu, J. Zhou, and C. Li, “Effective and efficient masked image generation models,” arXiv preprint arXiv:2503.07197, 2025
-
[44]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Are We on the Right Way for Evaluating Large Vision-Language Models?
L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin et al., “Are we on the right way for evaluating large vision-language models?” arXiv preprint arXiv:2403.20330, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Scaling up masked diffusion models on text,
S. Nie, F. Zhu, C. Du, T. Pang, Q. Liu, G. Zeng, M. Lin, and C. Li, “Scaling up masked diffusion models on text,” arXiv preprint arXiv:2410.18514, 2024
-
[47]
A. Campbell, J. Yim, R. Barzilay, T. Rainforth, and T. Jaakkola, “Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design,” 2024
work page 2024
-
[48]
V. T. Hu and B. Ommer, “[mask] is all you need,” 2024. [Online]. Available: https://arxiv.org/abs/2412.06787
-
[49]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
work page 2021
-
[50]
Sigmoid loss for language image pre-training,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986
work page 2023
-
[51]
Wan: Open and Advanced Large-Scale Video Generative Models
A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng et al. , “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., “Hunyuanvideo: A systematic framework for large video generative models,” arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Llava-next: Improved reasoning, ocr, and world knowledge,
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next/
work page 2024
-
[55]
J. Guo, T. Zheng, Y. Bai, B. Li, Y. Wang, K. Zhu, Y. Li, G. Neubig, W. Chen, and X. Yue, “Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale,” arXiv preprint arXiv:2412.05237, 2024
-
[56]
Visualwebinstruct: Scaling up multimodal instruction data through web search,
Y. Jia, J. Li, X. Yue, B. Li, P. Nie, K. Zou, and W. Chen, “Visualwebinstruct: Scaling up multimodal instruction data through web search,” arXiv preprint arXiv:2503.10582, 2025
-
[57]
Qwen3: Think deeper, act faster,
Q. Team, “Qwen3: Think deeper, act faster,” 2025, https://qwenlm.github.io/blog/qwen3/. [Online]. Available: https://qwenlm.github.io/blog/qwen3/
work page 2025
-
[58]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[59]
Direct preference optimization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023
work page 2023
-
[60]
Simpo: Simple preference optimization with a reference-free reward,
Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” Advances in Neural Information Processing Systems, vol. 37, pp. 124198–124235, 2024
work page 2024
-
[61]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9556–9567
-
[63]
X. Yue, T. Zheng, Y. Ni, Y. Wang, K. Zhang, S. Tong, Y. Sun, B. Yu, G. Zhang, H. Sun et al., “Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark,” arXiv preprint arXiv:2409.02813, 2024
-
[64]
C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” arXiv preprint arXiv:2306.13394, 2023
-
[65]
B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” arXiv preprint arXiv:2307.16125, 2023
-
[66]
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu et al., “Mmbench: Is your multi-modal model an all-around player?” in European Conference on Computer Vision. Springer, 2024, pp. 216–233
-
[67]
R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y. Qiao et al., “Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?” in European Conference on Computer Vision. Springer, 2024, pp. 169–186
-
[68]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models,” CoRR, 2023
-
[69]
A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi, “A diagram is worth a dozen images,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 235–251
-
[70]
A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,” arXiv preprint arXiv:2203.10244, 2022
-
[71]
M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2200–2209
-
[72]
M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar, “Infographicvqa,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1697–1706
-
[73]
x.ai, “Grok-1.5 vision preview,” 2024. [Online]. Available: https://x.ai/news/grok-1.5v/
-
[74]
F. Wang, X. Fu, J. Y. Huang, Z. Li, Q. Liu, X. Liu, M. D. Ma, N. Xu, W. Zhou, K. Zhang et al., “Muirbench: A comprehensive benchmark for robust multi-image understanding,” arXiv preprint arXiv:2406.09411, 2024
-
[75]
J. Zhou, Y. Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu, “Mlvu: A comprehensive benchmark for multi-task long video understanding,” arXiv preprint arXiv:2406.04264, 2024
-
[76]
C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” arXiv preprint arXiv:2405.21075, 2024
-
[77]
L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin, “Sharegpt4v: Improving large multi-modal models with better captions,” in European Conference on Computer Vision. Springer, 2024, pp. 370–387
-
[78]
P. Tong, E. Brown, P. Wu, S. Woo, A. J. V. Iyer, S. C. Akula, S. Yang, J. Yang, M. Middepogu, Z. Wang et al., “Cambrian-1: A fully open, vision-centric exploration of multimodal llms,” Advances in Neural Information Processing Systems, vol. 37, pp. 87310–87356, 2024
-
[79]
H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang et al., “Deepseek-vl: towards real-world vision-language understanding,” arXiv preprint arXiv:2403.05525, 2024
-
[80]
Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang et al., “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv:2412.10302, 2024