arxiv: 2606.27376 · v1 · pith:CZHIZUUWnew · submitted 2026-06-25 · 💻 cs.CV

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords: self-evolving multimodal models, unified LMM training, self-consistency rewards, Solver Token Entropy, image generation evaluation, visual question answering, unlabeled data training

The pith

A unified multimodal model improves its visual understanding and image generation using only self-derived consistency signals from unlabeled images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a single large multimodal model can advance both its question-answering accuracy and its image-generation quality without any human labels, preference data, or external reward models. It does so by assigning the model three internal roles that operate on the same unlabeled images: one role proposes visual questions, a second solves and scores them, and a third generates new images whose quality is judged by the solver. Training signals come entirely from measuring consistency between these roles, stabilized by an added token-level uncertainty measure called Solver Token Entropy. The same loop and reward logic are applied unchanged to three different backbone architectures and produce gains on eight separate understanding benchmarks plus one generation benchmark.
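To make the "consistency between roles" idea concrete, here is a minimal sketch of a sample-level agreement score, assuming the Solver answers the same proposed question under several prompt perturbations and that agreement is measured as the share of answers matching the majority answer. The function names `normalize` and `agreement` and the string normalization are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def agreement(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of answers that match it.

    Hypothetical stand-in for a self-consistency reward: 1.0 means every
    perturbed prompt produced the same answer, lower values mean disagreement.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Four Solver answers to one question, each from a differently worded prompt.
answers = ["A red bus.", "a red bus", "A blue bus", "a red bus"]
majority, score = agreement(answers)
```

A score like this is coarse: with few samples it takes only a handful of distinct values, which is one plausible reason a continuous signal is added alongside it.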

Core claim

A Proposer-Solver-Generator loop that generates its own training signals from question-answer consistency and cycle-consistent image evaluation enables a unified LMM to raise performance on both understanding and generation tasks, delivering a 3.5-point absolute gain on MMMU and lifting GenEval from 82 percent to 85 percent for the BAGEL backbone while requiring only unlabeled images and each model’s native interface.

What carries the argument

The Proposer-Solver-Generator loop mediated by Solver Token Entropy, which supplies a continuous internal difficulty signal that remains usable when sample-level consistency scores become unreliable.
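The abstract describes Solver Token Entropy only as a continuous difficulty signal based on token-level prediction uncertainty, so the exact formula is not available here. The sketch below assumes one natural reading: the mean Shannon entropy of the Solver's next-token distribution over the answer tokens. The function names and the use of raw per-step logits are assumptions for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one decoding step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def solver_token_entropy(step_logits: list[list[float]]) -> float:
    """Assumed STE: average token entropy across the generated answer."""
    if not step_logits:
        raise ValueError("need at least one decoding step")
    return sum(token_entropy(step) for step in step_logits) / len(step_logits)


# A confident answer (peaked logits) versus an uncertain one (flat logits).
confident = solver_token_entropy([[10.0, 0.0, 0.0, 0.0]])
uncertain = solver_token_entropy([[0.0, 0.0, 0.0, 0.0]])
```

Under this reading the signal varies smoothly with the model's uncertainty even when every sampled answer agrees or every one disagrees, which is the situation where a sample-level agreement score carries little information.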

If this is right

  • Better visual understanding directly strengthens the reliability of the internal generation-assessment signal.
  • The identical role decomposition and reward logic transfers across diffusion, rectified-flow, and autoregressive backbones without architecture-specific changes.
  • No curated annotations, preference pairs, or separately trained judge models are required for the observed gains.
  • Multi-scale fidelity scoring plus cycle-consistent captioning creates a closed coupling between understanding and generation training.
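As a rough illustration of how QA fidelity and cycle-consistent captioning could be folded into one generation reward, the sketch below assumes exact-match scoring of the Solver's answers about a generated image, a word-level Jaccard overlap between the original prompt and the Solver's caption of that image, and a convex combination with a hypothetical weight `alpha`. The source does not give the combination formula, and the multi-scale aspect is not modelled here.

```python
def qa_fidelity(expected: list[str], predicted: list[str]) -> float:
    """Fraction of questions whose answer on the generated image matches."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and aligned")
    hits = sum(
        e.strip().lower() == p.strip().lower()
        for e, p in zip(expected, predicted)
    )
    return hits / len(expected)


def caption_overlap(prompt: str, caption: str) -> float:
    """Word-level Jaccard overlap between the prompt and the re-caption."""
    a, b = set(prompt.lower().split()), set(caption.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generation_reward(qa: float, cycle: float, alpha: float = 0.5) -> float:
    """Hypothetical convex mix of the two signals; alpha is an assumption."""
    return alpha * qa + (1.0 - alpha) * cycle


qa = qa_fidelity(["red", "two"], ["Red", "three"])
cycle = caption_overlap("a red bus", "a red car")
reward = generation_reward(qa, cycle)
```

Both terms depend on the Solver, which is why an improved Solver would be expected to yield a more trustworthy reward for the Generator.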

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce dependence on large human-annotated multimodal datasets for continued scaling.
  • Similar internal-consistency loops might be definable for other paired tasks such as text-to-video or audio-visual generation.
  • If the entropy signal continues to function at larger scales, the method offers a route to iterative self-improvement without external supervision.

Load-bearing premise

The internal consistency signals produced by the three-role loop stay reliable enough to drive genuine improvement rather than reinforcing the model’s existing errors.

What would settle it

Training runs that apply the self-consistency rewards produce no gain or a measurable drop on held-out understanding and generation benchmarks relative to the untouched base models.

Figures

Figures reproduced from arXiv: 2606.27376 by Abdelrahman Shaker, Fahad Khan, Hisham Cholakkal, Omkar Thawakar, Rao Muhammad Anwer, Ritesh Thawkar, Salman Khan, Shravan Venkatraman.

Figure 1: Overview of our self-evolving framework. Three LoRA adapters (Proposer, Solver, and Generator) are trained on a frozen backbone using only unlabeled images. The understanding loop uses prompt-perturbed self-consistency and Solver Token Entropy (STE), while the generation loop uses the Solver as an internal evaluator through QA fidelity and cycle-consistent captioning.
Figure 2: Overview of our Proposer–Solver–Generator self-evolving framework. Given only a frozen backbone and unlabeled images, we attach three lightweight LoRA adapters for the Proposer, Solver, and Generator roles. In understanding steps (left), the Proposer generates visual questions, and the Solver answers under multiple prompt perturbations; self-consistency agreement and Solver Token Entropy (STE) jointly prod…
Figure 3: Signal analysis on BLIP3o-8B revealing the complementary roles of our self-evolving training signals.
Figure 4: Solver-mediated loop dynamics on BLIP3o-8B comparing joint training against single-loop variants over …
Figure 5: Training dynamics over a roughly 10k-step horizon. Understanding signals stabilize after STE-driven exploration, and generation rewards rise across diffusion, rectified-flow, and autoregressive backbones without reward plateaus.
Figure 6: Qualitative comparison of base vs. self-evolved outputs across tasks and backbones.
Original abstract

Most unified large multimodal models (LMMs) that support both visual understanding and image generation still rely on curated post-training supervision, such as human annotations, preference labels, or external reward models. We ask whether a unified LMM can improve both abilities autonomously using only unlabeled images. We propose a self-evolving training framework with three internal roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. Training uses only self-derived consistency signals, without human annotations, preference labels, or task-trained external reward/judge models. To stabilize learning, we introduce Solver Token Entropy (STE), a continuous difficulty signal based on token-level prediction uncertainty that remains useful even when sample-level consistency becomes unreliable. For image generation, we design a multi-scale internal evaluation scheme that combines question-answer fidelity scoring with cycle-consistent captioning. This creates a solver-mediated coupling, where better visual understanding enables more reliable generation assessment and stronger internal training signals. The framework preserves the same role decomposition, reward logic, and training schedule across diffusion-based BLIP3o, rectified-flow BAGEL, and autoregressive VARGPT-v1.1 architectures, requiring only each backbone's native prompting and generation interface. Across eight understanding metrics, our method consistently improves over the corresponding base models. On BAGEL, it achieves a $+3.5\%$ absolute gain on MMMU and improves GenEval image generation performance from $82\%$ to $85\%$. Code and models are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-evolving training framework for unified LMMs supporting both understanding and generation. It decomposes training into Proposer (generates questions from unlabeled images), Solver (answers and evaluates), and Generator (synthesizes images) roles, using only internal consistency signals between these components as the training objective. Solver Token Entropy (STE) is introduced as a continuous difficulty signal based on token uncertainty to stabilize learning when sample-level consistency fails. A multi-scale internal evaluation combines QA fidelity with cycle-consistent captioning for generation. The same framework is applied to BLIP3o, BAGEL, and VARGPT-v1.1, yielding consistent gains on eight understanding metrics and specific improvements on BAGEL (+3.5% MMMU, GenEval 82% to 85%). Code and models are released.

Significance. If the self-consistency signals can be shown to supply non-circular, task-aligned gradients rather than regularization effects, the result would be significant for demonstrating autonomous post-training of unified multimodal models without human annotations or external judges. The cross-architecture applicability and public release of code/models are explicit strengths that aid verification and extension.

major comments (2)
  1. [§3.3] Solver Token Entropy: The claim that STE supplies a reliable non-circular difficulty signal when sample-level consistency is unreliable is load-bearing for the central training loop, yet the manuscript provides no ablation that replaces STE with a constant or random signal while keeping other loop components fixed. Without this isolation, attribution of the +3.5% MMMU gain to the semantic content of consistency versus implicit regularization remains unverified.
  2. [§4.2] BAGEL experiments: The reported gains on MMMU and GenEval are attributed to the Proposer-Solver-Generator consistency rewards, but the experimental design lacks a control that disables the consistency metric (e.g., uniform random rewards or extra forward passes without the reward logic). This leaves open whether observed improvements arise from the proposed self-derived signals or from auxiliary effects of the training schedule.
minor comments (2)
  1. [Abstract] The phrase 'across eight understanding metrics' is used without enumeration; listing the specific metrics (e.g., VQAv2, GQA) would improve immediate clarity.
  2. [§3.1] The multi-scale internal evaluation scheme for generation is described at a high level; providing the exact weighting or combination formula between QA fidelity and cycle-consistent captioning would aid reproducibility.
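The controls requested in the major comments amount to swapping the reward source while leaving the rest of the training loop untouched. A minimal sketch of such a switch follows, assuming the training code exposes the reward as a callable on a sample. `make_reward_fn`, the mode names, and the constant value 0.5 are hypothetical and do not come from the paper's released code.

```python
import random
from typing import Callable

RewardFn = Callable[[object], float]


def make_reward_fn(mode: str, real_reward: RewardFn, seed: int = 0) -> RewardFn:
    """Select the reward source for an ablation run.

    "consistency": the actual self-derived reward (the proposed method).
    "constant":    the same value for every sample (no semantic content).
    "random":      uniform noise in [0, 1) (no semantic content, same scale).
    """
    rng = random.Random(seed)
    if mode == "consistency":
        return real_reward
    if mode == "constant":
        return lambda sample: 0.5
    if mode == "random":
        return lambda sample: rng.random()
    raise ValueError(f"unknown mode: {mode}")


# Placeholder for the real reward; a fixed value keeps the example runnable.
def placeholder_reward(sample: object) -> float:
    return 0.9


runs = {m: make_reward_fn(m, placeholder_reward) for m in ("consistency", "constant", "random")}
```

If the "constant" and "random" runs recover most of the benchmark gain under the same schedule, the improvement would be better explained by auxiliary training effects than by the consistency signal itself.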

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the need for stronger isolation of the self-consistency signals. We address each major point below. Where the manuscript lacks the requested controls, we agree that additional experiments are warranted and will incorporate them.

Point-by-point responses
  1. Referee: [§3.3] Solver Token Entropy: The claim that STE supplies a reliable non-circular difficulty signal when sample-level consistency is unreliable is load-bearing for the central training loop, yet the manuscript provides no ablation that replaces STE with a constant or random signal while keeping other loop components fixed. Without this isolation, attribution of the +3.5% MMMU gain to the semantic content of consistency versus implicit regularization remains unverified.

    Authors: We agree that an explicit ablation replacing Solver Token Entropy with a constant or random signal (while holding the remainder of the Proposer-Solver-Generator loop fixed) would strengthen attribution of gains to the semantic content of the consistency signals. The manuscript shows that STE prevents instability when sample-level consistency fails, but does not contain the requested control. We will add this ablation in the revised version. revision: yes

  2. Referee: [§4.2] BAGEL experiments: The reported gains on MMMU and GenEval are attributed to the Proposer-Solver-Generator consistency rewards, but the experimental design lacks a control that disables the consistency metric (e.g., uniform random rewards or extra forward passes without the reward logic). This leaves open whether observed improvements arise from the proposed self-derived signals or from auxiliary effects of the training schedule.

    Authors: We acknowledge that the current BAGEL results lack a control that replaces the consistency-derived rewards with random or null signals while preserving the training schedule. Such a control would more conclusively rule out auxiliary schedule effects. We will add the requested control experiments in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity: external benchmarks validate internal self-consistency training

Full rationale

The paper trains via internal Proposer-Solver-Generator consistency signals (plus STE) but evaluates gains on independent external benchmarks (MMMU +3.5%, GenEval 82%→85%). No equation or claim equates the reported performance improvement to the consistency metric by construction. The reward is a training objective; the output metrics are separate held-out tasks. No self-citations, fitted-input predictions, or self-definitional reductions appear in the abstract or described framework. The derivation chain remains non-circular because success is measured outside the self-derived signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger reflects the method's stated dependencies; the core unverified premise is that internal consistency alone suffices for stable improvement.

axioms (1)
  • domain assumption: Self-consistency between internally generated questions, answers, and images provides a reliable training signal without external grounding.
    Invoked as the sole source of reward across all reported gains.

pith-pipeline@v0.9.1-grok · 5844 in / 1185 out tokens · 36251 ms · 2026-06-26T04:56:22.193868+00:00 · methodology
