arxiv: 2511.22663 · v5 · submitted 2025-11-27 · 💻 cs.CV

AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Pith reviewed 2026-05-17 04:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · attention interaction alignment · cross-modal attention · task conflicts · image generation · multimodal understanding · AIA loss

The pith

Unified multimodal models can fix understanding-generation conflicts by aligning cross-modal attention patterns rather than decoupling their architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines why splitting model architectures improves performance in unified systems that handle both image understanding and generation. The key observation is that decoupling pushes the model to use attention behaviors seen in specialized, task-specific models. To capture this benefit without the drawbacks of decoupling, the authors introduce an Attention Interaction Alignment loss that trains the model to match those desired interaction patterns. Experiments on models like Emu3 and Janus-Pro show gains in both task types while preserving the ability to produce interleaved text and image outputs. This approach matters because it offers a way to strengthen unified models without fragmenting them into separate components.

Core claim

Architecture decoupling does not solve task conflicts but essentially drives models toward cross-modal interaction patterns of task-specific models. The Attention Interaction Alignment (AIA) loss explicitly learns these patterns during training, refining cross-modal attention and boosting performance in generation and understanding tasks.

What carries the argument

Attention Interaction Alignment (AIA) loss, which explicitly aligns the cross-modal attention patterns of the unified model to those of task-specific models during training.
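
As a rough illustration of what aligning cross-modal attention could involve, the sketch below measures how much attention mass text tokens place on image tokens and penalizes the gap to a target profile. It is not the paper's formulation: the function names, the mean-squared penalty, and the toy attention matrix are all assumptions made for illustration.

```python
def cross_modal_intensity(attn, query_idx, key_idx):
    """Average attention mass the query tokens place on the key tokens.

    attn is a row-stochastic matrix (list of lists): attn[i][j] is the
    weight query token i assigns to key token j.
    """
    total = 0.0
    for i in query_idx:
        total += sum(attn[i][j] for j in key_idx)
    return total / len(query_idx)


def alignment_penalty(layer_intensities, target_intensities):
    """Mean squared gap between per-layer intensities and a target profile.

    Assumption: a squared penalty stands in for whatever distance the
    actual AIA loss uses.
    """
    gaps = [(x - t) ** 2 for x, t in zip(layer_intensities, target_intensities)]
    return sum(gaps) / len(gaps)


# Toy example (hypothetical values): tokens 0-1 are text, tokens 2-3 are image.
attn = [
    [0.4, 0.4, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.1, 0.1, 0.4, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
text_idx, image_idx = [0, 1], [2, 3]
text_to_image = cross_modal_intensity(attn, text_idx, image_idx)
penalty = alignment_penalty([text_to_image, 0.5], [0.5, 0.5])
```

In this toy setup the text tokens place about 30% of their attention on image tokens, and the penalty is nonzero only for the layer whose intensity differs from the target.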

If this is right

  • Models trained with AIA show refined cross-modal attention patterns similar to specialized systems.
  • Both image generation and understanding performance improve without additional architecture changes.
  • The approach maintains the interleaved generation ability that decoupling tends to undermine.
  • AIA can be applied during supervised fine-tuning or post-training stages on different base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future unified models might achieve high performance with simpler, fully shared architectures if attention alignment is used.
  • Combining attention patterns from multiple specialized models could lead to even stronger unified systems.
  • This method may generalize to other conflicting task pairs in multimodal learning beyond vision and language.

Load-bearing premise

That the performance boost from decoupling comes primarily from shifting to task-specific cross-modal attention patterns, which can be replicated through a loss function without losing unified model benefits.

What would settle it

Apply AIA to a unified model and check three things together: whether its cross-modal attention maps become more similar to those of Qwen3-VL (understanding) or HunyuanImage-3.0 (generation), whether generation and understanding metrics improve, and whether interleaved output capability is retained.
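
The attention-similarity part of this check can be sketched as follows. Cosine similarity over flattened maps is one possible metric, chosen here as an assumption; the maps are toy values, not measurements, and the sketch assumes maps from different models have already been brought to a common shape.

```python
import math


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two attention maps of identical shape."""
    flat_a = [x for row in map_a for x in row]
    flat_b = [x for row in map_b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(x * x for x in flat_b))
    return dot / (norm_a * norm_b)


# Hypothetical maps: a task-specific reference, and a unified model
# before and after alignment training.
reference = [[0.7, 0.3], [0.2, 0.8]]
before = [[0.5, 0.5], [0.5, 0.5]]
after = [[0.65, 0.35], [0.25, 0.75]]

sim_before = cosine_similarity(before, reference)
sim_after = cosine_similarity(after, reference)
moved_closer = sim_after > sim_before
```

A result of `moved_closer` being true would only cover the first of the three checks; task metrics and interleaved output still need to be evaluated separately.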

Figures

Figures reproduced from arXiv: 2511.22663 by Dian Zheng, Hongbo Liu, Hongsheng Li, Hongyu Li, Kaituo Feng, Kai Zou, Manyuan Zhang, Yexin Liu, Ying Luo, Ziyu Guo.

Figure 1: Various architectures of UMMs and its corresponding cross-modal interaction patterns. We arrange the models in order of ...
Figure 2: The pipeline of cross-modal interaction intensity cal...
Figure 3: Training loss curve of Emu3 and Janus-Pro under various AIA coefficient. NTP and AIA means next-token-prediction and ...
Figure 4: Cross-Modal Attention Patterns Visualization of Different Single-Task Models.
Figure 5: Visualization of cross-modal attention patterns modification after AIA training. Task-specific models are Qwen3-VL-8B for ...
Original abstract

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that architecture decoupling in unified multimodal models primarily improves performance by driving cross-modal attention toward the patterns seen in task-specific models such as Qwen3-VL and HunyuanImage-3.0. It proposes an Attention Interaction Alignment (AIA) loss that explicitly learns these patterns, applied to Emu3 during SFT and to Janus-Pro during post-training, with the aim of mitigating task conflicts while preserving interleaved generation ability and boosting both generation and understanding performance.

Significance. If the central mechanism is validated with quantitative attention comparisons and the reported gains hold under rigorous controls, the work could provide a lighter-weight alternative to heavy architectural decoupling for unified multimodal training, with potential to simplify model design while retaining strong cross-task capabilities.

major comments (3)
  1. [Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.
  2. [Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.
  3. [Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.
minor comments (1)
  1. [Abstract] Abstract: consider adding one sentence on the concrete datasets or benchmarks used for the Emu3 SFT and Janus-Pro post-training experiments to improve reproducibility.
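
One of the metrics named in major comment 1, KL divergence between attention distributions, can be sketched as below. This is an illustrative assumption about how such a comparison might be set up: the direction KL(reference || model), the epsilon smoothing, and the toy maps are choices made here, not details from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_row_kl(reference_map, model_map):
    """Average KL(reference row || model row) over all query rows."""
    per_row = [kl_divergence(p, q) for p, q in zip(reference_map, model_map)]
    return sum(per_row) / len(per_row)


# Hypothetical attention maps; each row is one query token's distribution.
reference = [[0.7, 0.3], [0.2, 0.8]]
before = [[0.5, 0.5], [0.5, 0.5]]
after = [[0.65, 0.35], [0.25, 0.75]]

kl_before = mean_row_kl(reference, before)
kl_after = mean_row_kl(reference, after)
```

Computing the same quantity per layer would give the layer-wise statistics the comment asks for; a lower value after training would indicate the unified model's attention has moved toward the reference.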

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional evidence and detail would strengthen the manuscript. We address each major comment below and have revised the paper to incorporate quantitative metrics, ablations, and methodological details as requested.

Point-by-point responses
  1. Referee: [Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.

    Authors: We agree that quantitative support for the core observation is essential. The original manuscript relied on qualitative attention visualizations in Section 3. In the revision we have added cosine similarity and KL-divergence measurements between the cross-modal attention maps of the unified models (pre- and post-AIA) and the reference task-specific models (Qwen3-VL and HunyuanImage-3.0). Layer-wise statistics are also reported. These metrics show a clear increase in similarity after AIA, directly supporting the claim that decoupling drives models toward task-specific patterns and that AIA reproduces this effect without architectural changes. revision: yes

  2. Referee: [Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.

    Authors: We acknowledge the need for stronger controls. The revised manuscript now includes: (1) additional baselines that isolate AIA from standard cross-entropy and regularization losses, (2) ablation studies removing the alignment term, (3) results with standard error bars computed over three independent runs, and (4) direct before/after quantitative attention-map comparisons (cosine similarity and KL) to the reference models. These additions demonstrate that performance improvements are larger and more consistent when the alignment term is present, supporting the mechanism over generic regularization. revision: yes

  3. Referee: [Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.

    Authors: We accept that the original description was insufficiently precise. The revised Method section now contains the full loss equation (new Equation 3), the procedure for constructing reference patterns (averaging attention maps extracted from frozen Qwen3-VL on understanding data and HunyuanImage-3.0 on generation data), and the hyper-parameter values used (λ = 0.5, temperature τ = 1.0). The loss is not parameter-free; it relies on fixed external reference models. Circularity is avoided because the references are obtained once from independent task-specific models and remain frozen during AIA training, rather than being derived from the model being optimized. revision: yes
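
The reference-pattern construction and loss weighting described in this simulated response can be sketched as follows. Everything here is an assumption layered on that description: the additive form `ntp_loss + lam * aia_loss`, the helper names, and the toy maps are hypothetical, and the temperature τ is left out because the text does not say where it is applied.

```python
def average_maps(maps):
    """Element-wise mean of several attention maps with identical shape."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / n for c in range(cols)]
        for r in range(rows)
    ]


def total_loss(ntp_loss, aia_loss, lam=0.5):
    """Assumed objective: next-token-prediction loss plus a weighted AIA term."""
    return ntp_loss + lam * aia_loss


# Hypothetical maps extracted once from a frozen reference model on two samples.
extracted = [
    [[0.2, 0.8]],
    [[0.4, 0.6]],
]
reference_pattern = average_maps(extracted)  # fixed during training
loss = total_loss(ntp_loss=2.0, aia_loss=0.4)
```

Because `reference_pattern` is computed once from a separate frozen model and never updated, the alignment target does not depend on the model being trained, which is the non-circularity argument the response makes.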

Circularity Check

0 steps flagged

Derivation of AIA loss is self-contained with no circular reductions

full rationale

The paper's central proposal is the introduction of the Attention Interaction Alignment (AIA) loss, motivated by an analysis of cross-modal attention in task-specific models such as Qwen3-VL and HunyuanImage-3.0. This observation is external to the method itself. The AIA loss is then applied during training of Emu3 and Janus-Pro, with claims of improved performance and refined attention patterns. No equations are presented that equate a derived quantity to its own inputs by construction, nor are there load-bearing self-citations or fitted inputs renamed as predictions. The approach relies on empirical validation rather than tautological definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that decoupling shifts attention patterns and on the assumption that aligning to those patterns improves unified performance; no free parameters or invented entities are detailed in the abstract.

axioms (1)
  • domain assumption: Architecture decoupling boosts performance by driving models toward cross-modal interaction patterns of task-specific models rather than resolving task conflicts.
    This observation from cross-modal attention study is the direct motivation for proposing AIA loss.

pith-pipeline@v0.9.0 · 5586 in / 1244 out tokens · 48685 ms · 2026-05-17T04:14:10.365341+00:00 · methodology



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

  2. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

  3. Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

    cs.CV 2026-05 unverdicted novelty 6.0

    Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...

  4. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  5. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
