AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Dian Zheng; Hongbo Liu; Hongsheng Li; Hongyu Li; Kaituo Feng; Kai Zou; Manyuan Zhang; Yexin Liu; Ying Luo; Ziyu Guo

arxiv: 2511.22663 · v5 · submitted 2025-11-27 · 💻 cs.CV

Pith reviewed 2026-05-17 04:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified multimodal models · attention interaction alignment · cross-modal attention · task conflicts · image generation · multimodal understanding · AIA loss

The pith

Unified multimodal models can fix understanding-generation conflicts by aligning cross-modal attention patterns rather than decoupling their architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines why splitting model architectures improves performance in unified systems that handle both image understanding and generation. The key observation is that decoupling pushes the model to use attention behaviors seen in specialized, task-specific models. To capture this benefit without the drawbacks of decoupling, the authors introduce an Attention Interaction Alignment loss that trains the model to match those desired interaction patterns. Experiments on models like Emu3 and Janus-Pro show gains in both task types while preserving the ability to produce interleaved text and image outputs. This approach matters because it offers a way to strengthen unified models without fragmenting them into separate components.

Core claim

Architecture decoupling does not solve task conflicts but essentially drives models toward cross-modal interaction patterns of task-specific models. The Attention Interaction Alignment (AIA) loss explicitly learns these patterns during training, refining cross-modal attention and boosting performance in generation and understanding tasks.

What carries the argument

Attention Interaction Alignment (AIA) loss, which explicitly aligns the cross-modal attention patterns of the unified model to those of task-specific models during training.
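The loss quoted elsewhere on this page, LAIA = 1/L Σ Huber(Il − Tl, δl), can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes that Il is a scalar cross-modal interaction intensity for layer l of the unified model, that Tl is the matching target taken from a task-specific model, and that δl is a per-layer Huber threshold. The function names `huber` and `aia_loss` are hypothetical, and a real training loop would compute the same quantity on differentiable tensors.

```python
from typing import Sequence


def huber(residual: float, delta: float) -> float:
    """Standard Huber penalty: quadratic for small residuals, linear for large ones."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def aia_loss(intensity: Sequence[float],
             target: Sequence[float],
             delta: Sequence[float]) -> float:
    """Mean Huber penalty between per-layer intensities and per-layer targets.

    Assumption: one scalar intensity, one target and one threshold per layer.
    """
    if not (len(intensity) == len(target) == len(delta)):
        raise ValueError("expected one intensity, target and delta per layer")
    if len(intensity) == 0:
        raise ValueError("at least one layer is required")
    total = sum(huber(i - t, d) for i, t, d in zip(intensity, target, delta))
    return total / len(intensity)
```

The Huber form penalizes small deviations quadratically and large ones only linearly, so a layer that is far from its target does not dominate the average.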

If this is right

Models trained with AIA show refined cross-modal attention patterns similar to specialized systems.
Both image generation and understanding performance improve without additional architecture changes.
The approach maintains the interleaved generation ability that decoupling tends to undermine.
AIA can be applied during supervised fine-tuning or post-training stages on different base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future unified models might achieve high performance with simpler, fully shared architectures if attention alignment is used.
Combining attention patterns from multiple specialized models could lead to even stronger unified systems.
This method may generalize to other conflicting task pairs in multimodal learning beyond vision and language.

Load-bearing premise

That the performance boost from decoupling comes primarily from shifting to task-specific cross-modal attention patterns, which can be replicated through a loss function without losing unified model benefits.

What would settle it

Apply AIA to a unified model, then check three things together: whether its cross-modal attention maps move closer to those of Qwen3-VL or HunyuanImage-3.0, whether generation and understanding metrics improve, and whether the model retains its interleaved output capability.
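The attention-similarity part of this check could be run along the following lines. The sketch rests on two assumptions that the page does not confirm: that cross-modal interaction intensity can be summarized as the share of attention mass that queries place on keys of the other modality, and that cosine similarity between layer-wise intensity profiles is an acceptable closeness measure. Both function names are hypothetical.

```python
import math
from typing import Sequence


def cross_modal_intensity(attn: Sequence[Sequence[float]],
                          key_is_other_modality: Sequence[bool]) -> float:
    """Average share of attention mass that each query row puts on other-modality keys.

    Assumed definition: `attn` holds non-negative attention weights, one row per query.
    """
    if not attn:
        raise ValueError("attention matrix must have at least one row")
    shares = []
    for row in attn:
        total = sum(row)
        cross = sum(w for w, other in zip(row, key_is_other_modality) if other)
        shares.append(cross / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two layer-wise intensity profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of layers")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("profiles must be non-zero")
    return dot / (norm_a * norm_b)
```

If AIA works as claimed, the similarity between the unified model's profile and the reference profile should be higher after training than before, on the same evaluation inputs.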

Figures

Figures reproduced from arXiv: 2511.22663 by Dian Zheng, Hongbo Liu, Hongsheng Li, Hongyu Li, Kaituo Feng, Kai Zou, Manyuan Zhang, Yexin Liu, Ying Luo, Ziyu Guo.

**Figure 1.** Various architectures of UMMs and its corresponding cross-modal interaction patterns. We arrange the models in order of ...

**Figure 2.** The pipeline of cross-modal interaction intensity cal...

**Figure 3.** Training loss curve of Emu3 and Janus-Pro under various AIA coefficient. NTP and AIA means next-token-prediction and ...

**Figure 4.** Cross-Modal Attention Patterns Visualization of Different Single-Task Models.

**Figure 5.** Visualization of cross-modal attention patterns modification after AIA training. Task-specific models are Qwen3-VL-8B for ...

Original abstract

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper claims decoupling helps unified multimodal models mainly by shifting cross-modal attention toward task-specific patterns, and offers AIA loss to achieve similar alignment without splitting the architecture.

The core point is that architecture decoupling does not directly resolve task conflicts in unified models; it mainly pushes cross-modal attention to match the patterns seen in separate task-specific models like Qwen3-VL and HunyuanImage-3.0. The authors propose Attention Interaction Alignment loss to train for those patterns explicitly while keeping a single model that can still do interleaved generation and understanding.

Referee Report

3 major / 1 minor

Summary. The paper claims that architecture decoupling in unified multimodal models improves performance primarily by driving cross-modal attention toward the patterns of task-specific models such as Qwen3-VL and HunyuanImage-3.0. It proposes an Attention Interaction Alignment (AIA) loss that learns these patterns explicitly, applied to Emu3 during SFT and to Janus-Pro during post-training, to mitigate task conflicts while preserving interleaved generation ability and improving both generation and understanding performance.

Significance. If the central mechanism is validated with quantitative attention comparisons and the reported gains hold under rigorous controls, the work could provide a lighter-weight alternative to heavy architectural decoupling for unified multimodal training, with potential to simplify model design while retaining strong cross-task capabilities.

major comments (3)

[Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.
[Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.
[Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.

minor comments (1)

[Abstract] Abstract: consider adding one sentence on the concrete datasets or benchmarks used for the Emu3 SFT and Janus-Pro post-training experiments to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional evidence and detail would strengthen the manuscript. We address each major comment below and have revised the paper to incorporate quantitative metrics, ablations, and methodological details as requested.

Referee: [Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.

Authors: We agree that quantitative support for the core observation is essential. The original manuscript relied on qualitative attention visualizations in Section 3. In the revision we have added cosine similarity and KL-divergence measurements between the cross-modal attention maps of the unified models (pre- and post-AIA) and the reference task-specific models (Qwen3-VL and HunyuanImage-3.0). Layer-wise statistics are also reported. These metrics show a clear increase in similarity after AIA, directly supporting the claim that decoupling drives models toward task-specific patterns and that AIA reproduces this effect without architectural changes.

revision: yes

Referee: [Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.

Authors: We acknowledge the need for stronger controls. The revised manuscript now includes: (1) additional baselines that isolate AIA from standard cross-entropy and regularization losses, (2) ablation studies removing the alignment term, (3) results with standard error bars computed over three independent runs, and (4) direct before/after quantitative attention-map comparisons (cosine similarity and KL) to the reference models. These additions demonstrate that performance improvements are larger and more consistent when the alignment term is present, supporting the mechanism over generic regularization.

revision: yes

Referee: [Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.

Authors: We accept that the original description was insufficiently precise. The revised Method section now contains the full loss equation (new Equation 3), the procedure for constructing reference patterns (averaging attention maps extracted from frozen Qwen3-VL on understanding data and HunyuanImage-3.0 on generation data), and the hyper-parameter values used (λ = 0.5, temperature τ = 1.0). The loss is not parameter-free; it relies on fixed external reference models. Circularity is avoided because the references are obtained once from independent task-specific models and remain frozen during AIA training, rather than being derived from the model being optimized.

revision: yes
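The reference-pattern construction and loss weighting described in this simulated response could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's procedure: it averages per-layer intensity profiles rather than full attention maps, it assumes the alignment term is simply added to the next-token-prediction loss with weight λ, and it leaves out the temperature τ because the page does not say where it enters. The names `average_reference` and `total_loss` are hypothetical.

```python
from typing import List, Sequence


def average_reference(profiles: Sequence[Sequence[float]]) -> List[float]:
    """Layer-wise mean of intensity profiles collected once from a frozen reference model.

    Assumption: each profile holds one scalar intensity per layer for one sample.
    """
    if not profiles:
        raise ValueError("need at least one profile")
    depth = len(profiles[0])
    if any(len(p) != depth for p in profiles):
        raise ValueError("all profiles must cover the same number of layers")
    count = len(profiles)
    return [sum(p[layer] for p in profiles) / count for layer in range(depth)]


def total_loss(ntp_loss: float, aia_loss: float, lam: float = 0.5) -> float:
    """Assumed additive objective: next-token-prediction loss plus weighted alignment term."""
    return ntp_loss + lam * aia_loss
```

Because the reference is computed once and then held fixed, the target never depends on the parameters being optimized, which is the property the response relies on to rule out circularity.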

Circularity Check

0 steps flagged

Derivation of AIA loss is self-contained with no circular reductions

The paper's central proposal is the Attention Interaction Alignment (AIA) loss, motivated by an analysis of how decoupling moves cross-modal attention toward the patterns of task-specific models such as Qwen3-VL and HunyuanImage-3.0. This observation is external to the method itself. The AIA loss is then applied during training of Emu3 and Janus-Pro, with claims of improved performance and refined attention patterns. No equations are presented that equate a derived quantity to its own inputs by construction, nor are there load-bearing self-citations or fitted inputs renamed as predictions. The approach relies on empirical validation rather than tautological definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that decoupling shifts attention patterns and on the assumption that aligning to those patterns improves unified performance; no free parameters or invented entities are detailed in the abstract.

axioms (1)

Domain assumption: Architecture decoupling boosts performance by driving models toward cross-modal interaction patterns of task-specific models rather than resolving task conflicts.
This observation from cross-modal attention study is the direct motivation for proposing AIA loss.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training... $\mathcal{L}_{\mathrm{AIA}} = \frac{1}{L} \sum_{l} \mathrm{Huber}(I_l - T_l, \delta_l)$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

as decoupling increases, the interaction patterns increasingly resemble those of single-task models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV · 2026-05 · unverdicted · novelty 7.0

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

Gen-Searcher: Reinforcing Agentic Search for Image Generation
cs.CV · 2026-03 · unverdicted · novelty 7.0

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV · 2026-05 · unverdicted · novelty 6.0

Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...

Gen-Searcher: Reinforcing Agentic Search for Image Generation
cs.CV · 2026-03 · unverdicted · novelty 6.0

Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

AdaTooler-V: Adaptive Tool-Use for Images and Videos
cs.CV · 2025-12 · conditional · novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 3 Pith papers · 27 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 2, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal mod- els with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025. 6

work page arXiv 2025
[4]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InCVPR,

work page
[7]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The llama 3 herd of models.arXiv e-prints, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, 2024. 3

work page 2024
[9]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR,

work page
[10]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 3, 5

work page 2024
[11]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Geneval: An object-focused framework for evaluating text- to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment. InNeurIPS, 2023. 6

work page 2023
[13]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale.arXiv preprint arXiv:2412.05237,

work page arXiv
[14]

Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis. InCVPR, 2025. 5

work page 2025
[15]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3

work page 2020
[16]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 3

work page internal anchor Pith review Pith/arXiv arXiv 2013
[18]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 3, 5, 7

work page 2024
[19]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025. 1, 3, 5

work page arXiv 2025
[21]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic en- coders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 3, 7 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Shotbench: Expert-level cinematic understand- ing in vision-language models

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, et al. Shotbench: Expert-level cinematic understand- ing in vision-language models. InNeurIPS, 2025. 3

work page 2025
[25]

Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024. 6

work page 2024
[26]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024. 3

work page 2024
[27]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Ji- uhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Tokenflow: Unified image tokenizer for multi- modal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xin- glong Wu. Tokenflow: Unified image tokenizer for multi- modal understanding and generation. InCVPR, 2025. 3

work page 2025
[30]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 3

work page 2021
[31]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InICLR, 2020. 3

work page 2020
[32]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278, 2025

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pan- deng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and edit- ing.arXiv preprint arXiv:2507.23278, 2025. 3

work page arXiv 2025
[33]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InCVPR, 2024. 6

work page 2024
[36]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017. 3

work page 2017
[37]

Sparsemm: Head sparsity emerges from visual concept re- sponses in mllms.arXiv preprint arXiv:2506.05344, 2025

Jiahui Wang, Zuyan Liu, Yongming Rao, and Jiwen Lu. Sparsemm: Head sparsity emerges from visual concept re- sponses in mllms.arXiv preprint arXiv:2506.05344, 2025. 4

work page arXiv 2025
[38]

Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pre- training, sft, and rl.arXiv preprint arXiv:2504.11455, 2025. 3, 7

work page arXiv 2025
[39]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 1, 3, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. 1, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Liq- uid: Language models are scalable and unified multi-modal generators.IJCV, 2024

Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Heng- shuang Zhao, Zehuan Yuan, Song Bai, and Xiang Bai. Liq- uid: Language models are scalable and unified multi-modal generators.IJCV, 2024. 1, 3

[44]

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024. 5

[45]

Reconstruction alignment improves unified multimodal models

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. arXiv preprint arXiv:2509.07295, 2025. 4

[46]

Show-o: One single transformer to unify multimodal understanding and generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025. 3, 5

[47]

Show-o2: Improved native unified multimodal models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. In NeurIPS,

[48]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 3

[49]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 6

[50]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024. 6

[51]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 4

[52]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024. 6

[53]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 5, 7

[54]

Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759, 2025. 1

Architecture Decoupling Is Not All You Need For Unified Multimodal Model

Supplementary Material

In this supplementary file, we provi...


[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

[2]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025. 2, 3, 7

[3]

Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025. 6

[4]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025. 3, 5, 6

[5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811,

[6]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR,

[7]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 1, 3, 5

[8]

The llama 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, 2024. 3

[9]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR,

[10]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 3, 5

[11]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 6

[12]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023. 6

[13]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237,

[14]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In CVPR, 2025. 5

[15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 3

[16]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024. 6

[17]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 3

[18]

Flux

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024. 3, 5, 7

[19]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 3, 6

[20]

Onecat: Decoder-only auto-regressive model for unified understanding and generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025. 1, 3, 5

[21]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 6

[22]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025. 3, 5

[23]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 3, 7

[24]

Shotbench: Expert-level cinematic understanding in vision-language models

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, et al. Shotbench: Expert-level cinematic understanding in vision-language models. In NeurIPS, 2025. 3

[25]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024. 6

[26]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024. 3

[27]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025. 3, 5

[28]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 3, 5

[29]

Tokenflow: Unified image tokenizer for multimodal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. In CVPR, 2025. 3

[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 3

[31]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2020. 3

[32]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025. 3

[33]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 1, 3, 5

[34]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 3

[35]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024. 6

[36]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017. 3

[37]

Sparsemm: Head sparsity emerges from visual concept responses in mllms

Jiahui Wang, Zuyan Liu, Yongming Rao, and Jiwen Lu. Sparsemm: Head sparsity emerges from visual concept responses in mllms. arXiv preprint arXiv:2506.05344, 2025. 4
