AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
Pith reviewed 2026-05-17 04:14 UTC · model grok-4.3
The pith
Unified multimodal models can fix understanding-generation conflicts by aligning cross-modal attention patterns rather than decoupling their architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Architecture decoupling does not solve task conflicts but essentially drives models toward cross-modal interaction patterns of task-specific models. The Attention Interaction Alignment (AIA) loss explicitly learns these patterns during training, refining cross-modal attention and boosting performance in generation and understanding tasks.
What carries the argument
Attention Interaction Alignment (AIA) loss, which explicitly aligns the cross-modal attention patterns of the unified model to those of task-specific models during training.
If this is right
- Models trained with AIA show refined cross-modal attention patterns similar to specialized systems.
- Both image generation and understanding performance improve without additional architecture changes.
- The approach maintains the interleave generation ability that decoupling tends to undermine.
- AIA can be applied during supervised fine-tuning or post-training stages on different base models.
Where Pith is reading between the lines
- Future unified models might achieve high performance with simpler, fully shared architectures if attention alignment is used.
- Combining attention patterns from multiple specialized models could lead to even stronger unified systems.
- This method may generalize to other conflicting task pairs in multimodal learning beyond vision and language.
Load-bearing premise
That the performance boost from decoupling comes primarily from shifting to task-specific cross-modal attention patterns, which can be replicated through a loss function without losing unified model benefits.
What would settle it
Apply AIA to a unified model and check if its attention maps become more similar to those in Qwen3-VL or HunyuanImage-3.0 while simultaneously measuring improvements in generation and understanding metrics alongside retention of interleaved output capability.
Figures
read the original abstract
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that architecture decoupling in unified multimodal models (e.g., Emu3, Janus-Pro) primarily improves performance by driving cross-modal attention toward task-specific patterns seen in decoupled models like Qwen3-VL and HunyuanImage-3.0; it proposes an Attention Interaction Alignment (AIA) loss to explicitly learn these patterns during SFT or post-training, thereby mitigating task conflicts while preserving interleaved generation ability and boosting both generation and understanding performance.
Significance. If the central mechanism is validated with quantitative attention comparisons and the reported gains hold under rigorous controls, the work could provide a lighter-weight alternative to heavy architectural decoupling for unified multimodal training, with potential to simplify model design while retaining strong cross-task capabilities.
major comments (3)
- [Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.
- [Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.
- [Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.
minor comments (1)
- [Abstract] Abstract: consider adding one sentence on the concrete datasets or benchmarks used for the Emu3 SFT and Janus-Pro post-training experiments to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional evidence and detail would strengthen the manuscript. We address each major comment below and have revised the paper to incorporate quantitative metrics, ablations, and methodological details as requested.
read point-by-point responses
-
Referee: [Abstract / Motivation] Abstract and motivation section: the core observation that 'the more thorough the decoupling, the more consistent the behavior becomes' and that decoupling 'drives models toward cross-modal interaction patterns of task-specific models' is load-bearing for the AIA proposal, yet no quantitative metrics (e.g., attention-map cosine similarity, KL divergence, or layer-wise statistics) are provided to compare unified-model attention before/after AIA against the cited reference models.
Authors: We agree that quantitative support for the core observation is essential. The original manuscript relied on qualitative attention visualizations in Section 3. In the revision we have added cosine similarity and KL-divergence measurements between the cross-modal attention maps of the unified models (pre- and post-AIA) and the reference task-specific models (Qwen3-VL and HunyuanImage-3.0). Layer-wise statistics are also reported. These metrics show a clear increase in similarity after AIA, directly supporting the claim that decoupling drives models toward task-specific patterns and that AIA reproduces this effect without architectural changes. revision: yes
-
Referee: [Experiments] Experiments section: the claim that AIA 'refines cross-modal attention patterns' and 'boosts both generation and understanding performance' lacks reported baselines, ablation studies, error bars, or direct evidence that post-AIA attention maps are measurably closer to task-specific references than pre-AIA maps; without these, it remains unclear whether gains stem from the claimed alignment mechanism or from generic regularization effects.
Authors: We acknowledge the need for stronger controls. The revised manuscript now includes: (1) additional baselines that isolate AIA from standard cross-entropy and regularization losses, (2) ablation studies removing the alignment term, (3) results with standard error bars computed over three independent runs, and (4) direct before/after quantitative attention-map comparisons (cosine similarity and KL) to the reference models. These additions demonstrate that performance improvements are larger and more consistent when the alignment term is present, supporting the mechanism over generic regularization. revision: yes
-
Referee: [Method] Method section: the AIA loss formulation is described at a high level as 'explicitly learns task-specific multimodal interaction patterns' but no equation, reference-pattern construction details, or hyper-parameter settings are given, preventing assessment of whether the loss is parameter-free or how it avoids circularity with the observed patterns.
Authors: We accept that the original description was insufficiently precise. The revised Method section now contains the full loss equation (new Equation 3), the procedure for constructing reference patterns (averaging attention maps extracted from frozen Qwen3-VL on understanding data and HunyuanImage-3.0 on generation data), and the hyper-parameter values used (λ = 0.5, temperature τ = 1.0). The loss is not parameter-free; it relies on fixed external reference models. Circularity is avoided because the references are obtained once from independent task-specific models and remain frozen during AIA training, rather than being derived from the model being optimized. revision: yes
Circularity Check
Derivation of AIA loss is self-contained with no circular reductions
full rationale
The paper's central proposal is the introduction of the Attention Interaction Alignment (AIA) loss, motivated by an analysis of cross-modal attention in decoupled models such as Qwen3-VL and HunyuanImage-3.0. This observation is external to the method itself. The AIA loss is then applied during training of Emu3 and Janus-Pro, with claims of improved performance and refined attention patterns. No equations are presented that equate a derived quantity to its own inputs by construction, nor are there load-bearing self-citations or fitted inputs renamed as predictions. The approach relies on empirical validation rather than tautological definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Architecture decoupling boosts performance by driving models toward cross-modal interaction patterns of task-specific models rather than resolving task conflicts.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training... LAIA = 1/L Σ Huber(Il − Tl, δl)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
as decoupling increases, the interaction patterns increasingly resemble those of single-task models
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
-
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
HunyuanImage 3.0 Technical Report
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 2, 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation
Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal mod- els with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025. 6
-
[4]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 3, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InCVPR,
-
[7]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
The llama 3 herd of models.arXiv e-prints, 2024
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, 2024. 3
work page 2024
-
[9]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR,
-
[10]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 3, 5
work page 2024
-
[11]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Geneval: An object-focused framework for evaluating text- to-image alignment
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment. InNeurIPS, 2023. 6
work page 2023
-
[13]
Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale.arXiv preprint arXiv:2412.05237,
-
[14]
Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bit- wise autoregressive modeling for high-resolution image syn- thesis. InCVPR, 2025. 5
work page 2025
-
[15]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3
work page 2020
-
[16]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 3
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[18]
Flux.https://github.com/ black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 3, 5, 7
work page 2024
-
[19]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025. 1, 3, 5
-
[21]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic en- coders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 3, 7 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Shotbench: Expert-level cinematic understand- ing in vision-language models
Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, et al. Shotbench: Expert-level cinematic understand- ing in vision-language models. InNeurIPS, 2025. 3
work page 2025
-
[25]
Mmbench: Is your multi-modal model an all-around player? InECCV, 2024
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024. 6
work page 2024
-
[26]
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024. 3
work page 2024
-
[27]
Transfer between Modalities with MetaQueries
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Ji- uhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Tokenflow: Unified image tokenizer for multi- modal understanding and generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xin- glong Wu. Tokenflow: Unified image tokenizer for multi- modal understanding and generation. InCVPR, 2025. 3
work page 2025
-
[30]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 3
work page 2021
-
[31]
Denois- ing diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InICLR, 2020. 3
work page 2020
-
[32]
Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pan- deng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and edit- ing.arXiv preprint arXiv:2507.23278, 2025. 3
-
[33]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Eyes wide shut? exploring the visual shortcomings of multimodal llms
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InCVPR, 2024. 6
work page 2024
-
[36]
Neural discrete representation learning
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017. 3
work page 2017
-
[37]
Jiahui Wang, Zuyan Liu, Yongming Rao, and Jiwen Lu. Sparsemm: Head sparsity emerges from visual concept re- sponses in mllms.arXiv preprint arXiv:2506.05344, 2025. 4
-
[38]
Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl
Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pre- training, sft, and rl.arXiv preprint arXiv:2504.11455, 2025. 3, 7
-
[39]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 1, 3, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Liq- uid: Language models are scalable and unified multi-modal generators.IJCV, 2024
Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Heng- shuang Zhao, Zehuan Yuan, Song Bai, and Xiang Bai. Liq- uid: Language models are scalable and unified multi-modal generators.IJCV, 2024. 1, 3
work page 2024
-
[44]
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model inte- grating visual understanding and generation.arXiv preprint arXiv:2409.04429, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
"Your output must be a single JSON object.\n\n
Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multi- modal models.arXiv preprint arXiv:2509.07295, 2025. 4
-
[46]
Show-o: One single transformer to unify multimodal understanding and generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InICLR, 2025. 3, 5
work page 2025
-
[47]
Show- o2: Improved native unified multimodal models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models. InNeurIPS,
-
[48]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. InCVPR, 2024. 6
work page 2024
-
[51]
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference.arXiv preprint arXiv:2410.04417, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Open-Sora: Democratizing Efficient Video Production for All
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang 10 You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark.arXiv preprint arXiv:2510.13759, 2025. 1 11 Architecture Decoupling Is Not All You Need For Unified Multimodal Model Supplementary Material In this supplementary file, we provi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.