TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Hao Chen; Hayes Bai; Hongyu Zhu; Jindong Wang; Marios Savvides; Pan He; Sharon Li; Wenwen Wang; Yinyi Luo

arxiv: 2604.10784 · v2 · pith:4XV7OROVnew · submitted 2026-04-12 · 💻 cs.AI

Pith reviewed 2026-05-21 08:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords: unified multimodal models · evaluation framework · multimodal understanding · multimodal generation · multimodal editing · post-training · reproducible benchmarking

The pith

TorchUMM supplies the first unified codebase for evaluating, analyzing, and post-training diverse unified multimodal models across tasks and datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TorchUMM to tackle the problem of comparing unified multimodal models that vary widely in architecture and training approach. It builds a single framework that handles evaluation on understanding, generation, and editing tasks while incorporating both standard and new datasets for measuring perception, reasoning, compositionality, and instruction following. A unified interface together with fixed protocols is meant to remove inconsistencies that arise when each research group writes its own test code. If this approach works, different model designs could be ranked on equal terms, making it clearer which choices improve performance on concrete abilities. The effort also includes tools for post-training so that insights from evaluation can be turned directly into model improvements.

Core claim

TorchUMM is presented as the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. It supports a broad spectrum of models covering a wide range of scales and design paradigms. The benchmark covers three core task dimensions—multimodal understanding, generation, and editing—and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities through a unified interface and standardized protocols.

What carries the argument

The unified interface and standardized evaluation protocols that let heterogeneous models be tested under the same conditions.
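
One way to picture such an interface is a small adapter contract that every backbone implements, with a single shared evaluation loop on top. The sketch below is illustrative only: `UMMAdapter`, `EchoModel`, `evaluate_understanding`, and the exact-match metric are hypothetical names and choices made for this example, not TorchUMM's actual API.

```python
from typing import Iterable, Protocol, Tuple

Sample = Tuple[bytes, str, str]  # (image bytes, question, reference answer)


class UMMAdapter(Protocol):
    """Hypothetical contract that each backbone wrapper would satisfy."""

    name: str

    def understand(self, image: bytes, question: str) -> str: ...


class EchoModel:
    """Toy stand-in for a real backbone; always returns a fixed answer."""

    def __init__(self, name: str, answer: str) -> None:
        self.name = name
        self._answer = answer

    def understand(self, image: bytes, question: str) -> str:
        return self._answer


def evaluate_understanding(model: UMMAdapter, samples: Iterable[Sample]) -> float:
    """Exact-match accuracy under one protocol shared by all models."""
    samples = list(samples)
    if not samples:
        raise ValueError("samples must be non-empty")
    correct = 0
    for image, question, reference in samples:
        prediction = model.understand(image, question)
        # Identical normalization for every model, so scoring never varies by backbone.
        correct += prediction.strip().lower() == reference.strip().lower()
    return correct / len(samples)


samples = [(b"", "What color is the sky?", "blue"), (b"", "How many cats?", "two")]
models = (EchoModel("model_a", "Blue "), EchoModel("model_b", "two"))
scores = {m.name: evaluate_understanding(m, samples) for m in models}
```

The point of the design is that anything model-specific lives behind `understand`, while prompts, normalization, and scoring sit in one place that no adapter can alter.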

If this is right

Researchers can run reproducible comparisons across models that differ in scale and design paradigm.
Strengths and limitations in perception, reasoning, and instruction-following become visible under consistent conditions.
Post-training routines can be applied uniformly to improve models after evaluation.
New datasets can be added to the same benchmark structure without rewriting test harnesses.
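
The last point, adding datasets without touching the harness, is commonly achieved with a registry. The following is a minimal sketch under that assumption; `register_dataset`, `DATASETS`, `run_benchmark`, and the toy datasets are hypothetical and do not come from the TorchUMM codebase.

```python
from typing import Callable, Dict, Iterable, Tuple

Sample = Tuple[str, str]  # (prompt, reference answer)
DATASETS: Dict[str, Callable[[], Iterable[Sample]]] = {}


def register_dataset(name: str):
    """Decorator that adds a loader to the registry under a unique name."""

    def decorator(loader: Callable[[], Iterable[Sample]]):
        if name in DATASETS:
            raise KeyError(f"dataset {name!r} already registered")
        DATASETS[name] = loader
        return loader

    return decorator


@register_dataset("toy_counting")
def load_toy_counting():
    return [("How many sides does a triangle have?", "3")]


@register_dataset("toy_colors")
def load_toy_colors():
    return [("What color is grass?", "green")]


def run_benchmark(answer_fn: Callable[[str], str]) -> Dict[str, float]:
    """Run every registered dataset through the same scoring loop."""
    results = {}
    for name, loader in DATASETS.items():
        samples = list(loader())
        hits = sum(answer_fn(prompt) == reference for prompt, reference in samples)
        results[name] = hits / len(samples)
    return results


# A trivial "model" that always answers "3".
results = run_benchmark(lambda prompt: "3")
```

Registering a third dataset only requires one more decorated loader; `run_benchmark` itself stays unchanged.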

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of one codebase could cut the time teams spend re-implementing evaluation pipelines for each new model release.
The structure might make it easier to test whether gains on one task transfer to the other two task dimensions.
Community contributions could expand the set of post-training methods available inside the same standardized setting.

Load-bearing premise

A single unified interface and standardized protocols can fairly compare models with fundamentally different architectures and training paradigms without introducing framework-specific biases or implementation artifacts.

What would settle it

Evaluating the same set of models once inside TorchUMM and once with each model's original author-provided evaluation scripts, then finding large differences in reported scores or reversed model rankings, would show that the unified protocols do not remove bias.
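
The check described above reduces to two quantities: per-model score gaps and the set of model pairs whose order flips between the two pipelines. A minimal sketch follows; the function names are hypothetical, and the numbers are made-up placeholders used only to exercise the code, not results from the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def score_gaps(unified: Dict[str, float], native: Dict[str, float]) -> Dict[str, float]:
    """Unified-framework score minus native-script score, per shared model."""
    return {m: unified[m] - native[m] for m in sorted(unified.keys() & native.keys())}


def reversed_pairs(unified: Dict[str, float], native: Dict[str, float]) -> List[Tuple[str, str]]:
    """Model pairs ranked in opposite order by the two pipelines."""
    shared = sorted(unified.keys() & native.keys())
    flipped = []
    for a, b in combinations(shared, 2):
        # Opposite signs mean the two pipelines disagree on which model is better.
        if (unified[a] - unified[b]) * (native[a] - native[b]) < 0:
            flipped.append((a, b))
    return flipped


# Placeholder scores (NOT from the paper).
unified = {"model_a": 71.0, "model_b": 69.5, "model_c": 60.0}
native = {"model_a": 70.0, "model_b": 70.5, "model_c": 61.0}

gaps = score_gaps(unified, native)
flips = reversed_pairs(unified, native)
```

In this toy case every gap is one point in magnitude, yet `model_a` and `model_b` swap places, which is the kind of outcome that would count against the unified protocols.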

Figures

Figures reproduced from arXiv: 2604.10784 by Hao Chen, Hayes Bai, Hongyu Zhu, Jindong Wang, Marios Savvides, Pan He, Sharon Li, Wenwen Wang, Yinyi Luo.

**Figure 1.** Overview of TorchUMM.

**Figure 2.** Representative UEval cases across models with different paradigms of unification. The first row …

**Figure 3.** Query-variation analysis under two backbone–model pairings. In each row, the left panel shows …

Original abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TorchUMM is a codebase release that organizes evaluation for unified multimodal models, but the abstract gives no evidence that its unified interface avoids introducing performance artifacts across architectures.

TorchUMM is mainly an engineering release that bundles evaluation, analysis, and post-training for a range of unified multimodal models under one interface and set of protocols. The paper positions it as the first such unified codebase spanning different backbones, scales, and design choices, with benchmarks covering understanding, generation, and editing plus a mix of established and new datasets for perception, reasoning, and instruction following. Code is linked on GitHub, which is the practical part that stands out here. Pulling models and tasks into a single place can cut down on duplicated effort when people want to run the same tests on different systems. That organizational step is the clearest contribution on offer.

The central assumption is that standardized protocols will produce fair comparisons without framework-specific biases. The abstract states this goal but shows no supporting checks, such as side-by-side runs that match original model scores when executed inside TorchUMM versus their native implementations. If the abstraction layer requires changes to tokenizers, encoders, or fusion steps, those changes could shift measured performance in ways the paper does not quantify. The stress-test note flags exactly this risk, and nothing in the provided description rules it out.

This work is for researchers who need a ready evaluation harness for multimodal systems rather than new algorithms or theory. A reader already running their own comparisons might find the shared protocols convenient if the code proves extensible and well-documented. It deserves peer review because toolkits can raise baseline standards when the implementation details are solid, and referees can examine the actual code and any validation experiments that the abstract omits.

Referee Report

1 major / 2 minor

Summary. The manuscript presents TorchUMM as the first unified codebase for comprehensive evaluation, analysis, and post-training of diverse unified multimodal models (UMMs). It supports a broad range of model backbones spanning scales and design paradigms, covers three core task dimensions (multimodal understanding, generation, and editing), integrates established and novel datasets for assessing perception, reasoning, compositionality, and instruction-following, and supplies a unified interface with standardized protocols claimed to enable fair and reproducible comparisons across heterogeneous models.

Significance. If the abstraction layer successfully unifies heterogeneous tokenizers, vision encoders, fusion mechanisms, and objectives without introducing measurable implementation artifacts, TorchUMM could become a valuable community resource that reduces redundant engineering effort and promotes standardized benchmarking in multimodal AI.

major comments (1)

[Abstract] The central claim that the unified interface and standardized protocols 'enable fair and reproducible comparisons across heterogeneous models' is load-bearing yet unsupported by any empirical evidence in the manuscript, such as side-by-side re-evaluations of models in TorchUMM versus their original codebases or ablations quantifying performance shifts attributable to the abstraction layer.

minor comments (2)

The manuscript would benefit from an explicit table or section enumerating all supported UMM backbones together with the precise integration points (e.g., tokenizer wrappers, vision-encoder adapters) used to achieve unification.
Clarify whether post-training routines are implemented uniformly or require model-specific overrides, and document any such overrides to allow readers to assess potential bias.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing TorchUMM. We address the single major comment below and outline the planned revisions.

Referee: [Abstract] The central claim that the unified interface and standardized protocols 'enable fair and reproducible comparisons across heterogeneous models' is load-bearing yet unsupported by any empirical evidence in the manuscript, such as side-by-side re-evaluations of models in TorchUMM versus their original codebases or ablations quantifying performance shifts attributable to the abstraction layer.

Authors: We acknowledge that the manuscript currently lacks direct empirical validation of this claim, such as side-by-side performance comparisons between TorchUMM and original model implementations or ablations isolating effects of the abstraction layer. The paper emphasizes the design of the unified interface to support heterogeneous tokenizers, encoders, fusion mechanisms, and objectives while aiming to avoid implementation artifacts, but does not quantify this through new experiments. To address the concern, we will add a dedicated subsection in the revised manuscript (likely in Section 4 or an appendix) presenting side-by-side evaluations on a representative subset of models and tasks. These will compare results obtained via TorchUMM against those from the original codebases or reported numbers, along with ablations measuring any performance shifts due to the standardization layer. This will provide the requested evidence for fair and reproducible comparisons.

Revision: yes

Circularity Check

0 steps flagged

No circularity: engineering codebase presentation with no derivation chain

The manuscript introduces TorchUMM as a new unified codebase and standardized evaluation protocols for multimodal models. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claim is the existence and utility of the software artifact itself, which is externally verifiable via the released code and independent benchmarks rather than reducing to any self-referential input, self-citation chain, or ansatz. All enumerated circularity patterns are absent; the work is self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software-tool paper whose central claim rests on the existence and correctness of the released codebase rather than on mathematical axioms or fitted parameters. No free parameters, domain axioms, or invented entities are invoked in the abstract.

pith-pipeline@v0.9.0 · 5745 in / 987 out tokens · 38039 ms · 2026-05-21T08:55:01.344111+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "TorchUMM supports a broad spectrum of models... standardized interface that abstracts away model-specific details"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "unified interface and standardized evaluation protocols"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM · 2026-05 · unverdicted · novelty 7.0
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

LatentUMM: Dual Latent Alignment for Unified Multimodal Models
cs.CV · 2026-05 · unverdicted · novelty 6.0
LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 2 Pith papers · 23 internal anchors

[1] GPT-4 Technical Report
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.

[3] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi ...

[4] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025b. OpenCompass Contributors. OpenCompass: a universal evaluation platform for foundation models (2023). URL https://github.com/open...

[5] Emu3.5: Native multimodal models are world learners
Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583.

[6] Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.

[7] From multimodal LLM to human-level AI: Modality, instruction, reasoning, efficiency and beyond
Hao Fei, Yuan Yao, Zhuosheng Zhang, Fuxiao Liu, Ao Zhang, and Tat-Seng Chua. From multimodal LLM to human-level AI: Modality, instruction, reasoning, efficiency and beyond. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries, pages 1–8.

[8] Style outweighs substance: Failure modes of LLM judges in alignment benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, and John P Dickerson. Style outweighs substance: Failure modes of LLM judges in alignment benchmarking. arXiv preprint arXiv:2409.15268.

[9] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.

[10] TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373.

[11] Unicorn: Towards self-improving unified multimodal models through self-generated supervision
Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, et al. Unicorn: Towards self-improving unified multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193.

[12] ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.

[13] Interleaving reasoning for better text-to-image generation
Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945.

[14] UEval: A benchmark for unified multimodal generation
Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, and Zhuang Liu. UEval: A benchmark for unified multimodal generation. arXiv preprint arXiv:2601.22155.

[15] Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761.

[16] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.

[17] Do instruction-tuned models always perform better than base models? Evidence from math and domain-shifted benchmarks
Prateek Munjal, Clement Christophe, Ronnie Rajan, and Praveenkumar Kanithi. Do instruction-tuned models always perform better than base models? Evidence from math and domain-shifted benchmarks. arXiv preprint arXiv:2601.13244.

[18] WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. WISE: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265.

[19] Uni-cot: Towards unified chain-of-thought reasoning across text and vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606.

[20] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026a. URL https://qwen.ai/blog?id=qwen3.5. Qwen Team. Qwen3-vl-embedding-8b, 2026b. URL https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B. Hugging Face model card. Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large VLM-based vision-language-action...

[21] OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
DataFlow Team et al. OpenWorldLib: A unified codebase and definition of advanced world models. arXiv preprint arXiv:2604.04707.

[22] Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[23] Quantifying the gap between understanding and generation within unified multimodal models
Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, and Tianyi Zhou. Quantifying the gap between understanding and generation within unified multimodal models. arXiv preprint arXiv:2602.02140, 2026a. Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, et al. ...

[24] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848.

[25] OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025a. Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, and Rongrong Ji. Micon-bench: Benchmarking and enhancing multi-image context image gen...

[26] Openuni: A simple baseline for unified multimodal understanding and generation
Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025b. Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. In ICLR.

[27] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[28] Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564.

[29] MMaDA: Multimodal Large Diffusion Language Models
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809.

[30] ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275.

[31] MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.

[32] Lmms-eval: Reality check on the evaluation of large multimodal models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916.

[33] Unified multimodal understanding and generation models: Advances, challenges, and opportunities
Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567.

[34] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-MMMU: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759.

[35] Table 7: GenEval sub-scores.

| model | single_object | two_object | counting | colors | position | color_attr | overall |
|---|---|---|---|---|---|---|---|
| bagel(w/o think) | 99.38 | 94.19 | 78.75 | 87.77 | 51 | 61.75 | 78.81 |
| blip3o | 98.12 | 93.18 | 73.44 | 86.17 | 72.75 | 64.5 | 81.36 |
| show_o2(7B) | 97.81 | 71.46 | 48.75 | 78.46 | 20 | 42.75 | 59.87 |
| show_o2(1.5B) | 96.88 | 64.39 | 46.88 | 76.06 | 16.75 | 32 | 55.49 |
| Janus_pro | 97.81 | 86.62 | 57.5 | 89.36 | 76 | 66.25 | 78.9... |

...
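
In the four complete rows of the GenEval table above, the overall value appears to equal the unweighted mean of the six sub-scores. The extract does not state this, so the sketch below treats it as an assumption and simply checks it against the listed numbers; the truncated Janus_pro row is left out because its overall value is cut off.

```python
# Sub-scores in column order: single_object, two_object, counting,
# colors, position, color_attr; the second tuple element is the listed overall.
rows = {
    "bagel(w/o think)": ([99.38, 94.19, 78.75, 87.77, 51, 61.75], 78.81),
    "blip3o": ([98.12, 93.18, 73.44, 86.17, 72.75, 64.5], 81.36),
    "show_o2(7B)": ([97.81, 71.46, 48.75, 78.46, 20, 42.75], 59.87),
    "show_o2(1.5B)": ([96.88, 64.39, 46.88, 76.06, 16.75, 32], 55.49),
}

TOLERANCE = 0.01  # allows for two-decimal rounding in the listed values


def mean_subscore(subscores):
    """Unweighted mean of the sub-scores (assumed aggregation rule)."""
    return sum(subscores) / len(subscores)


# Rows whose listed overall differs from the recomputed mean by more than TOLERANCE.
mismatches = {
    name: (mean_subscore(subs), overall)
    for name, (subs, overall) in rows.items()
    if abs(mean_subscore(subs) - overall) > TOLERANCE
}
```

An empty `mismatches` dict means the listed rows are consistent with simple averaging; it does not rule out some other aggregation that happens to agree on these four rows.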

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi ...

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025b. OpenCompass Contributors. Opencompass: a universal evaluation platform for foundation models (2023). URL https://github. com/open...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

From multimodal llm to human-level ai: Modality, instruction, reasoning, efficiency and beyond

Hao Fei, Yuan Yao, Zhuosheng Zhang, Fuxiao Liu, Ao Zhang, and Tat-Seng Chua. From multimodal llm to human-level ai: Modality, instruction, reasoning, efficiency and beyond. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries, pages 1–8,

work page 2024

[8] [8]

Style outweighs substance: Failure modes of llm judges in alignment benchmarking.arXiv preprint arXiv:2409.15268,

Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, and John P Dickerson. Style outweighs substance: Failure modes of llm judges in alignment benchmarking.arXiv preprint arXiv:2409.15268,

work page arXiv

[9] [9]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

15 Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394,

work page internal anchor Pith review Pith/arXiv arXiv

[10]

Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373.

[11]

Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, et al. Unicorn: Towards self-improving unified multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193.

[12]

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.

[13]

Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945.

[14]

Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, and Zhuang Liu. Ueval: A benchmark for unified multimodal generation. arXiv preprint arXiv:2601.22155.

[15]

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761.

[16]

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.

[17]

Prateek Munjal, Clement Christophe, Ronnie Rajan, and Praveenkumar Kanithi. Do instruction-tuned models always perform better than base models? evidence from math and domain-shifted benchmarks. arXiv preprint arXiv:2601.13244.

[18]

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265.

[19]

Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606.

[20]

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026a. URL https://qwen.ai/blog?id=qwen3.5.

Qwen Team. Qwen3-vl-embedding-8b, 2026b. URL https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B. Hugging Face model card.

Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large vlm-based vision-language-action...

[21]

DataFlow Team et al. Openworldlib: A unified codebase and definition of advanced world models. arXiv preprint arXiv:2604.04707.

[22]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[23]

Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, and Tianyi Zhou. Quantifying the gap between understanding and generation within unified multimodal models. arXiv preprint arXiv:2602.02140, 2026a.

Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, et al. ...

[24]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848.

[25]

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025a.

Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, and Rongrong Ji. Micon-bench: Benchmarking and enhancing multi-image context image gen...

[26]

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025b.

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. In ICLR.

[27]

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[28]

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564.

[29]

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809.

[30]

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275.

[31]

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.

[32]

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.

[33]

Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567.

[34]

Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759.

Table 7: GenEval sub-scores.

model              single_object  two_object  counting  colors  position  color_attr  overall
bagel(w/o think)   99.38          94.19       78.75     87.77   51        61.75       78.81
blip3o             98.12          93.18       73.44     86.17   72.75     64.5        81.36
show_o2(7B)        97.81          71.46       48.75     78.46   20        42.75       59.87
show_o2(1.5B)      96.88          64.39       46.88     76.06   16.75     32          55.49
Janus_pro          97.81          86.62       57.5      89.36   76        66.25       78.9...