pith. machine review for the scientific record.

arxiv: 2604.25072 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords: unified multimodal models · cross-task consistency · scene graph · visual understanding · visual generation · semantic alignment · learning objectives · evaluation benchmark

The pith

Unified multimodal models can excel at understanding and generation separately yet fail to stay consistent on the same visual facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluations of unified multimodal models treat visual understanding and generation as independent capabilities, missing whether the model maintains coherent representations across them. This work introduces XTC-Bench, which grounds both tasks in the same scene graph to compare outputs at the level of individual facts about objects, attributes, and relations. The key finding is that strong performance on either task alone does not ensure alignment between them, and that consistency depends on how closely the training objectives link the modalities rather than on sharing an architecture. A new metric, Continuous Cross-Task Agreement, scores this agreement fact by fact. This matters because true unification requires not just capability in both areas but coherent internal representations.

Core claim

Unified Multimodal Models do not automatically learn representations that are consistent across understanding and generation tasks for the same visual concept. Experiments across nine models show that high accuracy in one or both tasks frequently coexists with low semantic agreement on matched facts. Consistency is instead determined by the degree to which learning objectives are coupled across modalities, independent of architectural unification.

What carries the argument

XTC-Bench, a scene-graph-grounded evaluation framework that derives both generation prompts and understanding queries from the same structured scene graph to enable fact-level alignment analysis via the Continuous Cross-Task Agreement (CCTA) metric.
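To make the mechanism concrete, here is a minimal sketch of how a single scene graph can feed both tasks. The Fact structure, templates, and function names are hypothetical illustrations of the XTC-Bench idea, not the authors' implementation.

```python
# Hypothetical sketch of scene-graph-grounded derivation: one structured
# scene graph yields both a text-to-image prompt and per-fact yes/no
# understanding queries, so mismatches attach to individual atomic facts.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One atomic fact: an object, an attribute, or a relation."""
    kind: str          # "object" | "attribute" | "relation"
    subject: str
    predicate: str = ""
    obj: str = ""

def to_generation_prompt(facts: list[Fact]) -> str:
    """Render every fact into a single text-to-image prompt."""
    parts = []
    for f in facts:
        if f.kind == "object":
            parts.append(f"a {f.subject}")
        elif f.kind == "attribute":
            parts.append(f"a {f.predicate} {f.subject}")
        else:  # relation
            parts.append(f"a {f.subject} {f.predicate} a {f.obj}")
    return "A photo of " + ", ".join(parts) + "."

def to_understanding_query(fact: Fact) -> str:
    """Render one fact as a yes/no question about the generated image."""
    if fact.kind == "attribute":
        return f"Is the {fact.subject} {fact.predicate}?"
    if fact.kind == "relation":
        return f"Is the {fact.subject} {fact.predicate} the {fact.obj}?"
    return f"Is there a {fact.subject} in the image?"

facts = [
    Fact("object", "dog"),
    Fact("attribute", "dog", "brown"),
    Fact("relation", "dog", "sitting on", "sofa"),
]
print(to_generation_prompt(facts))
for f in facts:
    print(to_understanding_query(f))
```

Because the prompt and every query trace back to the same atomic facts, a disagreement between what the model renders and what it answers can be pinned to a specific object, attribute, or relation.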

Load-bearing premise

That deriving both generation prompts and understanding queries from the same scene graph produces fact-level alignment that accurately isolates internal model consistency from standalone task performance.

What would settle it

A model that achieves high scores on both generation and understanding tasks while also showing high agreement on the specific facts extracted from the shared scene graph, particularly if its objectives are loosely coupled.

Figures

Figures reproduced from arXiv: 2604.25072 by Anton Hackl, Antonio Rueda-Toicen, Constantin Alexander Auga, Gerard de Melo, Jona Otholt, Liudvikas Zekas, Parisa Shahabinejad, Weixing Wang.

Figure 1. OmniGen can generate attribute details while failing to recognize them (top), whereas BAGEL understands relational viewpoints that it fails to render (bottom).

Figure 2. Overview of the evaluation framework. Cross-Task Consistency is evaluated by aggregating Understanding and Generation performance anchored on the same scene.

Figure 3. Scene-graph extraction pipeline, producing the final reference scene graph used for evaluation; candidate relations are verified by Qwen3-VL-235B [2] through a visual question answering step.

Figure 4. Task performance (G, U) versus semantic consistency (AW-CCTA).

Figure 5. The web-based platforms used for the Scene Graph Quality Evaluation. Annotators assess the correctness of extracted objects, attributes, and relations using segmentation mask overlays.

Figure 6. The interface for prompt fidelity evaluation. Color-coded highlighting links atomic facts within natural language prompts to their image locations.

Figure 7. The interface for LLM-Judge Reliability. Annotators score semantic equivalence of Model Answers against Ground Truth on a 0–5 scale.

Figure 8. Per-dimension tornado plots for BAGEL and BLIP3o. Each plot shows attribute and relation category scores for generation and understanding, along with the corresponding imbalance.

Figure 9. Per-dimension tornado plots for Gemini and GPT.

Figure 10. Per-dimension tornado plots for JanusPro and MMaDA.

Figure 11. Per-dimension tornado plots for OmniGen2 and Tar.

Figure 12. Per-dimension tornado plots for Show-o and Show-o2.
read the original abstract

Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.
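The abstract defines CCTA only informally. As a hedged sketch, suppose every matched fact already carries a generation-side score and an understanding-side score in [0, 1]; the aggregation below is our illustrative assumption, not the published formula.

```python
# Illustrative (not the paper's) aggregation of cross-task agreement:
# each matched atomic fact f has g[f] = how faithfully the model rendered
# it, and u[f] = how correctly the model answered about it, both in [0, 1].
import numpy as np

def cross_task_agreement(g: np.ndarray, u: np.ndarray) -> float:
    """Mean per-fact agreement: 1 when both tasks score a fact identically,
    0 when one task fully succeeds where the other fully fails."""
    assert g.shape == u.shape
    return float(np.mean(1.0 - np.abs(g - u)))

# Toy scores: both tasks look decent in isolation, but their per-fact
# successes barely overlap, so agreement is far below either task score.
g = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 0.0])   # generation: 4/6 facts
u = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])   # understanding: 3/6 facts
print(f"gen {g.mean():.2f}  und {u.mean():.2f}  "
      f"agreement {cross_task_agreement(g, u):.2f}")   # 0.67, 0.50, 0.17
```

This is the decoupling the abstract reports: standalone task accuracy says nothing about whether the same facts succeed on both sides.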

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces XTC-Bench, a scene-graph-grounded evaluation framework for unified multimodal models (uMMs), along with the Continuous Cross-Task Agreement (CCTA) metric. It derives both generation prompts and understanding queries from the same structured scene graph to enable fact-level semantic alignment analysis across objects, attributes, and relations. Experiments on nine models (eight open-source, one commercial) show that high standalone generation or understanding performance does not imply strong cross-task consistency, and that consistency depends on tight coupling of learning objectives across modalities rather than architectural unification alone.

Significance. If the CCTA metric validly isolates representation-level consistency, the work provides a reproducible, model-agnostic benchmark that shifts evaluation of uMMs from isolated task accuracies toward diagnosing unified representations. The finding that objective coupling governs consistency offers a concrete, actionable direction for model development beyond architectural unification. The framework's grounding in scene graphs and focus on atomic facts is a methodological strength.

major comments (3)
  1. [Methods (CCTA definition)] Methods section on CCTA definition: The metric quantifies semantic agreement over matched atomic facts extracted from the shared scene graph and is claimed to isolate internal consistency from standalone task accuracy. However, generation uses synthesized prompts from the graph while understanding uses queries on the resulting image, so mismatches can arise from stylistic differences in prompt/query formulation or noise in automatic fact extraction. No ablations or controls (e.g., human-verified facts, varied prompt styles, or extraction sensitivity tests) are described to rule out these artifacts; this is load-bearing for the central claim that low CCTA reflects model misalignment rather than evaluation confounds.
  2. [Experiments] Experiments section (results on nine models): The abstract and results report CCTA scores supporting that high task performance does not imply alignment, but without visible error bars, statistical significance tests, or full details on data splits, scene-graph construction, and extraction pipelines, it is impossible to confirm robustness or rule out post-hoc choices affecting the scores. Specific per-model tables correlating CCTA with accuracy should be provided with variance estimates.
  3. [Architectural analysis] Architectural analysis section: The conclusion that consistency is governed by objective coupling (not unification alone) requires concrete model comparisons. For example, identify pairs of models with similar architectures but differing objective couplings and report their quantitative CCTA differences; without this, the architectural claim rests on correlational observations rather than controlled evidence.
minor comments (2)
  1. [Abstract] Abstract and introduction: Explicitly list the nine evaluated models (with citations) rather than describing them only as 'eight open-source and one commercial' to improve reproducibility.
  2. [Figures/Tables] Figures/tables: Add error bars or confidence intervals to any plots or tables reporting CCTA scores, and ensure legends clearly distinguish CCTA from standalone accuracy metrics.
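Minor comment 2 asks for error bars on CCTA scores. One standard way to produce them, sketched below under our own assumptions (scene-level resampling, toy data), is a nonparametric bootstrap over per-scene CCTA values.

```python
# Sketch of the requested error bars: percentile-bootstrap confidence
# interval for a model's mean CCTA over scenes. Resampling at the scene
# level is our assumption; the paper may aggregate differently.
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-scene scores."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

ccta = np.random.default_rng(1).beta(5, 3, size=200)   # hypothetical scores
lo, hi = bootstrap_ci(ccta)
print(f"CCTA {ccta.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```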

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify several aspects of our work. We address each major comment point by point below, indicating the revisions made to strengthen the manuscript.

read point-by-point responses
  1. Referee: Methods section on CCTA definition: The metric quantifies semantic agreement over matched atomic facts extracted from the shared scene graph and is claimed to isolate internal consistency from standalone task accuracy. However, generation uses synthesized prompts from the graph while understanding uses queries on the resulting image, so mismatches can arise from stylistic differences in prompt/query formulation or noise in automatic fact extraction. No ablations or controls (e.g., human-verified facts, varied prompt styles, or extraction sensitivity tests) are described to rule out these artifacts; this is load-bearing for the central claim that low CCTA reflects model misalignment rather than evaluation confounds.

    Authors: We agree that potential artifacts from automatic extraction and prompt formulation must be ruled out to support the central claim. The original manuscript relied on the shared scene-graph structure to ensure semantic equivalence but did not report explicit controls. In the revision, we have added a dedicated subsection in Methods with three controls: human verification of 200 randomly sampled facts (94% agreement with automatic extraction), sensitivity analysis across three prompt/query style variants (CCTA standard deviation below 0.04), and robustness checks using an alternative fact extractor. These results are now reported in the main text and Appendix, confirming that CCTA primarily captures model-level misalignment rather than evaluation noise. We have also revised the CCTA definition paragraph to explicitly discuss how the metric isolates consistency from task accuracy. revision: yes

  2. Referee: Experiments section (results on nine models): The abstract and results report CCTA scores supporting that high task performance does not imply alignment, but without visible error bars, statistical significance tests, or full details on data splits, scene-graph construction, and extraction pipelines, it is impossible to confirm robustness or rule out post-hoc choices affecting the scores. Specific per-model tables correlating CCTA with accuracy should be provided with variance estimates.

    Authors: We acknowledge the need for greater experimental transparency and reproducibility. The revised Experiments section now includes error bars (standard deviation over three random seeds for open-source models), paired statistical significance tests (Wilcoxon signed-rank; a sketch of this test appears after these responses) on CCTA differences, and expanded details on scene-graph construction (sourced from Visual Genome), the evaluation split, and the full extraction pipeline (including LLM prompts and filtering steps) in the Appendix. We have also added a new table correlating per-model CCTA with generation and understanding accuracies, complete with variance estimates. These changes directly address concerns about post-hoc choices and confirm the robustness of the reported trends. revision: yes

  3. Referee: Architectural analysis section: The conclusion that consistency is governed by objective coupling (not unification alone) requires concrete model comparisons. For example, identify pairs of models with similar architectures but differing objective couplings and report their quantitative CCTA differences; without this, the architectural claim rests on correlational observations rather than controlled evidence.

    Authors: We accept that the original architectural analysis was largely observational. To provide controlled evidence, the revised section now explicitly identifies and compares model pairs with comparable architectures but differing objective couplings (e.g., models trained with joint cross-modal objectives versus those with more decoupled per-task objectives). We report the corresponding CCTA differences for these pairs and discuss how the degree of objective coupling explains the observed consistency gaps beyond architectural unification alone. This addition moves the claim from correlation toward controlled comparison while remaining grounded in the evaluated models. revision: yes
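Response 2 above cites Wilcoxon signed-rank tests on CCTA differences; a minimal sketch of such a paired test between two models follows. Pairing per scene and the toy scores are our assumptions about how the test would be applied.

```python
# Sketch of the paired significance test named in response 2: Wilcoxon
# signed-rank on per-scene CCTA differences between two models evaluated
# on the same scenes. All data here are hypothetical.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
ccta_a = rng.beta(6, 3, size=200)                                 # model A
ccta_b = np.clip(ccta_a - rng.normal(0.03, 0.05, 200), 0.0, 1.0)  # model B

stat, p = wilcoxon(ccta_a, ccta_b)
print(f"Wilcoxon W = {stat:.1f}, p = {p:.4g}")
# A small p suggests the per-scene CCTA gap between the models is not
# zero-centered noise; a large p leaves the gap unresolved.
```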

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper introduces XTC-Bench and the CCTA metric as an explicit new framework: generation prompts and understanding queries are both derived from the same scene graph to enable fact-level comparison, with CCTA defined directly as semantic agreement over matched atomic facts. This construction isolates the consistency measure by design rather than deriving it from model outputs or prior results. The central claims (high task performance does not imply alignment; consistency depends on objective coupling) are presented as empirical findings from experiments across nine models, without any equations, fitted parameters, or self-citations that reduce the result to its inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the abstract or described methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that scene graphs provide a reliable, atomic decomposition of visual semantics suitable for cross-task comparison; the new metric CCTA is defined in terms of this decomposition.

axioms (1)
  • domain assumption Scene graphs can be used to derive both generation prompts and understanding queries that enable fact-level semantic alignment analysis across tasks.
    Invoked when the framework is introduced to ground prompts and queries in the same structured representation.
invented entities (1)
  • Continuous Cross-Task Agreement (CCTA) no independent evidence
    purpose: Fine-grained metric quantifying semantic agreement between generation and understanding over matched atomic facts.
    Newly defined quantity introduced to isolate internal consistency from standalone accuracy.
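For concreteness, one plausible formalization of this ledger (the notation and agreement kernel are ours; the paper's exact definition may differ):

```latex
% Hedged formalization; symbols are ours, not the paper's.
% A scene graph decomposes a scene into atomic facts
% $F = O \cup A \cup R$ (objects, attributes, relations).
% With per-fact generation and understanding scores $g_f, u_f \in [0,1]$,
% a continuous cross-task agreement has the general shape
\[
  \mathrm{CCTA} \;=\; \frac{1}{|F|} \sum_{f \in F} s(g_f, u_f),
  \qquad s : [0,1]^2 \to [0,1],
\]
% where $s$ is an agreement kernel, e.g. $s(g,u) = 1 - \lvert g - u \rvert$,
% maximal only when the two tasks succeed or fail together on each fact.
```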

pith-pipeline@v0.9.0 · 5544 in / 1290 out tokens · 55332 ms · 2026-05-08T04:02:26.596619+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 39 canonical work pages · 16 internal anchors

  1. [1] Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Batra, D., Parikh, D.: VQA: Visual question answering (2016), https://arxiv.org/abs/1505.00468

  2. [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...: Qwen3-VL Technical Report

  3. [3] Chang, X., Ren, P., Xu, P., Li, Z., Chen, X., Hauptmann, A.: A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 1–26 (2021)

  4. [4] Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., Xue, L., Xiong, C., Xu, R.: BLIP3-o: A family of fully open unified multimodal models - architecture, training and dataset (2025), https://arxiv.org/abs/2505.09568

  5. [5] Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified multimodal understanding and generation with data and model scaling (2025), https://arxiv.org/abs/2501.17811

  6. [6] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation (2022), https://arxiv.org/abs/2112.01527

  7. [7] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation (2021), https://arxiv.org/abs/2107.06278

  8. [8] Chong, M.J., Forsyth, D.: Effectively unbiased FID and Inception Score and where to find them (2020), https://arxiv.org/abs/1911.07023

  9. [9] Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3076–3086 (2017)

  10. [10] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining (2025), https://arxiv.org/abs/2505.14683

  11. [11] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: MME: A comprehensive evaluation benchmark for multimodal large language models (2025), https://arxiv.org/abs/2306.13394

  12. [12] Ghosh, D., Hajishirzi, H., Schmidt, L.: GenEval: An object-focused framework for evaluating text-to-image alignment (2023), https://arxiv.org/abs/2310.11513

  13. [13] Ging, S., Bravo, M.A., Brox, T.: Open-ended VQA benchmarking of vision-language models by exploiting classification datasets and their semantic hierarchy (2024), https://arxiv.org/abs/2402.07270

  14. [14] Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., Guo, J.: A survey on LLM-as-a-judge (2025), https://arxiv.org/abs/2411.15594

  15. [15] Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations (2025), https://arxiv.org/abs/2506.18898

  16. [16] Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: A reference-free evaluation metric for image captioning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 7514–7528 (2021)

  17. [17] Huang, K., Duan, C., Sun, K., Xie, E., Li, Z., Liu, X.: T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation (2025), https://arxiv.org/abs/2307.06350

  18. [18] Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., Shi, H.: OneFormer: One transformer to rule universal image segmentation (2022), https://arxiv.org/abs/2211.06220

  19. [19] Jiang, Y., Yang, D., Han, M., Han, J., Chen, Z., Liu, Y., Li, M., Zhai, P., Zhang, L.: Fysicsworld: A unified full-modality benchmark for any-to-any understanding, generation, and reasoning (2025), https://arxiv.org/abs/2512.12756

  20. [20] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations (2016), https://arxiv.org/abs/1602.07332

  21. [21] Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: Benchmarking multimodal LLMs with generative comprehension (2023), https://arxiv.org/abs/2307.16125

  22. [22] Li, Y., Wang, H., Zhang, Q., Xiao, B., Hu, C., Wang, H., Li, X.: UniEval: Unified holistic evaluation for unified multimodal understanding and generation (2025), https://arxiv.org/abs/2505.10483

  23. [23] Li, Z., Wang, W., Xie, E., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P., Lu, T.: Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers (2022), https://arxiv.org/abs/2109.03814

  24. [24] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common objects in context (2015), https://arxiv.org/abs/1405.0312

  25. [25] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., Lin, D.: MMBench: Is your multi-modal model an all-around player? (2024), https://arxiv.org/abs/2307.06281

  26. [26] Lorenz, J., Pest, A., Kienzle, D., Ludwig, K., Lienhart, R.: A fair ranking and new model for panoptic scene graph generation (2024), https://arxiv.org/abs/2407.09216

  27. [27] Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: UniTok: A unified tokenizer for visual generation and understanding (2025), https://arxiv.org/abs/2502.20321

  28. [28] Mañas, O., Krojer, B., Agrawal, A.: Improving automatic VQA evaluation using large language models (2024), https://arxiv.org/abs/2310.02567

  29. [29] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019), https://arxiv.org/abs/1908.10084

  30. [30] Shi, Y., Dong, Y., Ding, Y., Wang, Y., Zhu, X., Zhou, S., Liu, W., Tian, H., Wang, R., Wang, H., Liu, Z., Zeng, B., Chen, R., Wang, Q., Zhang, Z., Chen, X., Tong, C., Li, B., Fu, C., Liu, Q., Wang, H., Yang, W., Zhang, Y., Wan, P., Zhang, Y.F., Liu, Z.: RealUnify: Do unified models truly benefit from unification? A comprehensive benchmark (2025), https://a...

  31. [31] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  32. [32] Wang, C., Chen, Y., Hu, Z., Chen, D., Chen, W., Wiegreffe, S., Zhou, T.: Quantifying the gap between understanding and generation within unified multimodal models (2026), https://arxiv.org/abs/2602.02140

  33. [33] Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: MaX-DeepLab: End-to-end panoptic segmentation with mask transformers (2021), https://arxiv.org/abs/2012.00759

  34. [34] Wang, X., Liu, J., Huang, C., Yu, X., Wang, Z., Sun, X., Wu, J., Yuille, A., Barsoum, E., Liu, Z.: XModBench: Benchmarking cross-modal capabilities and consistency in omni-language models (2025), https://arxiv.org/abs/2510.15148

  35. [35] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  36. [36] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., Liu, Z., Xia, Z., Li, C., Deng, H., Wang, J., Luo, K., Zhang, B., Lian, D., Wang, X., Wang, Z., Huang, T., Liu, Z.: OmniGen2: Exploration to advanced multimodal generation (2025), https://arxiv.org/abs/2506.18871

  37. [37] Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation (2025), https://arxiv.org/abs/2408.12528

  38. [38] Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models (2025), https://arxiv.org/abs/2506.15564

  39. [39] Xie, W., Zhang, Y.F., Fu, C., Shi, Y., Nie, B., Chen, H., Zhang, Z., Wang, L., Tan, T.: MME-Unify: A comprehensive benchmark for unified multimodal understanding and generation models (2025), https://arxiv.org/abs/2504.03641

  40. [40] Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5410–5419 (2017)

  41. [41] Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: MMaDA: Multimodal large diffusion language models (2025), https://arxiv.org/abs/2505.15809

  42. [42] Yu, Q., Wang, H., Kim, D., Qiao, S., Collins, M., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: CMT-DeepLab: Clustering mask transformers for panoptic segmentation (2022), https://arxiv.org/abs/2206.08948

  43. [43] Yu, Q., Wang, H., Qiao, S., Collins, M., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: kMaX-DeepLab: k-means mask transformer (2023), https://arxiv.org/abs/2207.04044

  44. [44] Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural Motifs: Scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5831–5840 (2018)

  45. [45] Zhang, W., Pang, J., Chen, K., Loy, C.C.: K-Net: Towards unified image segmentation (2021), https://arxiv.org/abs/2106.14855

  46. [46] Zhao, S., Zhang, X., Guo, J., Hu, J., Duan, L., Fu, M., Chng, Y.X., Wang, G.H., Chen, Q.G., Xu, Z., Luo, W., Zhang, K.: Unified multimodal understanding and generation models: Advances, challenges, and opportunities (2026), https://arxiv.org/abs/2505.02567

  47. [47] Zou, K., Huang, Z., Dong, Y., Tian, S., Zheng, D., Liu, H., He, J., Liu, B., Qiao, Y., Liu, Z.: Uni-MMMU: A massive multi-discipline multimodal unified benchmark (2026), https://arxiv.org/abs/2510.13759