
arxiv: 2606.30054 · v1 · pith:RXSMKM7Mnew · submitted 2026-06-29 · 💻 cs.CV

Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

Pith reviewed 2026-06-30 06:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · interleaved text-image generation · free-form generation · progressive training · self-adaptive objectives · style transfer · image decomposition · storytelling

The pith

ILLUME-X enables high-quality free-form interleaved text-image generation by improving data efficiency and stabilizing training for variable-length sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ILLUME-X as a unified multimodal model that generates sequences freely interleaving text and images. It relies on an expanded training data pipeline for interleaved content, a progressive training strategy using self-adaptive objectives suited to free-length token sequences, and a new evaluation method called ILScore. These elements are said to raise multimodal data efficiency and stabilize training, producing better results than earlier unified models on tasks such as style transfer, image decomposition, and storytelling. A reader would care because the work targets more autonomous generation of mixed text and image content without fixed format constraints.

Core claim

ILLUME-X comprises an expanded training data pipeline optimized for interleaved text-image generation, a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and the ILScore evaluation method; together these components enable high-quality free-form interleaved text-image generation and deliver outperformance over previous unified models on multiple tasks including style transfer, image decomposition, and storytelling.

What carries the argument

The progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, which stabilizes the multimodal training process and improves data efficiency.
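The summary above does not say what form the self-adaptive objectives take. As a purely illustrative sketch, assume one concern with free-length sequences is that image tokens can vastly outnumber text tokens, so a plain per-token average lets the image loss dominate. One generic remedy is to average the negative log-likelihood per modality before mixing. The function name `modality_balanced_loss` and the `image_weight` parameter are hypothetical and are not taken from the paper.

```python
from math import log


def modality_balanced_loss(token_probs, is_image, image_weight=0.5):
    """Mix per-modality mean negative log-likelihoods.

    token_probs: probability the model assigned to each target token.
    is_image:    True where the target token is an image token.
    This is an assumed, generic balancing scheme, not the paper's objective.
    """
    if len(token_probs) != len(is_image):
        raise ValueError("token_probs and is_image must have the same length")

    text_nll = [-log(p) for p, img in zip(token_probs, is_image) if not img]
    image_nll = [-log(p) for p, img in zip(token_probs, is_image) if img]
    if not text_nll and not image_nll:
        raise ValueError("sequence is empty")

    # A sequence with a single modality falls back to that modality's mean.
    if not image_nll:
        return sum(text_nll) / len(text_nll)
    if not text_nll:
        return sum(image_nll) / len(image_nll)

    text_loss = sum(text_nll) / len(text_nll)
    image_loss = sum(image_nll) / len(image_nll)
    return (1.0 - image_weight) * text_loss + image_weight * image_loss
```

Under this scheme, adding more image tokens of the same quality leaves the loss unchanged, which is the property a length-robust objective would need. Whether ILLUME-X does anything similar cannot be determined from the abstract.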

Load-bearing premise

The expanded training data pipeline and progressive training strategy with self-adaptive objectives are sufficient to improve multimodal data efficiency and stabilize training for free-length sequences.

What would settle it

A controlled comparison in which a prior unified model trained on the same expanded interleaved data but without the progressive self-adaptive strategy matches or exceeds ILLUME-X performance on ILScore across style transfer, decomposition, and storytelling tasks.

Figures

Figures reproduced from arXiv: 2606.30054 by Chonghuinan Wang, Chunwei Wang, Fan Li, Jiaqi Xu, Junwei Yang, Renjing Pei, Wangmeng Zuo, Wei Zhang, Xiaohe Wu, Yecong Wan, Zhikai Chen, Zhixin Wang.

Figure 1
Figure 1: Illustrative examples of ILLUME-X. The model handles interleaved text-image inputs and outputs, enabling cohesive multimodal understanding and generation.
Figure 2
Figure 2: The overall architecture of our ILLUME-X.
Figure 3
Figure 3: The overall architecture of our data pipeline.
Figure 4
Figure 4: Qualitative comparison with other methods.
Figure 5
Figure 5: Qualitative comparison of different CoT settings.

Limitations (Section 6 of the paper, captured alongside the figure captions): ILLUME-X is primarily trained and evaluated for interleaved text-image generation at a resolution of 512. Due to limitations imposed by the underlying model architecture and the finite context length, scaling both training and inference to resolutions of 1024 and above remains challenging. Consequently, the quality of high-resolution interleaved generation still leaves room for further impro…
Original abstract

The advancement of generative AI models capable of producing text and image marks a critical step forward in the realm of multimodal intelligence, particularly for tasks involving the interleaving of both modalities. To advance this intelligence to the next stage, it is crucial for models to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) an objective and comprehensive evaluation method ILScore for interleaved text-image sequences. Notably, our ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks like style transfer, image decomposition and storytelling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces ILLUME-X, a unified multimodal model for free-form interleaved text-image generation. It proposes three components—an expanded training data pipeline optimized for interleaved data, a progressive training strategy with self-adaptive objectives for variable-length sequences, and the ILScore evaluation metric—and claims these enable high-quality generation while outperforming prior unified models on tasks including style transfer, image decomposition, and storytelling.

Significance. If substantiated by rigorous experiments, the work could advance multimodal generation by addressing data efficiency and training stability for interleaved outputs. The ILScore metric represents a potential contribution if shown to be objective and well-correlated with human judgments, but the absence of any quantitative results, baselines, or ablation details in the provided text prevents assessment of whether the claimed improvements are real or meaningful.
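One common way to test whether an automatic metric such as ILScore tracks human judgments is a rank correlation between metric scores and human ratings on the same samples. The sketch below computes Spearman's rank correlation with average ranks for ties, using only the standard library. The function names are illustrative, and nothing here reflects how the authors actually validated ILScore.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 would mean the metric orders samples much as human raters do. A value near 0 would undercut the claim that the metric is objective and comprehensive.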

major comments (1)
  1. [Abstract] Abstract: the central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.
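The error bars the referee asks for could be produced in several ways. A paired bootstrap over per-sample scores is one standard option when two models are evaluated on the same prompts. The sketch below is a generic illustration. The function name, its defaults, and the assumption that per-sample scores are available for both models are not drawn from the manuscript.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of (scores_a - scores_b).

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty lists of equal length")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    resampled_means = []
    for _ in range(n_resamples):
        # Resample paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, resampled_means[lower_index], resampled_means[upper_index]
```

If the resulting interval for the difference between two models excludes zero, the reported improvement is unlikely to be an artifact of which test samples were drawn.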

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify the manuscript. The primary concern raised is the lack of quantitative support in the abstract, which we address directly below by committing to revisions that incorporate key experimental details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.

    Authors: We agree that the abstract as written states performance claims without supporting numbers, which limits immediate evaluability. The full manuscript contains dedicated experimental sections reporting quantitative results (including baseline comparisons on style transfer, image decomposition, and storytelling), dataset sizes from the expanded interleaved data pipeline, ablation studies on the progressive training strategy, and details on ILScore correlation with human judgments. To address this, we will revise the abstract to include specific metrics (e.g., relative improvements over prior unified models and key dataset statistics) while preserving conciseness. This change will be incorporated in the next manuscript version.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical model with no derivation chain

full rationale

The paper presents ILLUME-X as an empirical multimodal model whose central claims rest on an expanded data pipeline, progressive training with self-adaptive objectives, and a new ILScore metric. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described components. The outperformance statements are framed as experimental results rather than logical reductions to inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described in sufficient detail to enumerate.

pith-pipeline@v0.9.1-grok · 5725 in / 1049 out tokens · 29478 ms · 2026-06-30T06:02:15.471720+00:00 · methodology

