Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

Chonghuinan Wang; Chunwei Wang; Fan Li; Jiaqi Xu; Junwei Yang; Renjing Pei; Wangmeng Zuo; Wei Zhang; Xiaohe Wu; Yecong Wan

arxiv: 2606.30054 · v1 · pith:RXSMKM7Mnew · submitted 2026-06-29 · 💻 cs.CV

Chonghuinan Wang, Zhikai Chen, Chunwei Wang, Yecong Wan, Junwei Yang, Zhixin Wang, Wei Zhang, Jiaqi Xu, Renjing Pei, Xiaohe Wu, Fan Li, Wangmeng Zuo

Pith reviewed 2026-06-30 06:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified multimodal model · interleaved text-image generation · free-form generation · progressive training · self-adaptive objectives · style transfer · image decomposition · storytelling

The pith

ILLUME-X enables high-quality free-form interleaved text-image generation by improving data efficiency and stabilizing training for variable-length sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ILLUME-X as a unified multimodal model that generates sequences freely interleaving text and images. It relies on an expanded training data pipeline for interleaved content, a progressive training strategy using self-adaptive objectives suited to free-length token sequences, and a new evaluation method called ILScore. These elements are said to raise multimodal data efficiency and stabilize training, producing better results than earlier unified models on tasks such as style transfer, image decomposition, and storytelling. A reader would care because the work targets more autonomous generation of mixed text and image content without fixed format constraints.

Core claim

ILLUME-X comprises an expanded training data pipeline optimized for interleaved text-image generation, a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and the ILScore evaluation method; together these components enable high-quality free-form interleaved text-image generation and deliver outperformance over previous unified models on multiple tasks including style transfer, image decomposition, and storytelling.

What carries the argument

The progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, which stabilizes the multimodal training process and improves data efficiency.
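The abstract does not say what the self-adaptive objectives look like. As a rough illustration of why variable-length interleaved sequences can destabilize a naive objective, the sketch below averages per-token losses within each modality before mixing them, so a sequence with many image tokens does not drown out its text tokens. The function name `modality_balanced_loss` and the `image_weight` parameter are hypothetical, and this is one common balancing trick assumed for illustration, not the method used by ILLUME-X.

```python
from typing import Sequence


def modality_balanced_loss(token_losses: Sequence[float],
                           is_image: Sequence[bool],
                           image_weight: float = 0.5) -> float:
    """Average losses per modality, then mix the two means.

    Illustrative assumption only: the paper's actual objective is not
    described in the abstract.
    """
    if len(token_losses) != len(is_image):
        raise ValueError("token_losses and is_image must have the same length")
    text = [loss for loss, img in zip(token_losses, is_image) if not img]
    image = [loss for loss, img in zip(token_losses, is_image) if img]
    if not text and not image:
        raise ValueError("empty sequence")
    # Fall back to the single available modality for text-only or image-only input.
    if not image:
        return sum(text) / len(text)
    if not text:
        return sum(image) / len(image)
    text_mean = sum(text) / len(text)
    image_mean = sum(image) / len(image)
    return (1.0 - image_weight) * text_mean + image_weight * image_mean
```

With this normalization, a sequence containing one image token and a sequence containing hundreds contribute the same image term when their mean image loss is equal, which is the length-invariance property the sketch is meant to show.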

Load-bearing premise

The expanded training data pipeline and progressive training strategy with self-adaptive objectives are sufficient to improve multimodal data efficiency and stabilize training for free-length sequences.

What would settle it

A controlled comparison in which a prior unified model trained on the same expanded interleaved data but without the progressive self-adaptive strategy matches or exceeds ILLUME-X performance on ILScore across style transfer, decomposition, and storytelling tasks.
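The decision rule in that comparison can be written down in a few lines. The sketch below assumes each model has one aggregate score per task, that higher is better, and that the task keys and the helper name `ablation_matches_or_exceeds` are hypothetical labels chosen here. The numbers are placeholders for illustration only; they are not results from the paper.

```python
from typing import Mapping

TASKS = ("style_transfer", "image_decomposition", "storytelling")


def ablation_matches_or_exceeds(full: Mapping[str, float],
                                ablated: Mapping[str, float],
                                tolerance: float = 0.0) -> bool:
    """Return True if the ablated model is within `tolerance` of, or above,
    the full model on every task (higher score = better)."""
    missing = [t for t in TASKS if t not in full or t not in ablated]
    if missing:
        raise KeyError(f"missing scores for tasks: {missing}")
    return all(ablated[t] >= full[t] - tolerance for t in TASKS)


# Placeholder numbers for illustration only; NOT results from the paper.
full_scores = {"style_transfer": 7.0, "image_decomposition": 6.5, "storytelling": 7.2}
ablated_scores = {"style_transfer": 6.1, "image_decomposition": 6.6, "storytelling": 6.8}
print(ablation_matches_or_exceeds(full_scores, ablated_scores))
```

A `True` result on real, matched-data runs would undercut the claim that the progressive strategy is what carries the gains; a `False` result would leave the claim standing but would still need error bars to be persuasive.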

Figures

Figures reproduced from arXiv: 2606.30054 by Chonghuinan Wang, Chunwei Wang, Fan Li, Jiaqi Xu, Junwei Yang, Renjing Pei, Wangmeng Zuo, Wei Zhang, Xiaohe Wu, Yecong Wan, Zhikai Chen, Zhixin Wang.

**Figure 1.** Illustrative examples of ILLUME-X. The model handles interleaved text-image inputs and outputs, enabling cohesive multimodal understanding and generation.

**Figure 2.** The overall architecture of our ILLUME-X.

**Figure 3.** The overall architecture of our data pipeline.

**Figure 4.** Qualitative comparison with other methods.

**Limitations (Section 6, excerpt).** ILLUME-X is primarily trained and evaluated for interleaved text-image generation at a resolution of 512. Due to limitations imposed by the underlying model architecture and the finite context length, scaling both training and inference to resolutions of 1024 and above remains challenging. Consequently, the quality of high-resolution interleaved generation still leaves room for further impro…

**Figure 5.** Qualitative comparison of different CoT settings.

Original abstract

The advancement of generative AI models capable of producing text and image marks a critical step forward in the realm of multimodal intelligence, particularly for tasks involving the interleaving of both modalities. To advance this intelligence to the next stage, it is crucial for models to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) an objective and comprehensive evaluation method ILScore for interleaved text-image sequences. Notably, our ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks like style transfer, image decomposition and storytelling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ILLUME-X bundles a data pipeline, progressive training, and ILScore for interleaved text-image output, but the abstract supplies no numbers or methods so the outperformance claim stays untestable.

The main thing to know is that this paper names a model, ILLUME-X, and lists three pieces (an expanded interleaved data pipeline, progressive training with self-adaptive objectives for variable-length sequences, and a new evaluation called ILScore), then asserts that it beats prior unified models on style transfer, decomposition, and storytelling. That is the entire contribution visible so far.

What is actually new is the combination of those three elements aimed at free-form interleaving rather than fixed turn-taking. The progressive training idea and the self-adaptive objectives address a practical pain point: keeping training stable when sequence length varies. ILScore is presented as an objective way to score mixed text-image outputs, which could be useful if it turns out to correlate with human judgment.

The soft spot is obvious and not minor: the abstract states outperformance and component benefits but gives zero quantitative results, no baselines, no error bars, and no description of the experiments. Without those, the central claim cannot be checked. The weakest assumption in the abstract, that the data pipeline and training strategy are sufficient to deliver the gains, remains exactly that: an assumption. If the full paper contains controlled ablations and reproducible numbers, that changes the picture; right now it does not.

This is the kind of work that belongs in a reading group only if someone has already run the numbers and can walk through them. I would not cite it yet. A serious editor could send it to review because the problem is real and the framing is coherent, but the referees would need to see the missing evidence before any stronger judgment. If the experiments hold up, it is worth the time; if they do not, it is not.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces ILLUME-X, a unified multimodal model for free-form interleaved text-image generation. It proposes three components (an expanded training data pipeline optimized for interleaved data, a progressive training strategy with self-adaptive objectives for variable-length sequences, and the ILScore evaluation metric) and claims that these enable high-quality generation while outperforming prior unified models on tasks including style transfer, image decomposition, and storytelling.

Significance. If substantiated by rigorous experiments, the work could advance multimodal generation by addressing data efficiency and training stability for interleaved outputs. The ILScore metric represents a potential contribution if shown to be objective and well-correlated with human judgments, but the absence of any quantitative results, baselines, or ablation details in the provided text prevents assessment of whether the claimed improvements are real or meaningful.
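Whether a metric such as ILScore is "well-correlated with human judgments" is usually checked with a rank correlation between metric scores and human ratings on the same samples. The following is a minimal, dependency-free sketch of Spearman's rank correlation with average ranks for ties. The function names are hypothetical, and nothing here reflects how the paper itself validates ILScore.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    n = len(metric_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

A value near 1 would support the claim that the metric orders outputs the way human raters do; a value near 0 would suggest the metric measures something else.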

major comments (1)

[Abstract] The central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.
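For readers unsure what the requested error bars would involve, a percentile bootstrap over per-sample scores is one standard option. The sketch below is a generic illustration using only the Python standard library. The function name `bootstrap_ci` and its defaults are assumptions made here, and the paper may report uncertainty differently or not at all.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lower_index], means[upper_index]
```

Calling `bootstrap_ci(per_sample_scores)` for each model and task gives an interval that can be reported next to the mean, which makes it possible to judge whether a claimed gap over a baseline exceeds sampling noise.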

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify the manuscript. The primary concern raised is the lack of quantitative support in the abstract, which we address directly below by committing to revisions that incorporate key experimental details.

Point-by-point responses

Referee: [Abstract] The central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.

Authors: We agree that the abstract as written states performance claims without supporting numbers, which limits immediate evaluability. The full manuscript contains dedicated experimental sections reporting quantitative results (including baseline comparisons on style transfer, image decomposition, and storytelling), dataset sizes from the expanded interleaved data pipeline, ablation studies on the progressive training strategy, and details on ILScore correlation with human judgments. To address this, we will revise the abstract to include specific metrics (e.g., relative improvements over prior unified models and key dataset statistics) while preserving conciseness. This change will be incorporated in the next manuscript version.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical model with no derivation chain

The paper presents ILLUME-X as an empirical multimodal model whose central claims rest on an expanded data pipeline, progressive training with self-adaptive objectives, and a new ILScore metric. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described components. The outperformance statements are framed as experimental results rather than logical reductions to inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described in sufficient detail to enumerate.

pith-pipeline@v0.9.1-grok · 5725 in / 1049 out tokens · 29478 ms · 2026-06-30T06:02:15.471720+00:00 · methodology

[67]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...
[68]

Coherence: How well the text and images work together to convey a unified message or story
[69]

Content Accuracy: The factual correctness of both textual information and visual elements
[71]

Logicality: The steps or processes included in the content, whether the arrangement of text and images follows a logical sequence, and whether it guides the reader's understanding
[72]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...
[73]

Cross-Image Content Consistency (Entity Stability) Evaluate whether the same or related entities maintain stability across images: - Identity Consistency: Whether the same character or object is recognizable as the same entity across images - Attribute Continuity: Whether key attributes (color, quantity, state) are reasonably maintained or changed across ...
[74]

Each dimension uses a 1-10 scale

Style Consistency Evaluate visual style coordination across images: - Artistic Style Uniformity: Whether artistic styles and rendering methods are consistent (e.g., all realistic or all cartoonish) - Color Harmony: Whether color tones, saturation, contrast are coordinated ## Scoring Criteria … Score based on severity and frequency of inconsistencies. Each...
[75]

Coherence: Grammatical correctness, Expression naturalness, Lexical accuracy, Sentence variety
[76]

Content Accuracy: Information completeness, Factual accuracy
[77]

Relevance and Responsiveness: How well the generated content addresses the given query
[78]

Logicality: The steps or processes follows a logical sequence, and it guides the reader's understanding
[79]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structure: {{ "coherence": {{ "scor...
[80]

<think>A white ceramic plate filled with small, vibrant orange baby carrots with a few small green leafy tops still attached

A white ceramic plate is filled with a large quantity of small, bright orange baby carrots. <think>A white ceramic plate filled with small, vibrant orange baby carrots with a few small green leafy tops still attached. </think>
[81]

A white ceramic plate holds a small pile of vibrant green peas
[82]

Children are playing at the beach

A small bunch of deep purple beets is shown. The beets have green leafy tops and some root hairs. A plate of chopped carrots and green peas with some beet greens on the side and a small white bowl on the counter. Current Step: Identify and select the carrots from the plate. Current Step: Pick the green peas from the plate. Current Step: Extract the beet g...
[83]

A white banner with Japanese text and colorful foot and handprints
[84]

A woman with short brown hair, wearing a gray hoodie, a blue shirt, black pants, and a yellow sash
[85]

A colorful kite flying in the background

Showing first 80 references.

[1] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., Sanghai, S.: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In: EMNLP (2023)

[2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

[3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv 2502.13923 (2025)

[4] Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

[5] Chen, D., Chen, R., Pu, S., Liu, Z., Wu, Y., Chen, C., Liu, B., Huang, Y., Wan, Y., Zhou, P., et al.: Interleaved scene graphs for interleaved text-and-image generation assessment. arXiv preprint arXiv:2411.17188 (2024)

[6] Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)

[7] Chen, W., Li, L., Yang, Y., Wen, B., Yang, F., Gao, T., Wu, Y., Chen, L.: Comm: A coherent interleaved image-text dataset for multimodal understanding and generation. In: CVPR (2025)

[8] Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2501.17811 (2025)

[9] Chern, E., Su, J., Ma, Y., Liu, P.: Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135 (2024)

[10] Chow, W., Pan, J., Liang, Y., Zhou, M., Song, X., Jia, L., Zhang, S., Tang, S., Li, J., Zhang, F., et al.: Weave: Unleashing and benchmarking the in-context interleaved comprehension and generation. arXiv preprint arXiv:2511.11434 (2025)

[11] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

[12] Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native Multimodal Models are World Learners. arXiv 2510.26583 (2025)

[13]

In: ICML (2023)

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A.P., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme Ruiz, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., Steenkiste, S.V., Elsayed, G.F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J....

[14] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging Properties in Unified Multimodal Pretraining. arXiv 2505.14683 (2025)

[15] Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., Kong, X., Zhang, X., Ma, K., Yi, L.: DreamLLM: Synergistic Multimodal Comprehension and Creation. In: ICLR (2024)

[16] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

[17] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In: ICML (2024)

[18] Ge, Y., Zhao, S., Zeng, Z., Ge, Y., Li, C., Wang, X., Shan, Y.: Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218 (2023)

[19] Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. In: NeurIPS (2023)

[20] He, X., Wei, L., Ouyang, J., Liao, M., Xie, L., Tian, Q.: EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture. arXiv 2512.04810 (2025)

[21] Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment (2024)

[22] Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., Xu, H.: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement. arXiv 2504.01934 (2025)

[23] Jiao, Y., Qiu, H., Jie, Z., Chen, S., Chen, J., Ma, L., Jiang, Y.G.: Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. In: CVPR (2025)

[24] Kou, S., Jin, J., Liu, Z., Liu, C., Ma, Y., Jia, J., Chen, Q., Jiang, P., Deng, Z.: Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads. In: ICML (2025)

[25] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025)

[26] Li, A., Wang, C., Fu, D., Yue, K., Cai, Z., Zhu, W.B., Liu, O., Guo, P., Neiswanger, W., Huang, F., et al.: Zebra-cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746 (2025)

[27] Li, B., Wang, Z., Li, F., Xu, J., Guo, J., Pei, R., Li, X., Chen, Z.: Colorflux: A structure-color decoupling framework for old photo colorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)

[28] Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., Kuen, J.: Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244 (2025)

[29] Li, Z., Li, H., Shi, Y., Farimani, A.B., Kluger, Y., Yang, L., Wang, P.: Dual diffusion for unified image generation and understanding. In: CVPR (2025)

[30] Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., Huang, W.: Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation. arXiv 2505.05472 (2025)

[31] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let’s verify step by step. In: ICLR (2024)

[32] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

[33] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow Matching for Generative Modeling. In: ICLR (2023)

[34] Liu, X., Wei, Y., Liu, M., Lin, X., Ren, P., Xie, X., Zuo, W.: Smartcontrol: Enhancing controlnet for handling rough visual conditions. In: European Conference on Computer Vision. pp. 1–17. Springer (2024)

[35] Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Yu, X., Zhao, L., Wang, Y., Liu, J., Ruan, C.: JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. arXiv 2411.07975 (2025)

[36]

Transactions on Machine Learning Research (2024)

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning Robust Visual Feat...

[37] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)

[38] Qu, L., Cheng, F., Yang, Z., Zhao, Q., Lin, S., Shi, Y., Li, Y., Wang, W., Chua, T.S., Jiang, L.: VINCIE: Unlocking in-context image editing from video. In: ICLR (2026)

[39] Shazeer, N.: GLU Variants Improve Transformer. arXiv 2002.05202 (2020)

[40] Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv preprint arXiv:2505.23606 (2025)

[41] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

[42] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing (2024)

[43] Team, C.: Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv 2405.09818 (2025)

[44] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2024)

[45] Teed, Z., Deng, J.: RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In: ECCV (2020)

[46] Tian, C., Zhu, X., Xiong, Y., Wang, W., Chen, Z., Wang, W., Chen, Y., Lu, L., Lu, T., Zhou, J., Li, H., Qiao, Y., Dai, J.: MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer. arXiv (2024)

[47] Wang, C., Chen, Z., Wei, Y., Jiang, T., Wu, X., Li, F., Zuo, W., Yao, H.: Creval: An automated interpretable evaluation for creative image manipulation under complex instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)

[48] Wang, X., Zhang, Z., Zhang, H., Lin, Z., Zhou, Y., Liu, Q., Zhang, S., Li, Y., Liu, S., Zheng, H., et al.: Hbridge: H-shape bridging of heterogeneous experts for unified multimodal understanding and generation. arXiv preprint arXiv:2511.20520 (2025)

[49] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

[50] Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., Luo, P.: Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. arXiv 2410.13848 (2024)

[51] Wu, C., Lei, L., Li, F., Guo, C., Kong, D., Qin, X., Wang, Z., Cheng, M., Li, C.: Yose: You only select essential tokens for efficient dit-based video object removal. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)

[52] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., Liu, Z., Xia, Z., Li, C., Deng, H., Wang, J., Luo, K., Zhang, B., Lian, D., Wang, X., Wang, Z., Huang, T., Liu, Z.: OmniGen2: Exploration to Advanced Multimodal Generation. arXiv 2506.18871 (2025)

[53] Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)

[54] Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564 (2025)

[55] Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)

[56] Yang, S., Ge, Y., Li, Y., Chen, Y., Ge, Y., Shan, Y., Chen, Y.C.: Seed-story: Multimodal long story generation with large language model. In: ICCV Workshops. pp. 1850–1860 (October 2025)

[57] Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment. In: CVPR (2022)

[58] Zhang, B., Sennrich, R.: Root Mean Square Layer Normalization. In: NeurIPS (2019)

[59] Zhang, C., Wang, J., Wang, Y., Liang, Y., Yang, X., Li, Z., Huang, H., Li, X.: Unimodel: A visual-only framework for unified multimodal understanding and generation. arXiv preprint arXiv:2511.16917 (2025)

[60]

arXiv 2601.02204 (2026)

Zhang, H., Qu, L., Liu, Y., Chen, H., Song, Y., Dong, Y., Sun, S., Li, X., Wang, X., Jiang, Y., Ye, H., Chen, B., Gao, Y., Liu, P., Liu, A., Yang, Z., Deng, Q., Xing, L., Liu, J., Wang, Z., Zhou, Y., Liu, M., Zhang, Y., He, Q., Hu, X., Qi, Z., Shao, J., Fu, Z., Wang, S., Chen, F., Chai, X., Wu, Z., Wang, Y., Yuan, Z., Du, D.K., Wu, X.: NextFlow: Unified S...

[61] Zheng, K., He, X., Wang, X.E.: Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239 (2023)

[62]

In: ICLR (2025). Supplementary Material: In this supplementary material, we provide additional explanation and experimental results to further support the main paper.

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. In: ICLR (2025) ILLUME-X 1 Supplementary Material In this supplementary material, we provide additional explanation and experi- mental results to further support...

[63]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...

[64] Coherence: How well the text and images work together to convey a unified message or story

[65] Content Accuracy: The factual correctness of both textual information and visual elements

[66] Logicality: The steps or processes included in the content, whether the arrangement of text and images follows a logical sequence, and whether it guides the reader's understanding

[67]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...

[68]

Cross-Image Content Consistency (Entity Stability) Evaluate whether the same or related entities maintain stability across images: - Identity Consistency: Whether the same character or object is recognizable as the same entity across images - Attribute Continuity: Whether key attributes (color, quantity, state) are reasonably maintained or changed across ...

[69]

Each dimension uses a 1-10 scale

Style Consistency Evaluate visual style coordination across images: - Artistic Style Uniformity: Whether artistic styles and rendering methods are consistent (e.g., all realistic or all cartoonish) - Color Harmony: Whether color tones, saturation, contrast are coordinated ## Scoring Criteria … Score based on severity and frequency of inconsistencies. Each...

[70] Coherence: Grammatical correctness, Expression naturalness, Lexical accuracy, Sentence variety

[71] Content Accuracy: Information completeness, Factual accuracy

[72] Relevance and Responsiveness: How well the generated content addresses the given query

[73] Logicality: The steps or processes follows a logical sequence, and it guides the reader's understanding

[74]

coherence

Creativity and Originality: The model's ability to generate novel and imaginative content. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structure: {{ "coherence": {{ "scor...

[75] A white ceramic plate is filled with a large quantity of small, bright orange baby carrots. <think>A white ceramic plate filled with small, vibrant orange baby carrots with a few small green leafy tops still attached. </think>

[76] A white ceramic plate holds a small pile of vibrant green peas

[77]

Children are playing at the beach

A small bunch of deep purple beets is shown. The beets have green leafy tops and some root hairs. A plate of chopped carrots and green peas with some beet greens on the side and a small white bowl on the counter. Current Step: Identify and select the carrots from the plate. Current Step: Pick the green peas from the plate. Current Step: Extract the beet g...

[78] A white banner with Japanese text and colorful foot and handprints

[79] A woman with short brown hair, wearing a gray hoodie, a blue shirt, black pants, and a yellow sash

[80] A colorful kite flying in the background