Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
Pith reviewed 2026-06-30 06:02 UTC · model grok-4.3
The pith
ILLUME-X enables high-quality free-form interleaved text-image generation by improving data efficiency and stabilizing training for variable-length sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ILLUME-X comprises an expanded training data pipeline optimized for interleaved text-image generation, a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and the ILScore evaluation method; together these components enable high-quality free-form interleaved text-image generation and deliver outperformance over previous unified models on multiple tasks including style transfer, image decomposition, and storytelling.
What carries the argument
The progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, which stabilizes the multimodal training process and improves data efficiency.
Load-bearing premise
The expanded training data pipeline and progressive training strategy with self-adaptive objectives are sufficient to improve multimodal data efficiency and stabilize training for free-length sequences.
What would settle it
A controlled comparison in which a prior unified model trained on the same expanded interleaved data but without the progressive self-adaptive strategy matches or exceeds ILLUME-X performance on ILScore across style transfer, decomposition, and storytelling tasks.
Figures
read the original abstract
The advancement of generative AI models capable of producing text and image marks a critical step forward in the realm of multimodal intelligence, particularly for tasks involving the interleaving of both modalities. To advance this intelligence to the next stage, it is crucial for models to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) an objective and comprehensive evaluation method ILScore for interleaved text-image sequences. Notably, our ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks like style transfer, image decomposition and storytelling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ILLUME-X, a unified multimodal model for free-form interleaved text-image generation. It proposes three components—an expanded training data pipeline optimized for interleaved data, a progressive training strategy with self-adaptive objectives for variable-length sequences, and the ILScore evaluation metric—and claims these enable high-quality generation while outperforming prior unified models on tasks including style transfer, image decomposition, and storytelling.
Significance. If substantiated by rigorous experiments, the work could advance multimodal generation by addressing data efficiency and training stability for interleaved outputs. The ILScore metric represents a potential contribution if shown to be objective and well-correlated with human judgments, but the absence of any quantitative results, baselines, or ablation details in the provided text prevents assessment of whether the claimed improvements are real or meaningful.
major comments (1)
- [Abstract] Abstract: the central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.
Simulated Author's Rebuttal
We thank the referee for their review and the opportunity to clarify the manuscript. The primary concern raised is the lack of quantitative support in the abstract, which we address directly below by committing to revisions that incorporate key experimental details.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claims of outperformance across multiple tasks and the sufficiency of the three components for improving data efficiency and stabilizing training are stated without any quantitative results, error bars, dataset sizes, baseline comparisons, or experimental details, rendering the claims unevaluable from the manuscript text.
Authors: We agree that the abstract as written states performance claims without supporting numbers, which limits immediate evaluability. The full manuscript contains dedicated experimental sections reporting quantitative results (including baseline comparisons on style transfer, image decomposition, and storytelling), dataset sizes from the expanded interleaved data pipeline, ablation studies on the progressive training strategy, and details on ILScore correlation with human judgments. To address this, we will revise the abstract to include specific metrics (e.g., relative improvements over prior unified models and key dataset statistics) while preserving conciseness. This change will be incorporated in the next manuscript version. revision: yes
Circularity Check
No significant circularity; empirical model with no derivation chain
full rationale
The paper presents ILLUME-X as an empirical multimodal model whose central claims rest on an expanded data pipeline, progressive training with self-adaptive objectives, and a new ILScore metric. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described components. The outperformance statements are framed as experimental results rather than logical reductions to inputs by construction, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: EMNLP (2023) 16 C
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., Sanghai, S.: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In: EMNLP (2023) 16 C. Wang et al
2023
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv 2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
HunyuanImage 3.0 Technical Report
Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
arXiv preprint arXiv:2411.17188 (2024)
Chen, D., Chen, R., Pu, S., Liu, Z., Wu, Y., Chen, C., Liu, B., Huang, Y., Wan, Y., Zhou, P., et al.: Interleaved scene graphs for interleaved text-and-image generation assessment. arXiv preprint arXiv:2411.17188 (2024)
-
[6]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
In: CVPR (2025)
Chen, W., Li, L., Yang, Y., Wen, B., Yang, F., Gao, T., Wu, Y., Chen, L.: Comm: A coherent interleaved image-text dataset for multimodal understanding and gen- eration. In: CVPR (2025)
2025
-
[8]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2501.17811 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
arXiv preprint arXiv:2407.06135 (2024)
Chern, E., Su, J., Ma, Y., Liu, P.: Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135 (2024)
-
[10]
arXiv preprint arXiv:2511.11434 (2025)
Chow,W.,Pan,J.,Liang,Y.,Zhou,M.,Song,X.,Jia,L.,Zhang,S.,Tang,S.,Li,J., Zhang, F., et al.: Weave: Unleashing and benchmarking the in-context interleaved comprehension and generation. arXiv preprint arXiv:2511.11434 (2025)
-
[11]
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Emu3.5: Native Multimodal Models are World Learners
Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native Multimodal Models are World Learners. arXiv 2510.26583 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
In: ICML (2023) ILLUME-X 17
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A.P., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme Ruiz, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., Steenkiste, S.V., Elsayed, G.F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J....
2023
-
[14]
Emerging Properties in Unified Multimodal Pretraining
Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging Properties in Unified Multimodal Pretraining. arXiv 2505.14683 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
In: ICLR (2024)
Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., Kong, X., Zhang, X., Ma, K., Yi, L.: DreamLLM: Synergistic Multimodal Comprehension and Creation. In: ICLR (2024)
2024
-
[16]
In: Forty-first international conference on machine learning (2024)
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)
2024
-
[17]
In: ICML (2024)
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In: ICML (2024)
2024
-
[18]
arXiv preprint arXiv:2310.01218 (2023)
Ge, Y., Zhao, S., Zeng, Z., Ge, Y., Li, C., Wang, X., Shan, Y.: Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218 (2023)
-
[19]
In: NeurIPS (2023)
Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. In: NeurIPS (2023)
2023
-
[20]
He, X., Wei, L., Ouyang, J., Liao, M., Xie, L., Tian, Q.: EMMA: Efficient Multi- modal Understanding, Generation, and Editing with a Unified Architecture. arXiv 2512.04810 (2025)
-
[21]
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment (2024)
2024
-
[22]
Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., Xu, H.: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement. arXiv 2504.01934 (2025)
-
[23]
In: CVPR (2025)
Jiao, Y., Qiu, H., Jie, Z., Chen, S., Chen, J., Ma, L., Jiang, Y.G.: Unitoken: Harmo- nizing multimodal understanding and generation through unified visual encoding. In: CVPR (2025)
2025
-
[24]
In: ICML (2025)
Kou, S., Jin, J., Liu, Z., Liu, C., Ma, Y., Jia, J., Chen, Q., Jiang, P., Deng, Z.: Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads. In: ICML (2025)
2025
-
[25]
Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025)
2025
-
[26]
arXiv preprint arXiv:2507.16746 (2025)
Li, A., Wang, C., Fu, D., Yue, K., Cai, Z., Zhu, W.B., Liu, O., Guo, P., Neiswanger, W., Huang, F., et al.: Zebra-cot: A dataset for interleaved vision language reason- ing. arXiv preprint arXiv:2507.16746 (2025)
-
[27]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
Li, B., Wang, Z., Li, F., Xu, J., Guo, J., Pei, R., Li, X., Chen, Z.: Colorflux: A structure-color decoupling framework for old photo colorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
2026
-
[28]
Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., Kuen, J.: Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244 (2025)
-
[29]
In: CVPR (2025) 18 C
Li,Z.,Li,H.,Shi,Y.,Farimani,A.B.,Kluger,Y.,Yang,L.,Wang,P.:Dualdiffusion for unified image generation and understanding. In: CVPR (2025) 18 C. Wang et al
2025
-
[30]
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., Huang, W.: Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation. arXiv 2505.05472 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
In: ICLR (2024)
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let’s verify step by step. In: ICLR (2024)
2024
-
[32]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
In: ICLR (2023)
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow Matching for Generative Modeling. In: ICLR (2023)
2023
-
[34]
In: European Conference on Computer Vision
Liu, X., Wei, Y., Liu, M., Lin, X., Ren, P., Xie, X., Zuo, W.: Smartcontrol: En- hancing controlnet for handling rough visual conditions. In: European Conference on Computer Vision. pp. 1–17. Springer (2024)
2024
-
[35]
arXiv preprint arXiv:2411.07975 (2024)
Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., yu, X., Zhao, L., Wang, Y., Liu, J., Ruan, C.: JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. arXiv 2411.07975 (2025)
-
[36]
Transactions on Machine Learning Research (2024)
Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning Robust Visual Feat...
2024
-
[37]
In: ICLR (2024)
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution im- age synthesis. In: ICLR (2024)
2024
-
[38]
In: ICLR (2026)
Qu, L., Cheng, F., Yang, Z., Zhao, Q., Lin, S., Shi, Y., Li, Y., Wang, W., Chua, T.S., Jiang, L.: VINCIE: Unlocking in-context image editing from video. In: ICLR (2026)
2026
-
[39]
GLU Variants Improve Transformer
Shazeer, N.: GLU Variants Improve Transformer. arXiv 2002.05202 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[40]
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv preprint arXiv:2505.23606 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Neurocomputing (2024)
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: RoFormer: Enhanced trans- former with Rotary Position Embedding. Neurocomputing (2024)
2024
-
[43]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Team, C.: Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv 2405.09818 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arxiv 2023. arXiv preprint arXiv:2312.11805 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
In: ECCV (2020)
Teed, Z., Deng, J.: RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In: ECCV (2020)
2020
-
[46]
arXiv (2024)
Tian, C., Zhu, X., Xiong, Y., Wang, W., Chen, Z., Wang, W., Chen, Y., Lu, L., Lu, T., Zhou, J., Li, H., Qiao, Y., Dai, J.: MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer. arXiv (2024)
2024
-
[47]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
Wang, C., Chen, Z., Wei, Y., Jiang, T., Wu, X., Li, F., Zuo, W., Yao, H.: Creval: An automated interpretable evaluation for creative image manipulation under complex ILLUME-X 19 instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
2026
-
[48]
Wang, X., Zhang, Z., Zhang, H., Lin, Z., Zhou, Y., Liu, Q., Zhang, S., Li, Y., Liu, S., Zheng, H., et al.: Hbridge: H-shape bridging of heterogeneous experts for unified multimodal understanding and generation. arXiv preprint arXiv:2511.20520 (2025)
-
[49]
Emu3: Next-Token Prediction is All You Need
Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., Luo, P.: Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. arXiv 2410.13848 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
Wu, C., Lei, L., Li, F., Guo, C., Kong, D., Qin, X., Wang, Z., Cheng, M., Li, C.: Yose: You only select essential tokens for efficient dit-based video object removal. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2026)
2026
-
[52]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., Liu, Z., Xia, Z., Li, C., Deng, H., Wang, J., Luo, K., Zhang, B., Lian, D., Wang, X., Wang, Z., Huang, T., Liu, Z.: OmniGen2: Exploration to Advanced Multimodal Generation. arXiv 2506.18871 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Show-o2: Improved Native Unified Multimodal Models
Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv preprint arXiv:2506.15564 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
MMaDA: Multimodal Large Diffusion Language Models
Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Mul- timodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
In: ICCV Workshops
Yang, S., Ge, Y., Li, Y., Chen, Y., Ge, Y., Shan, Y., Chen, Y.C.: Seed-story: Multimodal long story generation with large language model. In: ICCV Workshops. pp. 1850–1860 (October 2025)
2025
-
[57]
In: CVPR (2022)
Yang,S.,Wu,T.,Shi,S.,Lao,S.,Gong,Y.,Cao,M.,Wang,J.,Yang,Y.:MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment. In: CVPR (2022)
2022
-
[58]
In: NeurIPS (2019)
Zhang, B., Sennrich, R.: Root Mean Square Layer Normalization. In: NeurIPS (2019)
2019
-
[59]
arXiv preprint arXiv:2511.16917 (2025)
Zhang, C., Wang, J., Wang, Y., Liang, Y., Yang, X., Li, Z., Huang, H., Li, X.: Unimodel: A visual-only framework for unified multimodal understanding and gen- eration. arXiv preprint arXiv:2511.16917 (2025)
-
[60]
Zhang, H., Qu, L., Liu, Y., Chen, H., Song, Y., Dong, Y., Sun, S., Li, X., Wang, X., Jiang, Y., Ye, H., Chen, B., Gao, Y., Liu, P., Liu, A., Yang, Z., Deng, Q., Xing, L., Liu, J., Wang, Z., Zhou, Y., Liu, M., Zhang, Y., He, Q., Hu, X., Qi, Z., Shao, J., Fu, Z., Wang, S., Chen, F., Chai, X., Wu, Z., Wang, Y., Yuan, Z., Du, D.K., Wu, X.: NextFlow: Unified S...
-
[61]
arXiv preprint arXiv:2310.02239 (2023)
Zheng, K., He, X., Wang, X.E.: Minigpt-5: Interleaved vision-and-language gener- ation via generative vokens. arXiv preprint arXiv:2310.02239 (2023)
-
[62]
In: ICLR (2025) ILLUME-X 1 Supplementary Material In this supplementary material, we provide additional explanation and experi- mental results to further support the main paper
Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. In: ICLR (2025) ILLUME-X 1 Supplementary Material In this supplementary material, we provide additional explanation and experi- mental results to further support...
2025
-
[67]
coherence
Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...
-
[68]
Coherence: How well the text and images work together to convey a unified message or story
-
[69]
Content Accuracy: The factual correctness of both textual information and visual elements
-
[71]
Logicality: The steps or processes included in the content, whether the arrangement of text and images follows a logical sequence, and whether it guides the reader's understanding
-
[72]
coherence
Creativity and Originality: The model's ability to generate novel and imaginative content across both text and images. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structu...
-
[73]
Cross-Image Content Consistency (Entity Stability) Evaluate whether the same or related entities maintain stability across images: - Identity Consistency: Whether the same character or object is recognizable as the same entity across images - Attribute Continuity: Whether key attributes (color, quantity, state) are reasonably maintained or changed across ...
-
[74]
Each dimension uses a 1-10 scale
Style Consistency Evaluate visual style coordination across images: - Artistic Style Uniformity: Whether artistic styles and rendering methods are consistent (e.g., all realistic or all cartoonish) - Color Harmony: Whether color tones, saturation, contrast are coordinated ## Scoring Criteria … Score based on severity and frequency of inconsistencies. Each...
-
[75]
Coherence: Grammatical correctness, Expression naturalness, Lexical accuracy, Sentence variety
-
[76]
Content Accuracy: Information completeness, Factual accuracy
-
[77]
Relevance and Responsiveness: How well the generated content addresses the given query
-
[78]
Logicality: The steps or processes follows a logical sequence, and it guides the reader's understanding
-
[79]
coherence
Creativity and Originality: The model's ability to generate novel and imaginative content. Output Requirement: Please output in JSON format, including scores for each dimension (on a scale of 1-10) and a final overall score (on a scale of 1-10). Also provide brief explanations for each score. The JSON should follow this structure: {{ "coherence": {{ "scor...
-
[80]
<think>A white ceramic plate filled with small, vibrant orange baby carrots with a few small green leafy tops still attached
A white ceramic plate is filled with a large quantity of small, bright orange baby carrots. <think>A white ceramic plate filled with small, vibrant orange baby carrots with a few small green leafy tops still attached. </think>
-
[81]
A white ceramic plate holds a small pile of vibrant green peas
-
[82]
Children are playing at the beach
A small bunch of deep purple beets is shown. The beets have green leafy tops and some root hairs. A plate of chopped carrots and green peas with some beet greens on the side and a small white bowl on the counter. Current Step: Identify and select the carrots from the plate. Current Step: Pick the green peas from the plate. Current Step: Extract the beet g...
-
[83]
A white banner with Japanese text and colorful foot and handprints
-
[84]
A woman with short brown hair, wearing a gray hoodie, a blue shirt, black pants, and a yellow sash
-
[85]
A colorful kite flying in the background
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.