pith · machine review for the scientific record

arxiv: 2505.09568 · v1 · submitted 2025-05-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 23:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords unified multimodal models · image understanding and generation · diffusion transformer · CLIP image features · sequential pretraining · instruction tuning · open source models

The pith

BLIP3-o uses a diffusion transformer to produce CLIP image features and sequential pretraining to create unified models strong at both image understanding and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how best to combine image understanding and generation in a single multimodal model. It tests autoregressive and diffusion approaches and settles on generating rich CLIP features with a diffusion transformer rather than using VAE representations. This choice improves both training speed and output quality. The work also shows that training first for understanding and then for generation keeps the first skill intact while building the second. A new curated dataset supports the generation side, resulting in models that lead on standard benchmarks for both capabilities.

Core claim

By employing a diffusion transformer to generate semantically rich CLIP image features instead of conventional VAE-based representations, and by applying a sequential pretraining strategy that first trains on image understanding and then on image generation, the authors develop BLIP3-o models that achieve superior performance across popular benchmarks for both tasks while preserving understanding capabilities.

What carries the argument

A diffusion transformer that generates semantically rich CLIP image features, replacing VAE representations to increase training efficiency and generative quality in unified multimodal models.
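
To make this mechanism concrete, here is a minimal sketch of a transformer trained with a flow-matching objective whose regression targets are CLIP image features rather than VAE latents. It follows the generic rectified-flow formulation of refs [17] and [19] in the reference graph below, not necessarily the paper's exact architecture; CLIPFeatureDiT, the conditioning scheme, and all hyperparameters are hypothetical.

    # Hypothetical sketch, not the paper's implementation: a transformer
    # that predicts flow-matching velocities for CLIP feature tokens.
    import torch
    import torch.nn as nn

    class CLIPFeatureDiT(nn.Module):
        def __init__(self, dim=1024, depth=4, heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
            self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                          nn.Linear(dim, dim))
            self.head = nn.Linear(dim, dim)

        def forward(self, x_t, t, cond):
            # x_t: noisy CLIP features (B, N, dim); t: timesteps (B,);
            # cond: conditioning tokens from the autoregressive backbone (B, M, dim).
            h = torch.cat([cond, x_t + self.time_mlp(t[:, None, None])], dim=1)
            h = self.blocks(h)
            return self.head(h[:, cond.shape[1]:])  # velocity per feature token

    def flow_matching_loss(model, clip_feats, cond):
        # Linear interpolation between Gaussian noise and the CLIP targets;
        # the regression target is the constant velocity (data - noise).
        noise = torch.randn_like(clip_feats)
        t = torch.rand(clip_feats.shape[0], device=clip_feats.device)
        x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * clip_feats
        v_pred = model(x_t, t, cond)
        return ((v_pred - (clip_feats - noise)) ** 2).mean()

The design point is that regression happens in a semantic embedding space, so the generator models semantics rather than pixels; mapping generated features back to an image is a separate decoding step that the abstract does not detail.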

If this is right

  • Unified multimodal models can achieve high performance in both understanding and generation without sacrificing one for the other.
  • Diffusion-based generation of CLIP features offers practical advantages over VAE approaches in terms of efficiency and quality.
  • Sequential pretraining starting with understanding tasks allows strong generation abilities to be added later (see the training-schedule sketch after this list).
  • The BLIP3o-60k dataset provides a high-quality resource for instruction-tuning image generation models.
  • Full open-sourcing of models, code, and datasets enables broader community progress on unified multimodal systems.
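
A minimal sketch of the sequential recipe referenced in the third bullet, under two stated assumptions: understanding_loss and encode_prompt are hypothetical APIs, and freezing the backbone in stage 2 is one plausible mechanism for keeping understanding intact (the abstract does not specify which parameters stay trainable).

    # Hypothetical sketch of sequential pretraining: understanding first,
    # then generation. Reuses flow_matching_loss from the sketch above.
    import torch

    def sequential_pretrain(backbone, diffusion_head, und_loader, gen_loader):
        # Stage 1: image understanding (captioning / VQA style objectives).
        opt = torch.optim.AdamW(backbone.parameters(), lr=1e-4)
        for batch in und_loader:
            loss = backbone.understanding_loss(batch)  # hypothetical API
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Stage 2: image generation. The backbone is frozen here (an assumed
        # mechanism for preserving stage-1 skills); only the diffusion head
        # that generates CLIP features is updated.
        for p in backbone.parameters():
            p.requires_grad_(False)
        opt = torch.optim.AdamW(diffusion_head.parameters(), lr=1e-4)
        for batch in gen_loader:
            cond = backbone.encode_prompt(batch["prompt"])  # hypothetical API
            loss = flow_matching_loss(diffusion_head, batch["clip_feats"], cond)
            opt.zero_grad()
            loss.backward()
            opt.step()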

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This design could be adapted to incorporate additional modalities like text or video generation with similar sequential strategies.
  • Using CLIP features as the target for diffusion might generalize to other semantic embedding spaces for more controllable generation.
  • Open models like this could accelerate development of applications that require both analyzing and synthesizing images in one system.

Load-bearing premise

That training first on image understanding and then on image generation will preserve the understanding performance while building strong generation ability.

What would settle it

Measurements on understanding benchmarks after the full sequential training pipeline, compared against models trained only on understanding or trained on both tasks simultaneously: a clear drop would refute the load-bearing premise, while parity would support it.
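
Operationally, the settling measurement is a table of deltas. A minimal sketch, where evaluate is a placeholder for a real benchmark harness and the benchmark names are drawn from the reference graph below (VQA, MME, MMBench, SEED-Bench, MM-Vet):

    # Hypothetical sketch: score the same model before and after the
    # generation stage and report per-benchmark deltas.
    UNDERSTANDING_BENCHMARKS = ["VQAv2", "MME", "MMBench", "SEED-Bench", "MM-Vet"]

    def understanding_deltas(ckpt_before, ckpt_after, evaluate):
        deltas = {}
        for bench in UNDERSTANDING_BENCHMARKS:
            before = evaluate(ckpt_before, bench)  # placeholder harness
            after = evaluate(ckpt_after, bench)
            deltas[bench] = after - before  # near zero supports the premise;
                                            # clearly negative refutes it
        return deltas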

original abstract

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models-first training on image understanding and subsequently on image generation-offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces BLIP3-o, a family of unified multimodal models for both image understanding and generation. It conducts a study on autoregressive and diffusion-based approaches, proposes a diffusion transformer to generate semantically rich CLIP image features (instead of VAE representations) for improved efficiency and quality, advocates a sequential pretraining strategy (image understanding followed by image generation), curates the BLIP3o-60k instruction-tuning dataset via GPT-4o prompting, and reports state-of-the-art results across popular benchmarks for both task types while fully open-sourcing models, code, weights, scripts, and datasets.

Significance. If the empirical results hold, the work is significant for providing a fully open, high-performing unified multimodal model that balances understanding and generation. The open release of all artifacts (including pretraining and instruction-tuning datasets) enables direct reproducibility and extension by the community. The exploration of CLIP-feature diffusion and sequential training offers practical insights into scalable unified architectures.

major comments (2)
  1. [§4] Training Strategy: The claim that sequential pretraining 'preserves image understanding capability while developing strong image generation ability' is central to the recipe, yet the manuscript provides limited quantitative evidence (e.g., understanding benchmark deltas before vs. after the generation stage). Without explicit ablation tables showing no degradation on tasks such as VQA or captioning, this assumption remains under-supported for the SOTA unified claim.
  2. [§3.1] Model Architecture: The superiority of the diffusion transformer on CLIP features over conventional VAE-based representations is asserted for both training efficiency and generative quality, but the paper lacks direct head-to-head metrics (e.g., FID scores, training FLOPs, or convergence curves) in the main results or ablations to isolate this design choice as load-bearing for the reported benchmark gains.
minor comments (3)
  1. [Results tables] Tables reporting benchmark results should include error bars or multiple-run statistics to allow assessment of statistical significance, especially when claiming superiority 'across most' benchmarks.
  2. [Dataset section] The curation process for the BLIP3o-60k dataset (prompting details, filtering criteria, diversity metrics) is described at a high level; expanding this in the appendix would strengthen reproducibility claims.
  3. [§3] Notation for the diffusion objective and CLIP feature projection should be clarified with an explicit equation or diagram to avoid ambiguity when comparing to prior unified models (a generic form is written out after this list).
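
For reference, one generic way to write the objective the third comment asks for: a flow-matching loss over CLIP features, following refs [17] and [19] in the reference graph. This is a standard formulation, not necessarily the paper's exact notation; z denotes the CLIP features of image x, and c the conditioning from the autoregressive backbone.

    % Generic flow-matching objective on CLIP features (refs [17], [19]);
    % the paper's exact notation may differ.
    \mathcal{L}_{\mathrm{FM}}
      = \mathbb{E}_{\,\mathbf{z} = E_{\mathrm{CLIP}}(\mathbf{x}),\;
                    \boldsymbol{\epsilon} \sim \mathcal{N}(0, I),\;
                    t \sim \mathcal{U}[0, 1]}
        \bigl\| v_\theta(\mathbf{z}_t, t, \mathbf{c})
                - (\mathbf{z} - \boldsymbol{\epsilon}) \bigr\|_2^2,
    \qquad
    \mathbf{z}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{z}.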

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation for minor revision, and the recognition of the work's significance and open-sourcing contributions. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

point-by-point responses
  1. Referee: [§4] Training Strategy: The claim that sequential pretraining 'preserves image understanding capability while developing strong image generation ability' is central to the recipe, yet the manuscript provides limited quantitative evidence (e.g., understanding benchmark deltas before vs. after the generation stage). Without explicit ablation tables showing no degradation on tasks such as VQA or captioning, this assumption remains under-supported for the SOTA unified claim.

    Authors: We agree that explicit before-and-after quantitative comparisons would provide stronger support for the sequential pretraining claim. Although the final BLIP3-o models achieve state-of-the-art performance on both understanding and generation benchmarks (indicating that understanding capabilities are retained), we will add ablation tables in the revised manuscript. These tables will report performance deltas on representative understanding tasks such as VQA and image captioning immediately prior to and following the image generation pretraining stage, thereby offering direct evidence of minimal degradation. revision: yes

  2. Referee: [§3.1] Model Architecture: The superiority of the diffusion transformer on CLIP features over conventional VAE-based representations is asserted for both training efficiency and generative quality, but the paper lacks direct head-to-head metrics (e.g., FID scores, training FLOPs, or convergence curves) in the main results or ablations to isolate this design choice as load-bearing for the reported benchmark gains.

    Authors: We acknowledge that isolating the contribution of the diffusion transformer on CLIP features versus VAE representations would benefit from more targeted comparative metrics. The manuscript reports overall efficiency gains and generative quality improvements within the unified framework, but we will incorporate direct head-to-head evaluations in the revised version. These will include FID scores, training FLOPs, and convergence curves comparing the two representation approaches under controlled settings, either in the main text or as supplementary material, to more clearly substantiate this design choice. revision: yes
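
As a concrete shape for that promised evaluation, a minimal sketch using torchmetrics' FrechetInceptionDistance (a real API; everything else here is illustrative). Running both representation variants on the same prompts against the same reference images isolates the representation choice.

    # Hypothetical sketch of the head-to-head FID evaluation promised above.
    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def fid_score(real_images, fake_images):
        # Both tensors: float in [0, 1], shape (N, 3, H, W).
        fid = FrechetInceptionDistance(feature=2048, normalize=True)
        fid.update(real_images, real=True)
        fid.update(fake_images, real=False)
        return fid.compute().item()

    # Run the CLIP-feature and VAE variants on the same prompt set against
    # the same reference images, then compare:
    #   fid_clip = fid_score(real_images, images_from_clip_variant)
    #   fid_vae  = fid_score(real_images, images_from_vae_variant)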

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper is an empirical modeling contribution that reports architecture choices, a sequential pretraining recipe, and a curated dataset, then evaluates the resulting models on public benchmarks for image understanding and generation. No equations, first-principles derivations, or 'predictions' appear that reduce by construction to quantities defined inside the paper itself. The central performance claims are grounded in external test sets and open-sourced artifacts rather than self-referential fits or self-citation chains. The sequential pretraining strategy is presented as an empirical observation, not a tautological result. This is the normal, non-circular case for an open empirical systems paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard machine-learning components (CLIP encoders, diffusion transformers, autoregressive objectives) and the assumption that CLIP features are semantically rich enough to serve as generation targets. No new physical axioms or invented entities are introduced; free parameters are the usual training hyperparameters whose values are not detailed in the abstract.

free parameters (1)
  • training hyperparameters
    Standard learning rates, batch sizes, and schedule choices required for any large-scale model training; not enumerated in the abstract.
axioms (1)
  • domain assumption: CLIP image features provide semantically rich targets suitable for diffusion-based generation
    Invoked when the authors replace conventional VAE representations with CLIP features generated by a diffusion transformer.
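
For concreteness, the axiom can be grounded by looking at what such features are. A minimal sketch using Hugging Face transformers; the checkpoint choice is illustrative, and the paper's actual CLIP encoder may differ.

    # Hypothetical sketch: extract the kind of CLIP patch features assumed
    # as generation targets. The checkpoint choice is illustrative.
    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    ckpt = "openai/clip-vit-large-patch14"
    processor = CLIPImageProcessor.from_pretrained(ckpt)
    encoder = CLIPVisionModel.from_pretrained(ckpt).eval()

    pixels = processor(images=Image.open("example.jpg"),
                       return_tensors="pt").pixel_values
    with torch.no_grad():
        feats = encoder(pixel_values=pixels).last_hidden_state  # (1, 257, 1024)
    # These semantic tokens are the assumed diffusion targets; contrast with
    # low-level VAE latents (e.g., shape (1, 4, 64, 64) for a 512x512 image).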

pith-pipeline@v0.9.0 · 5623 in / 1478 out tokens · 72003 ms · 2026-05-11T23:28:58.221911+00:00 · methodology


Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Representation Fréchet Loss for Visual Generation

    cs.CV 2026-04 unverdicted novelty 8.0

    Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...

  2. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  3. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  4. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  5. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  6. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  7. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  8. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  9. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  10. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  11. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  12. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  13. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  14. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  15. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  16. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  17. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  18. Self-Adversarial One Step Generation via Condition Shifting

    cs.CV 2026-04 unverdicted novelty 6.0

    APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

  19. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  20. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  21. Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

    cs.CV 2026-04 unverdicted novelty 6.0

    Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...

  22. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  23. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  24. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  25. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  26. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  27. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  28. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  29. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  30. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  31. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  32. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  33. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  34. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  35. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  36. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 35 Pith papers · 16 internal anchors

  1. [1]

    Introducing 4o image generation

    Introducing 4o image generation. https://openai.com/index/introducing-4o-image-generation/, 2025

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021

  4. [4]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  5. [5]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023

  6. [6]

    Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

  7. [7]

    Seed-x: Multimodal models with unified multi-granularity comprehension and generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  8. [8]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  9. [9]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  10. [10]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  11. [11]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  12. [12]

    Auto-encoding variational bayes, 2013

    Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

  13. [13]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  14. [14]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  15. [15]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  16. [17]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  17. [18]

    Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  18. [19]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. CoRR, abs/2209.03003, 2022

  19. [20]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  20. [21]

    Token-shuffle: Towards high-resolution image generation with autoregressive models

    Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, et al. Token-shuffle: Towards high-resolution image generation with autoregressive models. arXiv preprint arXiv:2504.17789, 2025

  21. [22]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025

  22. [23]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  23. [24]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  24. [25]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

  25. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  26. [27]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014

  27. [28]

    Llamafusion: Adapting pretrained language models for multimodal generation

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Llamafusion: Adapting pretrained language models for multimodal generation. arXiv preprint arXiv:2412.15188, 2024

  28. [29]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  29. [30]

    Journeydb: A benchmark for generative image understanding

    Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Informati...

  30. [31]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  31. [32]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  32. [33]

    Metamorph: Multimodal understanding and generation via instruction tuning

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

  33. [34]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  34. [35]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

  35. [36]

    Vila-u: A unified foundation model integrating visual understanding and generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

  36. [37]

    Grok 1.5v: The next generation of AI

    x.ai. Grok 1.5v: The next generation of AI. https://x.ai/blog/grok-1.5v, 2023. Accessed: 2024-07-26

  37. [38]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  38. [39]

    xGen-MM (BLIP-3): A family of open large multimodal models

    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models. arXiv preprint arXiv:2408.08872, 2024

  39. [40]

    Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation

    Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation. arXiv preprint arXiv:2504.02782, 2025

  40. [41]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

  41. [42]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  42. [43]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

  43. [44]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583, 2024