Pith · machine review for the scientific record

arxiv: 2505.05472 · v2 · submitted 2025-05-08 · 💻 cs.CV

Recognition: 3 Lean theorem links

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords interleaved multi-modal generation · omni foundation model · text-to-image generation · multi-modal understanding · zero-shot image editing · autoregressive diffusion fusion · unified multi-modal model

The pith

Mogao is a single model that generates arbitrary sequences mixing text and images by fusing autoregressive and diffusion components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mogao as a unified causal framework for processing and producing interleaved multi-modal sequences where text and images can appear in any order. It does so through specific architectural choices that blend the sequential strengths of autoregressive text models with the quality of diffusion-based image synthesis. These changes, paired with training on a large curated dataset, let the model handle standard tasks like text-to-image generation and understanding while also producing coherent mixed outputs. A sympathetic reader would care because the approach points toward foundation models that can sustain natural, back-and-forth creation across modalities rather than treating each type separately.

Core claim

Mogao integrates a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance into a single causal model that processes arbitrary interleaved text-image sequences. The result is state-of-the-art performance on multi-modal understanding and text-to-image generation, together with high-quality, coherent interleaved outputs and emergent zero-shot image editing and compositional generation.

What carries the argument

The Mogao architecture itself: a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which together combine autoregressive text modeling with diffusion image synthesis over arbitrary interleaved sequences.
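For orientation, the classifier-free guidance named in that list is, in the standard Ho and Salimans formulation, a linear extrapolation between conditional and unconditional predictions. One plausible multi-modal reading, with a separate guidance scale per conditioning stream, is sketched below; the per-modality decomposition and the symbols s_T and s_I are illustrative assumptions, not the paper's equation.

```latex
% Standard classifier-free guidance (Ho & Salimans):
\tilde{\epsilon}_\theta(x_t, c) =
  \epsilon_\theta(x_t, \varnothing)
  + s\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)

% One possible multi-modal extension, with text condition c_T and image
% condition c_I guided at separate scales s_T and s_I (an assumption,
% not the paper's formulation):
\tilde{\epsilon}_\theta(x_t, c_T, c_I) =
  \epsilon_\theta(x_t, \varnothing, \varnothing)
  + s_T\,\bigl(\epsilon_\theta(x_t, c_T, \varnothing)
             - \epsilon_\theta(x_t, \varnothing, \varnothing)\bigr)
  + s_I\,\bigl(\epsilon_\theta(x_t, c_T, c_I)
             - \epsilon_\theta(x_t, c_T, \varnothing)\bigr)
```

Setting either scale to zero recovers guidance on the remaining modality alone, which is why a nested decomposition of this kind is a natural candidate for interleaved inputs.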

If this is right

  • The model supports zero-shot image editing through text instructions without task-specific training.
  • It enables compositional generation where text descriptions and image elements can be mixed freely in one output.
  • Performance reaches state-of-the-art levels on both understanding benchmarks and standard text-to-image tasks.
  • The same framework can be scaled as a practical omni-modal foundation model for future unified systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fusion techniques could be tested on additional modalities such as video or audio to check if interleaved generation generalizes.
  • The emphasis on interleaved sequences may reduce inconsistencies that appear when separate models are chained for mixed outputs.
  • Curated joint text-image datasets appear central to unlocking the emergent editing and composition behaviors.

Load-bearing premise

The listed architectural changes and training strategy actually succeed in merging the strengths of autoregressive text models and diffusion image models so the system works on any order of text and images.

What would settle it

A direct comparison showing that Mogao produces less coherent or lower-quality outputs than separate specialized models when required to generate long alternating text-and-image sequences.

Original abstract

Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents Mogao, a unified causal framework for omni-modal interleaved generation of text and images. It incorporates a deep-fusion architecture, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance to combine autoregressive text modeling with diffusion-based image synthesis. Trained on a large-scale in-house dataset, the model claims state-of-the-art results in multi-modal understanding and text-to-image generation, plus emergent zero-shot capabilities in image editing and compositional generation for arbitrary interleaved sequences.

Significance. If the empirical claims hold, the work would represent a meaningful advance in unified multi-modal models by demonstrating practical handling of arbitrary text-image interleaving and emergent editing behaviors. The emphasis on architectural components that bridge autoregressive and diffusion paradigms could inform scalable omni-modal systems, provided the contributions are isolated and the coherence claims are quantitatively verified.

major comments (1)
  1. [§3.3] Interleaved Rotary Position Embeddings: The central claim that the listed improvements enable coherent arbitrary interleaved sequences rests on the interleaved rotary position embeddings correctly modeling relative positions across text tokens and entire images. Rotary embeddings operate on 1D sequences; when an image is inserted mid-sequence, the embedding must either flatten the image tokens or apply a joint scheme. The manuscript does not provide an equation or ablation that isolates how intra-image 2D relations or cross-modal distances are preserved, leaving open the risk of positional aliasing in long or deeply interleaved outputs. This directly affects the reported emergent editing and compositional results.
minor comments (1)
  1. [Abstract] The assertion of SOTA performance and emergent capabilities would be strengthened by including at least one key quantitative metric (e.g., FID or accuracy delta versus baselines) to allow readers to gauge the scale of improvement without consulting the full experiments section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and have revised the paper to improve clarity on the technical details of our position embedding scheme.

Point-by-point responses
  1. Referee: [§3.3] Interleaved Rotary Position Embeddings: The central claim that the listed improvements enable coherent arbitrary interleaved sequences rests on the interleaved rotary position embeddings correctly modeling relative positions across text tokens and entire images. Rotary embeddings operate on 1D sequences; when an image is inserted mid-sequence, the embedding must either flatten the image tokens or apply a joint scheme. The manuscript does not provide an equation or ablation that isolates how intra-image 2D relations or cross-modal distances are preserved, leaving open the risk of positional aliasing in long or deeply interleaved outputs. This directly affects the reported emergent editing and compositional results.

    Authors: We appreciate the referee highlighting the need for greater precision in describing our interleaved rotary position embeddings. In Mogao, image tokens from each vision encoder are flattened into the causal 1D sequence while intra-image 2D structure is preserved by adding 2D rotary offsets (row and column indices) to the base rotary angles within each image block; cross-modal relative distances are then captured by the sequential ordering and causal attention mask across the full interleaved sequence. We acknowledge that the initial manuscript omitted an explicit equation for this joint scheme. In the revised version we will insert the precise formulation in §3.3 and add a targeted ablation (varying the 2D offset component) to the supplementary material. These additions directly address the concern about positional aliasing and provide quantitative support for the observed coherence in zero-shot editing and compositional generation.

    Revision: yes
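The position-id bookkeeping implied by the rebuttal (flattened image tokens carrying 2D rotary offsets inside one causal 1D sequence) can be sketched as follows. This is a hypothetical illustration, not the paper's code: the function name, the (base, row, col) triple layout, and the choice to give a whole image a single causal slot are all assumptions.

```python
# Hypothetical sketch of interleaved rotary position ids: text tokens advance
# a 1D causal position, while each image block keeps one shared base position
# plus per-token (row, col) offsets feeding the 2D rotary angles.

def interleaved_position_ids(segments):
    """segments: list of ("text", n_tokens) or ("image", rows, cols).

    Returns one (base, row, col) triple per token. Text tokens use
    (p, 0, 0); every token of an image inserted at causal slot p uses
    (p, r, c), so cross-modal distance is carried by the shared base
    position and intra-image structure by the 2D offsets.
    """
    ids, pos = [], 0
    for seg in segments:
        if seg[0] == "text":
            for _ in range(seg[1]):
                ids.append((pos, 0, 0))
                pos += 1  # each text token advances the causal position
        else:  # ("image", rows, cols)
            _, rows, cols = seg
            for r in range(rows):
                for c in range(cols):
                    ids.append((pos, r, c))
            pos += 1  # assumption: the whole image occupies one causal slot
    return ids

# A short text prompt, then a 2x2 image, then one more text token:
ids = interleaved_position_ids([("text", 3), ("image", 2, 2), ("text", 1)])
```

Under this layout, two tokens of the same image share a base position and differ only in their 2D offsets, while a text token after the image sits exactly one causal step from every token of that image. That is one way the cross-modal relative distances the referee asked about could stay well defined.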

Circularity Check

0 steps flagged

No circularity: empirical architecture and training results

full rationale

The paper describes an empirical multi-modal model whose central claims rest on architectural choices (deep-fusion, dual encoders, interleaved rotary embeddings, multi-modal CFG) trained on a curated dataset and validated through experiments. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-referential definitions. Self-citations, if present, are not load-bearing for any claimed derivation; performance and emergent capabilities are reported as outcomes of training and evaluation rather than tautological reductions. The chain of support therefore rests on external benchmarks rather than self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions plus several new architectural choices whose effectiveness is asserted rather than derived from first principles.

free parameters (1)
  • model scale and training hyperparameters
    Size of the model, learning rates, and guidance scales are chosen to achieve reported performance.
axioms (1)
  • domain assumption: Causal autoregressive modeling extends naturally to mixed text-image token sequences
    The paper assumes the next-token prediction framework remains valid when tokens alternate between text and image modalities.
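The axiom above can be made concrete with a toy loss split: positions holding text tokens are scored with cross-entropy, positions holding image latents with a denoising objective, all under one causal ordering. Everything here (the names, the per-position modality tag, the MSE stand-in for a diffusion loss) is an illustrative assumption, not the paper's training code.

```python
import math

# Toy mixed-modality sequence loss: one causal stream in which each position
# is either a text token (cross-entropy against a vocab distribution) or an
# image latent (mean squared error against predicted noise, standing in for
# a diffusion objective). All names and shapes are illustrative assumptions.

def mixed_sequence_loss(positions):
    """positions: list of dicts with key 'kind' in {'text', 'image'}.

    text:  {'kind': 'text', 'probs': [...], 'target': i}  -> -log probs[i]
    image: {'kind': 'image', 'pred': [...], 'noise': [...]} -> mean sq. error
    """
    total = 0.0
    for p in positions:
        if p["kind"] == "text":
            total += -math.log(p["probs"][p["target"]])
        else:
            diffs = [(a - b) ** 2 for a, b in zip(p["pred"], p["noise"])]
            total += sum(diffs) / len(diffs)
    return total / len(positions)

# One text position followed by one image position whose noise prediction
# is exact, so only the text term contributes:
loss = mixed_sequence_loss([
    {"kind": "text", "probs": [0.1, 0.9], "target": 1},
    {"kind": "image", "pred": [0.5, 0.5], "noise": [0.5, 0.5]},
])
```

The point of the sketch is only that a single next-position objective can be defined over an alternating sequence; whether gradients from the two terms cooperate at scale is exactly what the axiom assumes.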

pith-pipeline@v0.9.0 · 5539 in / 1292 out tokens · 38869 ms · 2026-05-17T07:18:40.194337+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  2. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  3. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  4. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  5. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  6. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  7. LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

    cs.CV 2026-02 unverdicted novelty 6.0

    LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

  8. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  9. DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    cs.CV 2025-11 conditional novelty 6.0

    DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...

  10. HunyuanImage 3.0 Technical Report

    cs.CV 2025-09 accept novelty 6.0

    HunyuanImage 3.0 delivers an 80B-parameter MoE model unifying multimodal understanding and generation that matches prior state-of-the-art results while being fully open-sourced.

  11. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  12. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  13. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  14. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  15. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  16. Seedance 2.0: Advancing Video Generation for World Complexity

    cs.CV 2026-04 unverdicted novelty 3.0

    Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.

  17. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 36 internal anchors

  1. [1]

    Pixtral 12B

    Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b.arXiv preprint arXiv:2410.07073, 2024

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advancesin neural information processing systems, 35:23716–23736, 2022

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Improving Image Generation with Better Captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  6. [6]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  7. [7]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners.Advancesin Neural Information Processing Systems, 33:1877–1901, 2020

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  9. [9]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  10. [10]

    PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024

  11. [11]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Mmdetection: Open mmlab detection toolbox and b...

  12. [12]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  13. [13]

    Masked-Attention Mask Transformer for Universal Image Segmentation

    Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022. URL https://arxiv.org/abs/2112.01527

  14. [14]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model.arXiv preprint arXiv:2402.03766, 2024

  15. [15]

    Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252– 2274, 2023

  16. [16]

    Autoregressive Video Generation without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024

  17. [17]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  18. [18]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstInternational Conference on Machine Learning, 2024

  19. [19]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  20. [20]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  21. [21]

    Planting a Seed of Vision in Large Language Model

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023

  22. [22]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer.arXiv preprint arXiv:2310.01218, 2023

  23. [23]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprintarXiv:2404.14396, 2024

  24. [24]

    GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

  25. [25]

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Huang. Seedream 2.0: A native chinese-en...

  26. [26]

    Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

    Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis.arXiv preprint arXiv:2412.04431, 2024

  27. [27]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  28. [28]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  29. [29]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

  30. [30]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  31. [31]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  32. [32]

    FLUX

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2023

  33. [33]

    Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model

    Hugo Laurençon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics

  34. [34]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. InInternational Conference on Machine Learning, pages 18893–18912. PMLR, 2023

  35. [35]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  36. [36]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  37. [37]

    mplug: Effective and efficient vision-language learning by cross-modal skip-connections

    Chenliang Li et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. EMNLP, 2022

  38. [38]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024

  39. [39]

    Autoregressive Image Generation without Vector Quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advancesin Neural Information Processing Systems, 37:56424–56445, 2025

  40. [40]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023

  41. [42]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding.arXiv preprint arXiv:2405.08748, 2024

  42. [43]

    Dual Diffusion for Unified Image Generation and Understanding

    Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding.arXiv preprint arXiv:2501.00289, 2024

  43. [44]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  44. [45]

    MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

    Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Aghajanyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts.arXiv preprint arXiv:2407.21770, 2024

  45. [46]

    Evaluating Text-to-Visual Generation with Image-to-Text Generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation.arXiv preprint arXiv:2404.01291, 2024

  46. [47]

    Stiv: Scalable text and image conditioned video generation

    Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation. arXiv preprint arXiv:2412.07730, 2024

  47. [48]

    World model on million-length video and language with ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv e-prints, pages arXiv–2402, 2024

  48. [49]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  49. [50]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  50. [51]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  51. [52]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  52. [53]

    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    Shayne Longpre, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023

  53. [54]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  54. [55]

    Unitok: A unified tokenizer for visual generation and understanding

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025

  55. [56]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024

  56. [57]

    Midjourney v6.1

    Midjourney. Midjourney v6.1. https://www.midjourney.com/, 2024

  57. [58]

    Gpt-4o system card

    OpenAI: Aaron Hurst, Adam Lerer, et al. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276

  58. [59]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  59. [60]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  60. [61]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  61. [62]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

  62. [63]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  63. [64]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  64. [65]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  65. [66]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  66. [67]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  67. [68]

    Generative modelling with inverse heat dissipation

    Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. arXiv preprint arXiv:2206.13397, 2022

  68. [69]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  69. [70]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  70. [71]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  71. [72]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv e-prints, pages arXiv–2312, 2023

  72. [73]

    Emu: Generative Pretraining in Multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

  73. [74]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  74. [75]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2025

  75. [76]

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

  76. [77]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  77. [78]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  78. [79]

    Illume: Illuminating your llms to see, draw, and self-enhance

    Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. arXiv preprint arXiv:2412.06673, 2024

  79. [80]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL. arXiv preprint arXiv:2504.11455, 2025

  80. [81]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

Showing first 80 references.