Pith: machine review for the scientific record.

arXiv: 2506.03147 · v4 · submitted 2025-06-03 · cs.CV · cs.AI · cs.CL


UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation


Pith reviewed 2026-05-12 17:28 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL
keywords unified vision models · semantic encoders · image manipulation · text-to-image generation · multimodal large language models · contrastive encoders · vision-language understanding

The pith

UniWorld-V1 builds a unified visual model around semantic encoders from multimodal language models to handle understanding, generation, manipulation, and perception together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that semantic features drawn from existing multimodal large language models and contrastive encoders can serve as the backbone for a single framework that performs image understanding, text-to-image generation, manipulation, and perception. This approach matters because prior unified models have struggled with practical manipulation tasks even when they handled language and generation well. The authors report that their model reaches competitive results across these tasks after training on only 2.7 million examples, and they release the full code, weights, and data. A sympathetic reader sees a potential route to versatile vision systems that avoid the data and compute costs of separate specialist models.

Core claim

UniWorld-V1 is a unified generative framework that extracts high-resolution semantic features from multimodal large language models and contrastive semantic encoders rather than relying on VAEs, enabling strong performance on image understanding, generation, manipulation, and perception when trained on 2.7 million examples.

What carries the argument

Semantic features extracted from multimodal large language models and contrastive encoders, used as the shared representation for both understanding and pixel-level generation or editing operations.

If this is right

  • A single set of weights can support both high-level vision-language reasoning and low-level image editing without task-specific heads.
  • Training data requirements for unified models drop to the low millions while still covering generation and manipulation.
  • Open release of the full training pipeline allows direct replication and extension by other groups.
  • Perception and manipulation tasks no longer require separate VAE-based pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-feature approach could be tested on video or 3D data to see whether the data-efficiency advantage holds beyond still images.
  • If semantic encoders already encode fine-grained spatial details, future work could measure exactly which layers contribute most to editing accuracy.
  • Replacing VAEs might reduce artifacts in generated images that stem from reconstruction loss, though this remains untested here.

Load-bearing premise

The reported performance gains are assumed to come from using semantic encoders instead of VAEs, and that design choice rests in turn on the presumption that GPT-4o-Image works the same way.

What would settle it

A controlled experiment that trains an otherwise identical model using VAE latents instead of semantic encoder features and measures whether manipulation and perception accuracy drops below the reported levels.
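One way to read the outcome of such an ablation is a paired comparison over the same test cases. The following is a minimal sketch, not the paper's evaluation code. It assumes each variant yields one score per test case (for instance 1.0 for a successful edit and 0.0 otherwise); the function name `paired_bootstrap_diff` and its parameters are hypothetical. It estimates the mean score difference between the semantic-encoder variant and the VAE variant, with a percentile bootstrap interval.

```python
import random
from statistics import mean


def paired_bootstrap_diff(semantic_scores, vae_scores,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference (semantic - VAE) with a percentile bootstrap interval.

    Both lists must hold per-example scores for the same test cases,
    in the same order. All names here are illustrative assumptions.
    """
    if len(semantic_scores) != len(vae_scores) or not semantic_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    # Per-example differences keep the pairing between the two variants.
    diffs = [s - v for s, v in zip(semantic_scores, vae_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)

    # Resample test cases with replacement; record the mean difference each time.
    resampled = sorted(
        mean(rng.choice(diffs) for _ in range(n)) for _ in range(n_resamples)
    )

    # Percentile interval: cut alpha/2 from each tail of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), (resampled[lo_idx], resampled[hi_idx])
```

Under these assumptions, an interval that lies entirely above zero would support the semantic-encoder variant, while an interval that straddles zero would leave the premise unsettled.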

Original abstract

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces UniWorld-V1, a unified generative framework for visual understanding, generation, manipulation, and perception. Motivated by behavioral experiments on the closed-source GPT-4o-Image suggesting that semantic encoders (rather than VAEs) drive its capabilities, the model extracts features from open multimodal LLMs and contrastive encoders. Trained on only 2.7M samples, it claims strong performance across diverse tasks and is fully open-sourced with weights, scripts, and datasets.

Significance. If the performance claims are substantiated with rigorous metrics and the semantic-encoder insight proves robust, the work could advance efficient unified vision models by showing competitive results with limited data and reduced VAE reliance. The explicit open-sourcing of the complete framework (including training/evaluation code and datasets) is a clear strength that directly supports reproducibility and further research in the field.

major comments (3)
  1. [Introduction] Introduction/Motivation section: The load-bearing claim that GPT-4o-Image 'likely relies on semantic encoders rather than VAEs' is derived from indirect, non-replicable behavioral experiments on a proprietary model. The manuscript provides no details on experimental design, controls, quantitative probes, or alternative explanations, making it impossible to verify whether this observation justifies the architectural choice or confers the claimed advantages for high-resolution manipulation.
  2. [Experiments] Experiments/Results section: The central performance claim ('impressive performance across diverse tasks' with 2.7M data) lacks any reported quantitative metrics, baselines, ablations, or error analysis in the provided summary. Without these, the contribution of the semantic-feature approach versus prior unified models cannot be evaluated, undermining the empirical support for the framework.
  3. [Method] Method section: The integration of semantic features from MLLMs and contrastive encoders for high-resolution generation and manipulation is described at a high level but lacks specifics on feature fusion, resolution handling, or loss formulations that would allow assessment of why this outperforms VAE-based alternatives.
minor comments (2)
  1. [Abstract] Abstract: While the high-level claims are clear, inclusion of at least one or two key quantitative results (e.g., FID, accuracy on specific benchmarks) would strengthen the summary and align with standard practice for empirical vision papers.
  2. [Overall] Overall presentation: Ensure all tables and figures are explicitly referenced in the text and include error bars or statistical significance where performance comparisons are shown.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point-by-point below, providing clarifications and indicating where the manuscript has been revised. Our responses focus on strengthening the empirical grounding and technical details without overstating what the current experiments can demonstrate.

Point-by-point responses
  1. Referee: [Introduction] Introduction/Motivation section: The load-bearing claim that GPT-4o-Image 'likely relies on semantic encoders rather than VAEs' is derived from indirect, non-replicable behavioral experiments on a proprietary model. The manuscript provides no details on experimental design, controls, quantitative probes, or alternative explanations, making it impossible to verify whether this observation justifies the architectural choice or confers the claimed advantages for high-resolution manipulation.

    Authors: We agree that the proprietary nature of GPT-4o-Image limits full replicability. In the revised manuscript we have substantially expanded the Introduction and added a new supplementary section to describe the behavioral experiments in greater detail: the exact prompt templates used for perception and manipulation queries, the quantitative success-rate metrics collected across 200+ test cases, the controls for output consistency, and a brief discussion of alternative explanations (e.g., possible internal VAE components). We also explicitly state the limitations of behavioral probing on closed models and frame the GPT-4o-Image observation as motivation rather than definitive proof. These additions allow readers to better assess the strength of the motivation while remaining honest about what can be verified.

    revision: partial

  2. Referee: [Experiments] Experiments/Results section: The central performance claim ('impressive performance across diverse tasks' with 2.7M data) lacks any reported quantitative metrics, baselines, ablations, or error analysis in the provided summary. Without these, the contribution of the semantic-feature approach versus prior unified models cannot be evaluated, undermining the empirical support for the framework.

    Authors: The full manuscript already contains quantitative results (FID, CLIP score, task accuracy, manipulation success rate) with comparisons to prior unified models and VAE-based baselines, plus ablations on the semantic encoder choice. These appear in Section 4 and the supplementary material. To address the concern that the summary obscured this, we have added a concise main-text results table and an explicit error-analysis paragraph in the revised version. We also include additional baseline numbers (e.g., against LLaVA-based and SDXL-based unified models) that were previously only in the supplement.

    revision: yes

  3. Referee: [Method] Method section: The integration of semantic features from MLLMs and contrastive encoders for high-resolution generation and manipulation is described at a high level but lacks specifics on feature fusion, resolution handling, or loss formulations that would allow assessment of why this outperforms VAE-based alternatives.

    Authors: We have revised the Method section (Section 3) to provide the requested technical details: (1) feature fusion is performed via a multi-scale cross-attention module that aligns MLLM token features with contrastive image embeddings; (2) resolution handling uses a cascaded decoder with progressive upsampling guided by semantic masks; (3) the composite loss combines pixel reconstruction, perceptual (LPIPS), and semantic alignment losses with explicit weighting coefficients. We also added a diagram and pseudocode for the fusion step. These specifics clarify how semantic features enable high-resolution manipulation without relying on a VAE bottleneck.

    revision: yes
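The composite loss named in the third simulated response can be sketched as a weighted sum of three terms. This is an illustration only: the response gives neither the weighting coefficients nor the exact form of each term, so the weights `w_pixel`, `w_perceptual`, and `w_semantic`, the mean-squared-error pixel term, and the cosine-distance alignment term are all assumptions. LPIPS requires a pretrained network, so the perceptual distance is passed in as a precomputed number rather than computed here.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length flat lists of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pixel lists must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / norm


def composite_loss(pred_pixels, target_pixels, perceptual_dist,
                   pred_feat, target_feat,
                   w_pixel=1.0, w_perceptual=1.0, w_semantic=1.0):
    """Weighted sum of pixel, perceptual, and semantic-alignment terms.

    perceptual_dist stands in for an LPIPS value computed elsewhere.
    The default weights are placeholders, not values from the paper.
    """
    pixel_term = mse(pred_pixels, target_pixels)
    semantic_term = cosine_distance(pred_feat, target_feat)
    return (w_pixel * pixel_term
            + w_perceptual * perceptual_dist
            + w_semantic * semantic_term)
```

A perfect reconstruction with matching features and zero perceptual distance gives a loss of zero; raising `w_semantic` relative to `w_pixel` would, in this sketch, trade pixel fidelity for agreement in feature space.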

standing simulated objections not resolved
  • Full replicability of the GPT-4o-Image behavioral experiments remains impossible because the model is closed-source; we can only describe our probing methodology and results in as much detail as access permits.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper presents an empirical framework motivated by behavioral observations on a closed-source model (GPT-4o-Image) and constructs UniWorld-V1 around semantic encoders extracted from open MLLMs. No equations, derivations, or parameter-fitting steps appear that reduce any claimed result to a self-defined input or fitted quantity by construction. The central insight is presented as an external experimental observation rather than a tautology, self-citation chain, or renamed known result. The work remains self-contained against external benchmarks with no load-bearing reduction to the authors' own prior definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; all modeling details are deferred to the unavailable full text.

pith-pipeline@v0.9.0 · 5513 in / 1153 out tokens · 26451 ms · 2026-05-12T17:28:55.213641+00:00 · methodology


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  2. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  3. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  4. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  5. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  6. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  7. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  8. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  9. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  10. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  11. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  12. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  13. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  14. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  15. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  16. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  17. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  18. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  19. LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

  20. InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

    cs.CV 2026-04 unverdicted novelty 6.0

    InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

  21. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  22. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  23. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  24. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  25. Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.

  26. FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...

  27. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  28. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  29. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  30. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  31. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  32. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  33. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 31 Pith papers · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  3. [3]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  4. [4]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  5. [5]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  6. [6]

    Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think

    Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think. arXiv preprint arXiv:2502.20172, 2025

  7. [7]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  8. [8]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 11

  10. [10]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  11. [11]

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024

  12. [12]

    Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396,

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  13. [13]

    Experiment with gemini 2.0 flash native image generation, 2025

    Google Gemini2. Experiment with gemini 2.0 flash native image generation, 2025

  14. [14]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  15. [15]

    Black Forest Labs. Flux. https://bfl.ai/announcements/24-11-21-tools , 2024

  16. [16]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  17. [17]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  18. [18]

    Visualcloze: A universal image generation framework via visual in-context learning

    Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming- Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960, 2025

  19. [19]

    Langbridge: Interpreting image as a combination of language embeddings

    Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, et al. Langbridge: Interpreting image as a combination of language embeddings. arXiv preprint arXiv:2503.19404, 2025

  20. [20]

    Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131,

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

  21. [21]

    Moe-llava: Mix- ture of experts for large vision-language models

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024

  22. [22]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023

  23. [23]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

  24. [24]

    Playground v3: Improving text-to-image alignment with deep-fusion large language models

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Im- proving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

  25. [25]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  26. [27]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  27. [28]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  28. [29]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025

  29. [30]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025

  30. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  31. [32]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  32. [33]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  33. [34]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  34. [35]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  35. [36]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  36. [37]

    Improving image captioning with better use of captions

    Zhan Shi, Xu Zhou, Xipeng Qiu, and Xiaodan Zhu. Improving image captioning with better use of captions. arXiv preprint arXiv:2006.11807, 2020

  37. [38]

    Omniconsistency: Learning style-agnostic consistency from paired stylization data

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. 2025

  38. [39]

    RADIOv2.5 Team. Flux. https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_tech_report.md, 2024

  39. [40]

    Metamorph: Multimodal understanding and generation via instruction tuning

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

  40. [41]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  41. [42]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  42. [43]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  43. [44]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

  44. [45]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024

  45. [46]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  46. [47]

    Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation

    Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation. arXiv preprint arXiv:2504.02782, 2025

  47. [48]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

  48. [49]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. arXiv preprint arXiv:2411.15738, 2024

  49. [50]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

  50. [51]

    Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation

    Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025

  51. [52]

    Identity-preserving text-to-video generation by frequency decomposition

    Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. arXiv preprint arXiv:2411.17440, 2024

  52. [53]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  53. [54]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  54. [55]

    Are unified vision-language models necessary: Generalization across understanding and generation

    Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, and Yu Cheng. Are unified vision-language models necessary: Generalization across understanding and generation. arXiv preprint arXiv:2505.23043, 2025

  55. [56]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

  56. [57]

    Llava-next: A strong zero-shot video understanding model

    Y Zhang, B Li, H Liu, Y Lee, L Gui, D Fu, J Feng, Z Liu, and C Li. Llava-next: A strong zero-shot video understanding model. 2024

  57. [58]

    In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025

  58. [59]

    Ultraedit: Instruction-based fine-grained image editing at scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024

  59. [60]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023