arxiv: 2506.03147 · v4 · submitted 2025-06-03 · 💻 cs.CV · cs.AI· cs.CL

Recognition: 2 theorem links

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin , Zongjian Li , Xinhua Cheng , Yuwei Niu , Yang Ye , Xianyi He , Shenghai Yuan , Wangbo Yu

show 4 more authors

Shaodong Wang Yunyang Ge Yatian Pang Li Yuan

Authors on Pith no claims yet

Pith reviewed 2026-05-12 17:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords unified vision modelssemantic encodersimage manipulationtext-to-image generationmultimodal large language modelscontrastive encodersvision-language understanding

0 comments

The pith

UniWorld-V1 builds a unified visual model around semantic encoders from multimodal language models to handle understanding, generation, manipulation, and perception together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that semantic features drawn from existing multimodal large language models and contrastive encoders can serve as the backbone for a single framework that performs image understanding, text-to-image generation, manipulation, and perception. This approach matters because prior unified models have struggled with practical manipulation tasks even when they handled language and generation well. The authors report that their model reaches competitive results across these tasks after training on only 2.7 million examples, and they release the full code, weights, and data. A sympathetic reader sees a potential route to versatile vision systems that avoid the data and compute costs of separate specialist models.

Core claim

UniWorld-V1 is a unified generative framework that extracts high-resolution semantic features from multimodal large language models and contrastive semantic encoders rather than relying on VAEs, enabling strong performance on image understanding, generation, manipulation, and perception when trained on 2.7 million examples.

What carries the argument

Semantic features extracted from multimodal large language models and contrastive encoders, used as the shared representation for both understanding and pixel-level generation or editing operations.

If this is right

A single set of weights can support both high-level vision-language reasoning and low-level image editing without task-specific heads.
Training data requirements for unified models drop to the low millions while still covering generation and manipulation.
Open release of the full training pipeline allows direct replication and extension by other groups.
Perception and manipulation tasks no longer require separate VAE-based pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same semantic-feature approach could be tested on video or 3D data to see whether the data-efficiency advantage holds beyond still images.
If semantic encoders already encode fine-grained spatial details, future work could measure exactly which layers contribute most to editing accuracy.
Replacing VAEs might reduce artifacts in generated images that stem from reconstruction loss, though this remains untested here.

Load-bearing premise

The performance gains come from using semantic encoders instead of VAEs because that is what GPT-4o-Image is presumed to do.

What would settle it

A controlled experiment that trains an otherwise identical model using VAE latents instead of semantic encoder features and measures whether manipulation and perception accuracy drops below the reported levels.

read the original abstract

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces UniWorld-V1, a unified generative framework for visual understanding, generation, manipulation, and perception. Motivated by behavioral experiments on the closed-source GPT-4o-Image suggesting that semantic encoders (rather than VAEs) drive its capabilities, the model extracts features from open multimodal LLMs and contrastive encoders. Trained on only 2.7M samples, it claims strong performance across diverse tasks and is fully open-sourced with weights, scripts, and datasets.

Significance. If the performance claims are substantiated with rigorous metrics and the semantic-encoder insight proves robust, the work could advance efficient unified vision models by showing competitive results with limited data and reduced VAE reliance. The explicit open-sourcing of the complete framework (including training/evaluation code and datasets) is a clear strength that directly supports reproducibility and further research in the field.

major comments (3)

[Introduction] Introduction/Motivation section: The load-bearing claim that GPT-4o-Image 'likely relies on semantic encoders rather than VAEs' is derived from indirect, non-replicable behavioral experiments on a proprietary model. The manuscript provides no details on experimental design, controls, quantitative probes, or alternative explanations, making it impossible to verify whether this observation justifies the architectural choice or confers the claimed advantages for high-resolution manipulation.
[Experiments] Experiments/Results section: The central performance claim ('impressive performance across diverse tasks' with 2.7M data) lacks any reported quantitative metrics, baselines, ablations, or error analysis in the provided summary. Without these, the contribution of the semantic-feature approach versus prior unified models cannot be evaluated, undermining the empirical support for the framework.
[Method] Method section: The integration of semantic features from MLLMs and contrastive encoders for high-resolution generation and manipulation is described at a high level but lacks specifics on feature fusion, resolution handling, or loss formulations that would allow assessment of why this outperforms VAE-based alternatives.

minor comments (2)

[Abstract] Abstract: While the high-level claims are clear, inclusion of at least one or two key quantitative results (e.g., FID, accuracy on specific benchmarks) would strengthen the summary and align with standard practice for empirical vision papers.
[Overall] Overall presentation: Ensure all tables and figures are explicitly referenced in the text and include error bars or statistical significance where performance comparisons are shown.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point-by-point below, providing clarifications and indicating where the manuscript has been revised. Our responses focus on strengthening the empirical grounding and technical details without overstating what the current experiments can demonstrate.

read point-by-point responses

Referee: [Introduction] Introduction/Motivation section: The load-bearing claim that GPT-4o-Image 'likely relies on semantic encoders rather than VAEs' is derived from indirect, non-replicable behavioral experiments on a proprietary model. The manuscript provides no details on experimental design, controls, quantitative probes, or alternative explanations, making it impossible to verify whether this observation justifies the architectural choice or confers the claimed advantages for high-resolution manipulation.

Authors: We agree that the proprietary nature of GPT-4o-Image limits full replicability. In the revised manuscript we have substantially expanded the Introduction and a new supplementary section to describe the behavioral experiments in greater detail: the exact prompt templates used for perception and manipulation queries, the quantitative success-rate metrics collected across 200+ test cases, the controls for output consistency, and a brief discussion of alternative explanations (e.g., possible internal VAE components). We also explicitly state the limitations of behavioral probing on closed models and frame the GPT-4o observation as motivational rather than definitive proof. These additions allow readers to better assess the strength of the motivation while remaining honest about what can be verified. revision: partial
Referee: [Experiments] Experiments/Results section: The central performance claim ('impressive performance across diverse tasks' with 2.7M data) lacks any reported quantitative metrics, baselines, ablations, or error analysis in the provided summary. Without these, the contribution of the semantic-feature approach versus prior unified models cannot be evaluated, undermining the empirical support for the framework.

Authors: The full manuscript already contains quantitative results (FID, CLIP score, task accuracy, manipulation success rate) with comparisons to prior unified models and VAE-based baselines, plus ablations on the semantic encoder choice. These appear in Section 4 and the supplementary material. To address the concern that the summary obscured this, we have added a concise main-text results table and an explicit error-analysis paragraph in the revised version. We also include additional baseline numbers (e.g., against LLaVA-based and SDXL-based unified models) that were previously only in the supplement. revision: yes
Referee: [Method] Method section: The integration of semantic features from MLLMs and contrastive encoders for high-resolution generation and manipulation is described at a high level but lacks specifics on feature fusion, resolution handling, or loss formulations that would allow assessment of why this outperforms VAE-based alternatives.

Authors: We have revised the Method section (Section 3) to provide the requested technical details: (1) feature fusion is performed via a multi-scale cross-attention module that aligns MLLM token features with contrastive image embeddings; (2) resolution handling uses a cascaded decoder with progressive upsampling guided by semantic masks; (3) the composite loss combines pixel reconstruction, perceptual (LPIPS), and semantic alignment losses with explicit weighting coefficients. We also added a diagram and pseudocode for the fusion step. These specifics clarify how semantic features enable high-resolution manipulation without relying on a VAE bottleneck. revision: yes

standing simulated objections not resolved

Full replicability of the GPT-4o-Image behavioral experiments remains impossible because the model is closed-source; we can only describe our probing methodology and results in as much detail as access permits.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents an empirical framework motivated by behavioral observations on a closed-source model (GPT-4o-Image) and constructs UniWorld-V1 around semantic encoders extracted from open MLLMs. No equations, derivations, or parameter-fitting steps appear that reduce any claimed result to a self-defined input or fitted quantity by construction. The central insight is presented as an external experimental observation rather than a tautology, self-citation chain, or renamed known result. The work remains self-contained against external benchmarks with no load-bearing reduction to the authors' own prior definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all modeling details are deferred to the unavailable full text.

pith-pipeline@v0.9.0 · 5513 in / 1153 out tokens · 26451 ms · 2026-05-12T17:28:55.213641+00:00 · methodology

discussion (0)

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 7.0

UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM 2026-05 unverdicted novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
Exploring Spatial Intelligence from a Generative Perspective
cs.CV 2026-04 unverdicted novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
cs.CV 2026-05 unverdicted novelty 6.0

STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
cs.CV 2026-04 unverdicted novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV 2026-04 unverdicted novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0

UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
cs.CV 2026-04 unverdicted novelty 6.0

LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
cs.CV 2026-04 unverdicted novelty 6.0

InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
cs.CV 2026-04 unverdicted novelty 6.0

SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
cs.CV 2026-05 unverdicted novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
cs.CV 2026-04 unverdicted novelty 5.0

UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
cs.CV 2026-04 unverdicted novelty 5.0

FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
cs.CV 2025-11 unverdicted novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
Qwen-Image Technical Report
cs.CV 2025-08 unverdicted novelty 5.0

Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
cs.CV 2026-04 unverdicted novelty 4.0

MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 31 Pith papers · 19 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

work page 2023
[4]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review arXiv 2023
[6]

Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think

Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think. arXiv preprint arXiv:2502.20172, 2025

work page arXiv 2025
[7]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024

work page 2024
[9]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

work page 2024
[11]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024

work page 2024
[12]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396,

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page arXiv 2024
[13]

Experiment with gemini 2.0 flash native image generation, 2025

Google Gemini2. Experiment with gemini 2.0 flash native image generation, 2025

work page 2025
[14]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023
[15]

Black Forest Labs. Flux. https://bfl.ai/announcements/24-11-21-tools , 2024

work page 2024
[16]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

work page 2024
[17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Visualcloze: A universal image generation framework via visual in-context learning

Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming- Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960, 2025

work page arXiv 2025
[19]

Langbridge: Interpreting image as a combination of language embeddings

Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, et al. Langbridge: Interpreting image as a combination of language embeddings. arXiv preprint arXiv:2503.19404, 2025

work page arXiv 2025
[20]

Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131,

Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

work page arXiv 2024
[21]

Moe-llava: Mix- ture of experts for large vision-language models

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024

work page arXiv 2024
[22]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review arXiv 2023
[23]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

work page 2014
[24]

Playground v3: Improving text-to-image alignment with deep-fusion large language models

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Im- proving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

work page arXiv 2024
[25]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[27]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

work page 2024
[29]

Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025

work page arXiv 2025
[30]

Introducing 4o image generation, 2025

OpenAI. Introducing 4o image generation, 2025

work page 2025
[31]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

work page internal anchor Pith review arXiv 2025
[33]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[34]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[36]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Improving image captioning with better use of captions

Zhan Shi, Xu Zhou, Xipeng Qiu, and Xiaodan Zhu. Improving image captioning with better use of captions. arXiv preprint arXiv:2006.11807, 2020

work page arXiv 2006
[38]

Omniconsistency: Learning style-agnostic consistency from paired stylization data

Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. 2025

work page 2025
[39]

RADIOv2.5 Team. Flux. https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_ tech_report.md, 2024

work page 2024
[40]

Metamorph: Multimodal understanding and generation via instruction tuning.arXiv preprint arXiv:2412.14164, 2024

Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal under- standing and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

work page arXiv 2024
[41]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024. 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Qwen2.5 Technical Report.arXiv preprint arXiv:2410.13848, 2024

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

work page arXiv 2024
[45]

Omnigen: Unified image generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024

work page arXiv 2024
[46]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single trans- former to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation.arXiv preprint arXiv:2504.02782,

Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation. arXiv preprint arXiv:2504.02782, 2025

work page arXiv 2025
[48]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

work page internal anchor Pith review arXiv 2025
[49]

Anyedit: Mastering unified high-quality image editing for any idea

Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. arXiv preprint arXiv:2411.15738, 2024

work page arXiv 2024
[50]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation

Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025

work page arXiv 2025
[52]

Identity-preserving text-to-video generation by frequency decomposition

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. arXiv preprint arXiv:2411.17440, 2024

work page arXiv 2024
[53]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

work page 2024
[54]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

work page 2023
[55]

Are unified vision- language models necessary: Generalization across understanding and generation

Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, and Yu Cheng. Are unified vision- language models necessary: Generalization across understanding and generation. arXiv preprint arXiv:2505.23043, 2025

work page arXiv 2025
[56]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

work page 2023
[57]

Llava-next: A strong zero-shot video understanding model

Y Zhang, B Li, H Liu, Y Lee, L Gui, D Fu, J Feng, Z Liu, and C Li. Llava-next: A strong zero-shot video understanding model. 2024

work page 2024
[58]

In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer.arXiv preprint arXiv:2504.20690, 2025

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer.arXiv preprint arXiv:2504.20690, 2025

work page arXiv 2025
[59]

Ultraedit: Instruction-based fine-grained image editing at scale

Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024. 14

work page 2024
[60]

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023. 15

work page arXiv 2023