SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

Chenyang Zhu; Hongxiang Li; Hongxu Chen; Jiayin Cai; Long Chen; Xiaolong Jiang; Xiaoshuang Huang; Yao Hu

arxiv: 2606.23041 · v2 · pith:S2FMRM4Jnew · submitted 2026-06-22 · 💻 cs.CV

Hongxiang Li, Hongxu Chen, Chenyang Zhu, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

Pith reviewed 2026-07-03 23:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified multimodal models · semantic-pixel self-alignment · dual-stream tokenizer · adaptive routing · diffusion models · visual generation · MLLM · self-aligned generation

The pith

A dual-stream tokenizer with internal self-alignment unifies semantic understanding and pixel generation in one multimodal model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the mismatch between semantic perception and pixel reconstruction that limits multimodal large language models in generation tasks. It introduces an asymmetric dual-stream unified tokenizer where one lightweight stream anchors discriminative semantic features and a second Transformer-augmented stream recovers fine-grained pixel details, both feeding a single compact latent space. A self-aligned generation paradigm then uses this tokenizer directly as an internal teacher to train a diffusion model, eliminating external alignment models. Dynamic token routing further lets each token pull relevant features from multiple MLLM layers according to its own needs. If successful, the approach would let a single architecture deliver high-quality image generation and reconstruction while retaining strong visual understanding, reaching state-of-the-art results among unified systems.

Core claim

SPAR establishes a unified multimodal framework through semantic-pixel self-alignment and adaptive routing. The asymmetric dual-stream unified tokenizer reconciles semantic perception with pixel-level reconstruction by mapping a lightweight semantic stream and a Transformer-augmented pixel stream into a unified compact latent space. The self-aligned generation paradigm leverages the optimized tokenizer as an internal alignment teacher for the diffusion model without external dependencies. Dynamic token routing enables each token to adaptively aggregate multi-layer MLLM features based on its semantic demands. Extensive experiments show this yields state-of-the-art performance for unified archit…

What carries the argument

The asymmetric dual-stream unified tokenizer, which anchors discriminative semantic features in a lightweight stream and recovers fine-grained pixel details in a Transformer-augmented stream, with both streams feeding one compact latent space.
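The text on this page gives no equations for the tokenizer, so the following is only a minimal sketch of the asymmetry it describes: one stream is a single projection, the other adds an attention step, and both are merged into a smaller latent. The function names, the additive fusion, the identity-projection attention, and all dimensions are assumptions made for illustration, not the paper's design.

```python
import math
import random

def linear(x, w):
    """Project token vectors x (n x d_in) with a weight matrix w (d_in x d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(tok)) for j in range(len(w[0]))]
            for tok in x]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def self_attention(x):
    """Single-head attention with identity projections and a residual connection."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        mixed = [sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def dual_stream_tokenize(patches, w_sem, w_pix, w_out):
    """Hypothetical dual-stream tokenizer: names and fusion rule are assumptions."""
    semantic = linear(patches, w_sem)               # lightweight stream: one projection
    pixel = self_attention(linear(patches, w_pix))  # heavier stream: projection + attention
    fused = [[s + p for s, p in zip(st, pt)] for st, pt in zip(semantic, pixel)]
    return linear(fused, w_out)                     # shared compact latent

rng = random.Random(0)
n_tokens, d_in, d_hidden, d_latent = 4, 8, 6, 3
patches = rand_matrix(n_tokens, d_in, rng)
latent = dual_stream_tokenize(
    patches,
    rand_matrix(d_in, d_hidden, rng),
    rand_matrix(d_in, d_hidden, rng),
    rand_matrix(d_hidden, d_latent, rng),
)
print(len(latent), len(latent[0]))  # 4 3
```

The point of the sketch is the shape of the computation: every input token yields one latent vector of dimension `d_latent`, and that single vector has to serve both understanding and reconstruction, which is exactly the premise the rest of this page treats as load-bearing.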

If this is right

Unified multimodal models can perform both high-fidelity image generation and accurate visual understanding without separate specialized components.
The diffusion model aligns to semantic spaces using only internal tokenizer feedback, removing the need for external teacher models.
Dynamic token routing allows flexible aggregation of features from different MLLM layers for each token according to its distinct demands.
The framework achieves state-of-the-art results among unified architectures while preserving core visual understanding capabilities.
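The routing idea in the list above can be sketched as a per-token softmax over layers. This is an illustrative guess at the mechanism, assuming a learned gate vector scores each layer's feature for a given token; `route_token` and `gate_w` are hypothetical names, and the paper may use a different gating function.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def route_token(layer_feats, gate_w):
    """Mix one token's features from several MLLM layers.

    layer_feats: list of L vectors, one per layer, all for the same token.
    gate_w: hypothetical gate vector; the score of layer l is dot(gate_w, layer_feats[l]).
    Returns the weighted mixture and the per-layer weights.
    """
    scores = [sum(g * f for g, f in zip(gate_w, feat)) for feat in layer_feats]
    weights = softmax(scores)
    d = len(layer_feats[0])
    mixed = [sum(w * feat[j] for w, feat in zip(weights, layer_feats)) for j in range(d)]
    return mixed, weights

gate_w = [1.0, 0.0]
token_a = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]  # features from 3 layers
token_b = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mixed_a, weights_a = route_token(token_a, gate_w)
mixed_b, weights_b = route_token(token_b, gate_w)
print(weights_a.index(max(weights_a)))  # 1: token_a leans on the middle layer
print(weights_b.index(max(weights_b)))  # 0: token_b leans on the first layer
```

Because the weights are computed per token, two tokens in the same sequence can draw on different layers, which is the behavior the "distinct demands" phrasing points at.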

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The internal self-alignment could lower training costs by avoiding separate teacher networks for alignment.
The single latent space might support downstream tasks such as language-guided image editing within the same model without additional adapters.
Applying the dual-stream design to video or audio modalities could test whether the same unification pattern holds beyond static images.
Direct measurement of token routing decisions on held-out image categories would show whether the adaptive aggregation improves robustness on diverse inputs.

Load-bearing premise

The asymmetric dual-stream unified tokenizer can anchor discriminative semantic features and recover fine-grained pixel details simultaneously in a single compact latent space without external supervision or loss of either capability.

What would settle it

A controlled comparison in which a model built with the SPAR tokenizer shows either substantially lower visual understanding accuracy than standard MLLMs or substantially lower generation and reconstruction quality than specialized diffusion models would disprove the central claim.

Figures

Figures reproduced from arXiv: 2606.23041 by Chenyang Zhu, Hongxiang Li, Hongxu Chen, Jiayin Cai, Long Chen, Xiaolong Jiang, Xiaoshuang Huang, Yao Hu.

**Figure 1.** (a) Image Reconstruction: When modeling directly within the semantic representation space, existing methods suffer from lossy compression and struggle to preserve high-frequency details. In contrast, our method effectively recovers these crucial pixel-level details. (b) Representation Alignment Paradigm: Unlike existing approaches that rely on external semantic encoders to guide the generative model, our …

**Figure 2.** Architecture of the semantic-pixel self-aligned unified tokenizer.

**Figure 3.** Overview of the unified multimodal model.

**Figure 4.** Qualitative results of image generation. Compared with the second-best OmniGen2 (3.44), SPAR-3B yields an absolute improvement of +0.58. Across specific editing categories, SPAR-3B achieves the best performance on various subtypes, with particularly pronounced advantages on semantically demanding tasks, indicating that the rich semantic representations within the SPAR tokenizer effectively enhance the edi…

read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring **S**emantic-**P**ixel self-alignment and **A**daptive **R**outing (**SPAR**). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.
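The abstract does not state the form of the self-alignment objective. One common way to align a generator's hidden features with a teacher representation is a cosine-similarity loss, and the sketch below assumes that form purely for illustration. Here the "teacher" vectors stand in for the tokenizer's features, treated as fixed targets, and the "student" vectors stand in for projected diffusion hidden states; both names and the loss itself are assumptions, not the paper's formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def self_alignment_loss(student_feats, teacher_feats):
    """1 minus the mean per-token cosine similarity (assumed loss form).

    student_feats: hypothetical projected diffusion features, one vector per token.
    teacher_feats: hypothetical tokenizer features for the same tokens, held fixed.
    """
    sims = [cosine(s, t) for s, t in zip(student_feats, teacher_feats)]
    return 1.0 - sum(sims) / len(sims)

teacher = [[1.0, 2.0], [0.0, 3.0]]
aligned = [[2.0, 4.0], [0.0, 1.0]]      # same directions as the teacher
orthogonal = [[2.0, -1.0], [3.0, 0.0]]  # perpendicular to the teacher

loss_aligned = self_alignment_loss(aligned, teacher)
loss_orthogonal = self_alignment_loss(orthogonal, teacher)
print(round(loss_aligned, 6), round(loss_orthogonal, 6))
```

Under this assumed loss, features pointing in the teacher's direction give a loss near 0 and perpendicular features give a loss near 1. What makes the paradigm "self-aligned" in the abstract's sense is only where the teacher comes from: the model's own tokenizer instead of a separately pretrained encoder.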

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SPAR's asymmetric tokenizer plus internal self-alignment looks like a coherent way to reduce external dependencies in unified MLLMs, but the SOTA claims rest on experiments we can't see here.

The paper's core move is an asymmetric dual-stream tokenizer that keeps a lightweight semantic path for discrimination while a heavier pixel path tries to recover details in the same latent space, then uses that tokenizer itself as the alignment signal for the diffusion generator instead of an external teacher. On top of that it adds dynamic token routing that pulls features from different MLLM layers per token. That combination is what they present as new.

The design is laid out cleanly and directly targets the stated mismatch between semantic encoders and pixel reconstruction. Avoiding external teachers is a practical advantage if it works. The routing idea also feels like a reasonable response to the varying needs of different tokens.

The soft spot is exactly the one the abstract flags: whether the single compact latent space can hold both strong semantic features and fine pixel fidelity without measurable loss on either side. The abstract asserts SOTA generation and reconstruction while keeping understanding intact, but supplies no numbers, ablations, or comparisons to check that. Without those, the central assumption stays untested.

This is for people already working on unified understanding-plus-generation models. Anyone trying to cut down on separate alignment stages or external teachers would find the high-level architecture worth looking at. The paper deserves a serious referee to examine the actual results and training details; the argument itself is internally consistent and the problem it attacks is real.

Referee Report

2 major / 0 minor

Summary. The paper introduces SPAR, a unified multimodal framework for MLLMs that addresses the discrepancy between semantic perception and pixel-level reconstruction. It proposes an asymmetric dual-stream unified tokenizer (lightweight semantic stream plus Transformer-augmented pixel stream) to create a compact latent space, a self-aligned generation paradigm that uses the tokenizer as an internal teacher for the diffusion model, and Dynamic Token Routing to aggregate multi-layer MLLM features adaptively. The abstract claims this yields SOTA performance on generation, reconstruction, and visual understanding without external supervision.

Significance. If the quantitative results and ablations hold, the work would be significant for enabling self-contained unified architectures that avoid external teachers while preserving both discriminative and generative capabilities, a key open challenge in multimodal modeling.

major comments (2)

[Abstract] The central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.
[Abstract] The assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments on the abstract below, clarifying where the supporting material appears in the full manuscript while noting that abstracts are necessarily concise.

Referee: [Abstract] The central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.

Authors: The loss formulation for the asymmetric dual-stream tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) and the self-aligned generation paradigm (using the tokenizer as internal teacher) is derived in Sections 3.1 and 3.2. The objective combines semantic anchoring losses with pixel reconstruction terms without external models. Ablations confirming preservation of both capabilities appear in Section 4.3 (Table 4), showing performance retention on understanding benchmarks when generation is enabled. We can add a parenthetical reference to these sections in a revised abstract if the editor prefers.

Revision: partial

Referee: [Abstract] The assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.

Authors: Quantitative results, metrics (FID, PSNR, CLIP score, VQA accuracy, etc.), baselines, and experimental setup are reported in Section 4, with direct comparisons in Tables 1 (generation), 2 (reconstruction), and 5 (understanding). The setup details (datasets, training protocol, evaluation protocols) are in Section 4.1. These tables support the SOTA claims for unified models. If the referee did not locate these sections, we can improve cross-referencing from the abstract.

Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes an asymmetric dual-stream tokenizer and self-aligned generation paradigm as novel architectural choices that reconcile semantic and pixel features without external supervision. No equations, fitted parameters, or self-citations are visible in the provided text that reduce any claimed prediction or result to an input by construction. The design elements (Dynamic Token Routing, internal alignment teacher) are presented as independent innovations rather than renamings or self-referential definitions. The derivation chain remains self-contained against external benchmarks, with the central assumption being an empirical design claim rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no equations, training details, or component specifications are given, so no free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

85 extracted references · 55 canonical work pages · 36 internal anchors

[1]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., 16 H. Li et al. Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

2025
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

2023
[5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 11

2021
[6]

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Chen, H., Li, H., Wang, Z., Chen, L.: Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint arXiv:2601.21542 (2026) 2

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.180954, 11

work page arXiv 2025
[9]

arXiv preprint arXiv:2410.10733 (2024) 10

Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10, 11

work page arXiv 2024
[10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 5, 11, 12, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

In: Forty-first international conference on machine learning (2024) 2, 12

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

2024
[13]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

2023
[16]

arXiv preprint arXiv:2506.18898 (2025) 11, 12

Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

work page arXiv 2025
[17]

Illume+: Il- luminating unified mllm with dual visual tokenization and diffusion refinement.arXiv preprint arXiv:2504.01934,

Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

work page arXiv 2025
[18]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 17

work page internal anchor Pith review Pith/arXiv arXiv 2013
[19]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

2024
[20]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

2025
[21]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 12

2024
[24]

In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

Li, H., Li, Y., Lin, B., Niu, Y., Yang, Y., Huang, X., Cai, J., Jiang, X., Hu, Y., Chen, L.: GIR-bench: Versatile benchmark for generating images with reasoning. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

2026
[25]

In: International Conference on Learning Representations

Li, H., Li, Y., Yang, Y., Cao, J., Zhu, Z., Cheng, X., Chen, L.: Dispose: Disen- tangling pose guidance for controllable human image animation. In: International Conference on Learning Representations. vol. 2025, pp. 72213–72231 (2025) 2

2025
[26]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 5, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

arXiv preprint arXiv:2505.05422 (2025) 11

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

work page arXiv 2025
[28]

Advances in neural information processing systems36, 34892–34916 (2023) 1

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

2023
[29]

Step1X-Edit: A Practical Framework for General Image Editing

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 12

2024
[31]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

work page arXiv 2025
[32]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321,

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

work page arXiv 2025
[33]

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

2025
[35]

Li et al

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, 18 H. Li et al. G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual f...

2023
[36]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

2025
[38]

Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

2022
[40]

Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014

work page arXiv 2025
[41]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Advances in neural information processing systems36, 49659–49678 (2023) 11

Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

2023
[43]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

2024
[45]

Emu: Generative Pretraining in Multimodality

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 5, 10

work page arXiv 2025
[47]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

2024
[49]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 19

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 12

2024
[51]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Advances in neural information processing systems30(2017) 4

Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

2017
[53]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Emu3: Next-Token Prediction is All You Need

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: LISA: Likelihood score alignment for visual-condition controllable generation. arXiv preprint arXiv:2606.27192 (2026)
[56]

Wang, Y., Jiang, Z., Wang, Z., Chen, L.: Coarse-guided visual generation via weighted h-transform sampling. arXiv preprint arXiv:2603.12057 (2026)
[57]

Wang, Y., Wang, Z., Chen, L.: Target-aware image editing via cycle-consistent constraints. arXiv preprint arXiv:2510.20212 (2025)
[58]

Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: GPT-Image-Edit-1.5M: A million-scale, GPT-generated image dataset. arXiv preprint arXiv:2507.21033 (2025)
[59]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
[60]

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
[61]

Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: OpenUni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025)
[62]

Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and generation (2025), https://arxiv.org/abs/2503.21979
[63]

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: VILA-U: A unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024)
[64]

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: OmniGen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)
[65]

Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: SANA: Efficient high-resolution image synthesis with linear diffusion transformer (2024), https://arxiv.org/abs/2410.10629
[66]

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)
[67]

Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564 (2025)
[68]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

arXiv 2025
[69]

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025)
[70]

Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

arXiv 2023
[71]

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: AnyEdit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025)
[72]

Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37, 128940–128966 (2024)
[73]

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025)
[74]

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)
[75]

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556–9567 (2024)
[76]

Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: UniFlow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025)
[77]

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)
[78]

Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025)
[79]

Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
[80]

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024)


[1] [1]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., 16 H. Li et al. Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

2025

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

2023

[5] [5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 11

2021

[6] [6]

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Chen, H., Li, H., Wang, Z., Chen, L.: Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint arXiv:2601.21542 (2026) 2

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.180954, 11

work page arXiv 2025

[9] [9]

arXiv preprint arXiv:2410.10733 (2024) 10

Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10, 11

work page arXiv 2024

[10] [10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 5, 11, 12, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

In: Forty-first international conference on machine learning (2024) 2, 12

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

2024

[13] [13]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

2023

[16] [16]

arXiv preprint arXiv:2506.18898 (2025) 11, 12

Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

work page arXiv 2025

[17] [17]

Illume+: Il- luminating unified mllm with dual visual tokenization and diffusion refinement.arXiv preprint arXiv:2504.01934,

Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

work page arXiv 2025

[18] [18]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 17

work page internal anchor Pith review Pith/arXiv arXiv 2013

[19] [19]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

2024

[20] [20]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

2025

[21] [21]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 12

2024

[24] [24]

In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

Li, H., Li, Y., Lin, B., Niu, Y., Yang, Y., Huang, X., Cai, J., Jiang, X., Hu, Y., Chen, L.: GIR-bench: Versatile benchmark for generating images with reasoning. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

2026

[25] [25]

In: International Conference on Learning Representations

Li, H., Li, Y., Yang, Y., Cao, J., Zhu, Z., Cheng, X., Chen, L.: Dispose: Disen- tangling pose guidance for controllable human image animation. In: International Conference on Learning Representations. vol. 2025, pp. 72213–72231 (2025) 2

2025

[26] [26]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 5, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

arXiv preprint arXiv:2505.05422 (2025) 11

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

work page arXiv 2025

[28] [28]

Advances in neural information processing systems36, 34892–34916 (2023) 1

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

2023

[29] [29]

Step1X-Edit: A Practical Framework for General Image Editing

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 12

2024

[31] [31]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

work page arXiv 2025

[32] [32]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321,

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

work page arXiv 2025

[33] [33]

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

2025

[35] [35]

Li et al

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, 18 H. Li et al. G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual f...

2023

[36] [36]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

2025

[38] [38]

Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

2022

[40] [40]

Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014

work page arXiv 2025

[41] [41]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Advances in neural information processing systems36, 49659–49678 (2023) 11

Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

2023

[43] [43]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

2024

[45] [45]

Emu: Generative Pretraining in Multimodality

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 5, 10

work page arXiv 2025

[47] [47]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

2024

[49] [49]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 19

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 12

2024

[51] [51]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Advances in neural information processing systems30(2017) 4

Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

2017

[53] [53]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Emu3: Next-Token Prediction is All You Need

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likeli- hood score alignment for visual-condition controllable generation. arXiv preprint arXiv:2606.27192 (2026) 2

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

arXiv preprint arXiv:2603.12057 (2026) 2

Wang, Y., Jiang, Z., Wang, Z., Chen, L.: Coarse-guided visual generation via weighted h-transform sampling. arXiv preprint arXiv:2603.12057 (2026) 2

work page arXiv 2026

[57] [57]

arXiv preprint arXiv:2510.20212 (2025) 2

Wang, Y., Wang, Z., Chen, L.: Target-aware image editing via cycle-consistent constraints. arXiv preprint arXiv:2510.20212 (2025) 2

work page arXiv 2025

[58] [58]

arXiv preprint arXiv:2507.21033 (2025) 11

Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025) 11

work page arXiv 2025

[59] [59]

Qwen-Image Technical Report

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[61] [61]

arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025) 4, 12

work page arXiv 2025

[62] [62]

Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration (2025),https://arxiv.org/abs/2503.2197911, 12

work page arXiv 2025

[63] [63]

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025) 14

2025

[65] [65]

Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024),https://arxiv.org/abs/2410.1062910, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [66]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024) 4, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [67]

Show-o2: Improved Native Unified Multimodal Models

Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv preprint arXiv:2506.15564 (2025) 12 20 H. Li et al

work page internal anchor Pith review Pith/arXiv arXiv 2025

[68] [68]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[69] [69]

generation: Taming optimization dilemma in latent diffusion models

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025) 2, 3, 10

2025

[70] [70]

Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

work page arXiv 2023

[71] [71]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 14

2025

[72] [72]

Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

2024

[73] [73]

In: International Conference on Learning Representations (2025) 2, 3, 4

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 2, 3, 4

2025

[74] [74]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm- vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023) 12

[75]

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024) 12

[76]

Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: Uniflow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025) 4

[77]

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023) 14

[78]

Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025) 3, 4

[79]

Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) 14

[80]

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024) 14
