SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

Chenyang Zhu; Hongxiang Li; Hongxu Chen; Jiayin Cai; Long Chen; Xiaolong Jiang; Xiaoshuang Huang; Yao Hu

arxiv: 2606.23041 · v1 · pith:S2FMRM4Jnew · submitted 2026-06-22 · 💻 cs.CV

SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

Hongxiang Li , Hongxu Chen , Chenyang Zhu , Xiaoshuang Huang , Jiayin Cai , Xiaolong Jiang , Yao Hu , Long Chen This is my paper

Pith reviewed 2026-06-26 09:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified multimodal modelssemantic-pixel alignmentadaptive routingdiffusion modelsunified tokenizervisual generationself-alignmentMLLM

0 comments

The pith

SPAR reconciles semantic perception and pixel reconstruction in one tokenizer to let multimodal models generate and understand without external teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle visual understanding well but lag in generation because semantic features clash with the needs of pixel-level reconstruction. The paper introduces SPAR to close this gap with an asymmetric dual-stream unified tokenizer that keeps a lightweight semantic stream for strong discrimination while a transformer-augmented pixel stream adds fine details into one compact latent space. This tokenizer then serves directly as an internal teacher to align a diffusion model for generation, removing any need for outside alignment systems. Dynamic token routing further lets each token pull the right multi-layer features based on its own semantic needs. The result is a single architecture that aims to match specialized systems on both understanding and generation tasks.

Core claim

The central claim is that an asymmetric dual-stream unified tokenizer can reconcile semantic perception with pixel-level reconstruction by anchoring discriminative features in a lightweight semantic stream and recovering fine-grained details in a Transformer-augmented pixel stream within a unified compact latent space, and that this tokenizer can natively serve as an internal alignment teacher for a diffusion model in a self-aligned generation paradigm, while Dynamic Token Routing enables flexible multimodal interaction, allowing the unified model to achieve state-of-the-art generation and reconstruction quality while preserving visual understanding.

What carries the argument

Asymmetric dual-stream unified tokenizer, which anchors discriminative features in a lightweight semantic stream and recovers fine-grained visual details in a Transformer-augmented pixel stream inside one compact latent space, then used as internal teacher for diffusion.

If this is right

Unified models reach state-of-the-art generation and reconstruction quality.
Visual understanding capabilities stay intact without separate training branches.
Dynamic token routing lets tokens adaptively aggregate multi-layer features according to their semantic demands.
Generation proceeds without external alignment teachers or added dependencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the internal teacher scales reliably, multimodal training pipelines could drop separate pre-alignment stages and reduce compute overhead.
The routing mechanism might extend naturally to video or audio by swapping the pixel stream while keeping the semantic anchor.
Token-level adaptation could improve efficiency at inference time by skipping unneeded layers for simple tokens.

Load-bearing premise

The dual-stream tokenizer can deliver both high discriminative power for understanding and high-fidelity reconstruction for generation in the same space without one weakening the other.

What would settle it

A controlled ablation where the internal self-alignment produces measurably lower generation quality than an external teacher baseline on standard metrics such as FID while the semantic stream is kept fixed.

Figures

Figures reproduced from arXiv: 2606.23041 by Chenyang Zhu, Hongxiang Li, Hongxu Chen, Jiayin Cai, Long Chen, Xiaolong Jiang, Xiaoshuang Huang, Yao Hu.

**Figure 1.** Figure 1: (a) Image Reconstruction: When modeling directly within the semantic representation space, existing methods suffer from lossy compression and struggle to preserve high-frequency details. In contrast, our method effectively recovers these crucial pixel-level details. (b) Representation Alignment Paradigm: Unlike existing approaches that rely on external semantic encoders to guide the generative model, our … view at source ↗

**Figure 2.** Figure 2: Architecture of the semantic-pixel self-aligned unified tokenizer. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the unified multimodal model. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results of image generation. Compared with the second-best OmniGen2 (3.44), SPAR-3B yields an absolute improvement of +0.58. Across specific editing categories, SPAR-3B achieves the best performance on various subtypes, with particularly pronounced advantages on semantically demanding tasks, indicating that the rich semantic representations within the SPAR tokenizer effectively enhance the edi… view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring \textbf{S}emantic-\textbf{P}ixel self-alignment and \textbf{A}daptive \textbf{R}outing (\textbf{SPAR}). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SPAR's asymmetric tokenizer and internal self-alignment look like a practical route to unified MLLMs, though the SOTA claim still needs the numbers to confirm it.

read the letter

The paper's real contribution is the asymmetric dual-stream tokenizer that keeps a lightweight semantic anchor while letting a pixel stream recover details, then turns that same tokenizer into the diffusion model's internal teacher. Dynamic Token Routing is the other piece, letting tokens pull from different MLLM layers as needed. These moves directly target the semantic-pixel mismatch without pulling in outside models for alignment.

The design is cleanly motivated. The authors spell out the two core problems and link each component to one of them, which makes the architecture easy to follow. Treating the tokenizer as its own teacher is a sensible way to cut external dependencies, and the routing idea adds flexibility without obvious extra cost. If the experiments back this up with solid ablations, the approach could give people a workable template for building unified models.

The main soft spot is that the SOTA assertion sits on top of the mechanism description. The abstract and high-level account do not show the actual metrics, baselines, or ablation tables, so it is still unclear how large the gains are over prior unified setups or whether the understanding capabilities really stay intact. The stress-test found no internal contradictions, which is good, but the performance story remains the part that needs verification.

This is for groups already working on unified perception-generation models who want concrete architecture options rather than another scale-up paper. Readers focused on tokenizer design or self-supervised alignment tricks would get the most out of it.

I would send it to peer review. The idea is coherent enough that referees should see the full results and decide whether the gains are real.

Referee Report

2 major / 0 minor

Summary. The paper proposes SPAR, a unified multimodal framework to bridge semantic perception and pixel-level reconstruction in MLLMs. It introduces an asymmetric dual-stream unified tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) to create a compact latent space, a self-aligned diffusion paradigm that uses this tokenizer as an internal teacher without external dependencies, and Dynamic Token Routing to let tokens adaptively aggregate multi-layer MLLM features. The central claim is that SPAR achieves SOTA performance among unified architectures on generation and reconstruction while preserving visual understanding capabilities.

Significance. If the experimental validation holds, the self-aligned internal-teacher approach and asymmetric tokenizer design could meaningfully advance unified multimodal models by reducing reliance on external alignment mechanisms and addressing the semantic-pixel discrepancy directly in the architecture. The explicit motivation of design choices (lightweight semantic anchor, pixel augmentation) is a strength.

major comments (2)

[Abstract] Abstract: The claim that SPAR 'establishes the state-of-the-art for unified architectures' and achieves 'exceptional generation and reconstruction quality' is presented without any metrics, baselines, ablation studies, tables, or figures. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.
[Abstract] The description of the asymmetric dual-stream tokenizer and self-aligned paradigm (Abstract) asserts that the tokenizer serves as an internal alignment teacher 'without relying on external teachers' and preserves 'foundational visual understanding capabilities,' but no derivation, loss formulation, or verification is supplied to show that the lightweight semantic stream does not degrade discriminative power or that the pixel stream recovers details without introducing new discrepancies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point regarding the abstract below and outline revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that SPAR 'establishes the state-of-the-art for unified architectures' and achieves 'exceptional generation and reconstruction quality' is presented without any metrics, baselines, ablation studies, tables, or figures. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.

Authors: We acknowledge the referee's point that the abstract's SOTA and quality claims would be easier to evaluate with supporting numbers. The full manuscript reports these results in Sections 4 and 5, with quantitative tables against unified baselines (e.g., Chameleon, Show-o) on FID, reconstruction metrics, and understanding benchmarks, plus ablations. To address the concern directly, we will revise the abstract to include a small number of key metrics supporting the claims. revision: yes
Referee: [Abstract] The description of the asymmetric dual-stream tokenizer and self-aligned paradigm (Abstract) asserts that the tokenizer serves as an internal alignment teacher 'without relying on external teachers' and preserves 'foundational visual understanding capabilities,' but no derivation, loss formulation, or verification is supplied to show that the lightweight semantic stream does not degrade discriminative power or that the pixel stream recovers details without introducing new discrepancies.

Authors: The abstract is a concise summary; the asymmetric dual-stream tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) and its motivation are derived in Section 3.1, the self-aligned diffusion paradigm with internal-teacher losses (no external models) is formulated in Section 3.2, and verification that semantic power is preserved (no drop on VQA/understanding tasks) while pixel details improve is shown via ablations and comparisons in Section 4.3. We will revise the abstract wording to avoid over-assertion and briefly note the empirical support. revision: partial

Circularity Check

0 steps flagged

No significant circularity; architecture claims are self-contained

full rationale

The abstract and method outline introduce an asymmetric dual-stream tokenizer, self-aligned diffusion using the tokenizer as internal teacher, and dynamic token routing as explicit design choices to address stated challenges of semantic-pixel discrepancy. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any central claim to its own inputs by construction. The derivation chain remains independent, with performance claims tied to asserted experimental validation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Only the abstract is available, so the ledger reflects high-level claims; full paper may introduce additional fitted parameters or domain assumptions not visible here.

axioms (2)

domain assumption Semantic perception and pixel-level reconstruction can be reconciled via an asymmetric dual-stream tokenizer in a unified compact latent space.
Invoked as the first core challenge the framework is designed to solve.
domain assumption The optimized tokenizer can serve as an internal alignment teacher for the diffusion model without external dependencies.
Central to the self-aligned generation paradigm described in the abstract.

invented entities (1)

Dynamic Token Routing no independent evidence
purpose: Enables each token to adaptively aggregate multi-layer MLLM features based on distinct semantic demands.
Introduced as a new component for flexible multimodal interaction within the unified space.

pith-pipeline@v0.9.1-grok · 5790 in / 1304 out tokens · 27376 ms · 2026-06-26T09:06:41.837401+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 33 linked inside Pith

[1]

arXiv preprint arXiv:2511.21631 (2025) 1

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

Pith/arXiv arXiv 2025
[2]

arXiv preprint arXiv:2502.13923 (2025) 11

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

Pith/arXiv arXiv 2025
[3]

Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

2025
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

2023
[5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 10

2021
[6]

Li et al

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal 16 H. Li et al. models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 10, 11, 12

Pith/arXiv arXiv 2025
[7]

Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.1809511

arXiv 2025
[8]

arXiv preprint arXiv:2410.10733 (2024) 10

Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10

arXiv 2024
[9]

arXiv preprint arXiv:2501.17811 (2025) 11, 12

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

Pith/arXiv arXiv 2025
[10]

arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

Pith/arXiv arXiv 2025
[11]

In: Forty-first international conference on machine learning (2024) 2, 12

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

2024
[12]

arXiv preprint arXiv:2306.13394 (2023) 11

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 11

Pith/arXiv arXiv 2023
[13]

arXiv preprint arXiv:2404.14396 (2024) 11

Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

Pith/arXiv arXiv 2024
[14]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

2023
[15]

arXiv preprint arXiv:2506.18898 (2025) 11, 12

Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

arXiv 2025
[16]

arXiv preprint arXiv:2504.01934 (2025) 10

Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

arXiv 2025
[17]

arXiv preprint arXiv:1312.6114 (2013) 2

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2

Pith/arXiv arXiv 2013
[18]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

2024
[19]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

2025
[20]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

Pith/arXiv arXiv 2025
[21]

arXiv preprint arXiv:2408.03326 (2024) 11

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

Pith/arXiv arXiv 2024
[22]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 11 Abbreviated paper title 17

2024
[23]

arXiv preprint arXiv:2506.03147 (2025) 4, 14

Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 4, 14

Pith/arXiv arXiv 2025
[24]

arXiv preprint arXiv:2505.05422 (2025) 11

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

arXiv 2025
[25]

Advances in neural information processing systems36, 34892–34916 (2023) 1

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

2023
[26]

arXiv preprint arXiv:2504.17761 (2025) 14

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

Pith/arXiv arXiv 2025
[27]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 11

2024
[28]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

arXiv 2025
[29]

arXiv preprint arXiv:2502.20321 (2025) 4

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

arXiv 2025
[30]

arXiv preprint arXiv:2503.07265 (2025) 12

Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

Pith/arXiv arXiv 2025
[31]

OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

2025
[32]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

2023
[33]

arXiv preprint arXiv:2307.01952 (2023) 12

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

Pith/arXiv arXiv 2023
[34]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

2025
[35]

Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

Pith/arXiv arXiv 2025
[36]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

2022
[37]

Li et al

Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014 18 H. Li et al

arXiv 2025
[38]

arXiv preprint arXiv:2503.14324 (2025) 10

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

Pith/arXiv arXiv 2025
[39]

Advances in neural information processing systems36, 49659–49678 (2023) 11

Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

2023
[40]

arXiv preprint arXiv:2406.06525 (2024) 10

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

Pith/arXiv arXiv 2024
[41]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

2024
[42]

arXiv preprint arXiv:2307.05222 (2023) 4

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

Pith/arXiv arXiv 2023
[43]

arXiv preprint arXiv:2507.23278 (2025) 4, 10

Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 10

arXiv 2025
[44]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

Pith/arXiv arXiv 2024
[45]

Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

2024
[46]

arXiv preprint arXiv:2412.14164 (2024) 11

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11

Pith/arXiv arXiv 2024
[47]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 11

2024
[48]

arXiv preprint arXiv:2302.13971 (2023) 1

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 1

Pith/arXiv arXiv 2023
[49]

Advances in neural information processing systems30(2017) 4

Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

2017
[50]

arXiv preprint arXiv:2508.18265 (2025) 1

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

Pith/arXiv arXiv 2025
[51]

arXiv preprint arXiv:2409.18869 (2024) 11, 12

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

Pith/arXiv arXiv 2024
[52]

arXiv preprint arXiv:2507.21033 (2025) 11

Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025) 11

arXiv 2025
[53]

arXiv preprint arXiv:2506.18871 (2025) 14

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 14

Pith/arXiv arXiv 2025
[54]

arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

arXiv 2025
[55]

Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration (2025),https://arxiv.org/abs/2503.2197911, 12

arXiv 2025
[56]

arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

Pith/arXiv arXiv 2024
[57]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025) 14

2025
[58]

Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024),https://arxiv.org/abs/2410.1062910, 12

Pith/arXiv arXiv 2024
[59]

arXiv preprint arXiv:2408.12528 (2024) 4, 11

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024) 4, 11

Pith/arXiv arXiv 2024
[60]

arXiv preprint arXiv:2506.15564 (2025) 12

Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv preprint arXiv:2506.15564 (2025) 12

Pith/arXiv arXiv 2025
[61]

arXiv preprint arXiv:2505.09388 (2025) 1

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

Pith/arXiv arXiv 2025
[62]

generation: Taming optimization dilemma in latent diffusion models

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025) 2, 3, 10

2025
[63]

Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

arXiv 2023
[64]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 14

2025
[65]

Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

2024
[66]

In: International Conference on Learning Representations (2025) 2, 3, 4

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 2, 3, 4

2025
[67]

arXiv preprint arXiv:2308.02490 (2023) 11

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm- vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023) 11

Pith/arXiv arXiv 2023
[68]

Li et al

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal 20 H. Li et al. understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024) 11

2024
[69]

arXiv preprint arXiv:2510.10575 (2025) 4

Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: Uniflow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025) 4

arXiv 2025
[70]

Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

2023
[71]

arXiv preprint arXiv:2512.17909 (2025) 3, 4

Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making repre- sentation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025) 3, 4

arXiv 2025
[72]

Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing within-contextgenerationinlargescalediffusiontransformer.In:TheThirty-ninth Annual Conference on Neural Information Processing Systems (2025) 14

2025
[73]

Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

2024
[74]

arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

Pith/arXiv arXiv 2025
[75]

arXiv preprint arXiv:2408.11039 (2024) 4

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024) 4

Pith/arXiv arXiv 2024
[76]

arXiv preprint arXiv:2504.10479 (2025) 10, 11

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 10, 11

Pith/arXiv arXiv 2025

[1] [1]

arXiv preprint arXiv:2511.21631 (2025) 1

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

Pith/arXiv arXiv 2025

[2] [2]

arXiv preprint arXiv:2502.13923 (2025) 11

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

Pith/arXiv arXiv 2025

[3] [3]

Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

2025

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

2023

[5] [5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 10

2021

[6] [6]

Li et al

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal 16 H. Li et al. models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 10, 11, 12

Pith/arXiv arXiv 2025

[7] [7]

Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.1809511

arXiv 2025

[8] [8]

arXiv preprint arXiv:2410.10733 (2024) 10

Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10

arXiv 2024

[9] [9]

arXiv preprint arXiv:2501.17811 (2025) 11, 12

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

Pith/arXiv arXiv 2025

[10] [10]

arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

Pith/arXiv arXiv 2025

[11] [11]

In: Forty-first international conference on machine learning (2024) 2, 12

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

2024

[12] [12]

arXiv preprint arXiv:2306.13394 (2023) 11

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 11

Pith/arXiv arXiv 2023

[13] [13]

arXiv preprint arXiv:2404.14396 (2024) 11

Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

Pith/arXiv arXiv 2024

[14] [14]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

2023

[15] [15]

arXiv preprint arXiv:2506.18898 (2025) 11, 12

Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

arXiv 2025

[16] [16]

arXiv preprint arXiv:2504.01934 (2025) 10

Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

arXiv 2025

[17] [17]

arXiv preprint arXiv:1312.6114 (2013) 2

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2

Pith/arXiv arXiv 2013

[18] [18]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

2024

[19] [19]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

2025

[20] [20]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

Pith/arXiv arXiv 2025

[21] [21]

arXiv preprint arXiv:2408.03326 (2024) 11

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

Pith/arXiv arXiv 2024

[22] [22]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 11 Abbreviated paper title 17

2024

[23] [23]

arXiv preprint arXiv:2506.03147 (2025) 4, 14

Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 4, 14

Pith/arXiv arXiv 2025

[24] [24]

arXiv preprint arXiv:2505.05422 (2025) 11

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

arXiv 2025

[25] [25]

Advances in neural information processing systems36, 34892–34916 (2023) 1

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

2023

[26] [26]

arXiv preprint arXiv:2504.17761 (2025) 14

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

Pith/arXiv arXiv 2025

[27] [27]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 11

2024

[28] [28]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

arXiv 2025

[29] [29]

arXiv preprint arXiv:2502.20321 (2025) 4

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

arXiv 2025

[30] [30]

arXiv preprint arXiv:2503.07265 (2025) 12

Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

Pith/arXiv arXiv 2025

[31] [31]

OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

2025

[32] [32]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

2023

[33] [33]

arXiv preprint arXiv:2307.01952 (2023) 12

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

Pith/arXiv arXiv 2023

[34] [34]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

2025

[35] [35]

Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

Pith/arXiv arXiv 2025

[36] [36]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

2022

[37] [37]

Li et al

Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014 18 H. Li et al

arXiv 2025

[38] [38]

arXiv preprint arXiv:2503.14324 (2025) 10

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

Pith/arXiv arXiv 2025

[39] [39]

Advances in neural information processing systems36, 49659–49678 (2023) 11

Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

2023

[40] [40]

arXiv preprint arXiv:2406.06525 (2024) 10

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

Pith/arXiv arXiv 2024

[41] [41]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

2024

[42] [42]

arXiv preprint arXiv:2307.05222 (2023) 4

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

Pith/arXiv arXiv 2023

[43] [43]

arXiv preprint arXiv:2507.23278 (2025) 4, 10

Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 10

arXiv 2025

[44] [44]

Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

Pith/arXiv arXiv 2024

[45] [45]

Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

2024

[46] [46]

arXiv preprint arXiv:2412.14164 (2024) 11

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11

Pith/arXiv arXiv 2024

[47] [47]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 11

2024

[48] [48]

arXiv preprint arXiv:2302.13971 (2023) 1

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 1

Pith/arXiv arXiv 2023

[49] [49]

Advances in neural information processing systems30(2017) 4

Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

2017

[50] [50]

arXiv preprint arXiv:2508.18265 (2025) 1

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

Pith/arXiv arXiv 2025

[51] [51]

arXiv preprint arXiv:2409.18869 (2024) 11, 12

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

Pith/arXiv arXiv 2024

[52] [52]

arXiv preprint arXiv:2507.21033 (2025) 11

Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025) 11

arXiv 2025

[53] [53]

arXiv preprint arXiv:2506.18871 (2025) 14

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 14

Pith/arXiv arXiv 2025

[54] [54]

arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

arXiv 2025

[55] [55]

Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration (2025),https://arxiv.org/abs/2503.2197911, 12

arXiv 2025

[56] [56]

arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

Pith/arXiv arXiv 2024

[57] [57]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025) 14

2025

[58] [58]

Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024),https://arxiv.org/abs/2410.1062910, 12

Pith/arXiv arXiv 2024

[59] [59]

arXiv preprint arXiv:2408.12528 (2024) 4, 11

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024) 4, 11

Pith/arXiv arXiv 2024

[60] [60]

arXiv preprint arXiv:2506.15564 (2025) 12

Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv preprint arXiv:2506.15564 (2025) 12

Pith/arXiv arXiv 2025

[61] [61]

arXiv preprint arXiv:2505.09388 (2025) 1

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

Pith/arXiv arXiv 2025

[62] [62]

generation: Taming optimization dilemma in latent diffusion models

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025) 2, 3, 10

2025

[63] [63]

Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

arXiv 2023

[64] [64]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 14

2025

[65] [65]

Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

2024

[66] [66]

In: International Conference on Learning Representations (2025) 2, 3, 4

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 2, 3, 4

2025

[67] [67]

arXiv preprint arXiv:2308.02490 (2023) 11

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm- vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023) 11

Pith/arXiv arXiv 2023

[68] [68]

Li et al

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal 20 H. Li et al. understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024) 11

2024

[69] [69]

arXiv preprint arXiv:2510.10575 (2025) 4

Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: Uniflow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025) 4

arXiv 2025

[70] [70]

Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

2023

[71] [71]

arXiv preprint arXiv:2512.17909 (2025) 3, 4

Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making repre- sentation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025) 3, 4

arXiv 2025

[72] [72]

Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing within-contextgenerationinlargescalediffusiontransformer.In:TheThirty-ninth Annual Conference on Neural Information Processing Systems (2025) 14

2025

[73] [73]

Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

2024

[74] [74]

arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

Pith/arXiv arXiv 2025

[75] [75]

arXiv preprint arXiv:2408.11039 (2024) 4

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024) 4

Pith/arXiv arXiv 2024

[76] [76]

arXiv preprint arXiv:2504.10479 (2025) 10, 11

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 10, 11

Pith/arXiv arXiv 2025