
arxiv: 2606.23041 · v2 · pith:S2FMRM4Jnew · submitted 2026-06-22 · 💻 cs.CV

SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

Pith reviewed 2026-07-03 23:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · semantic-pixel self-alignment · dual-stream tokenizer · adaptive routing · diffusion models · visual generation · MLLM · self-aligned generation

The pith

A dual-stream tokenizer with internal self-alignment unifies semantic understanding and pixel generation in one multimodal model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the mismatch between semantic perception and pixel reconstruction that limits multimodal large language models in generation tasks. It introduces an asymmetric dual-stream unified tokenizer where one lightweight stream anchors discriminative semantic features and a second Transformer-augmented stream recovers fine-grained pixel details, both feeding a single compact latent space. A self-aligned generation paradigm then uses this tokenizer directly as an internal teacher to train a diffusion model, eliminating external alignment models. Dynamic token routing further lets each token pull relevant features from multiple MLLM layers according to its own needs. If successful, the approach would let a single architecture deliver high-quality image generation and reconstruction while retaining strong visual understanding, reaching state-of-the-art results among unified systems.

Core claim

SPAR establishes a unified multimodal framework through semantic-pixel self-alignment and adaptive routing. The asymmetric dual-stream unified tokenizer reconciles semantic perception with pixel-level reconstruction by combining a lightweight semantic stream and a Transformer-augmented pixel stream in a unified compact latent space. The self-aligned generation paradigm uses the optimized tokenizer as an internal alignment teacher for the diffusion model, with no external dependencies. Dynamic token routing enables each token to adaptively aggregate multi-layer MLLM features based on its semantic demands. Extensive experiments show this yields state-of-the-art performance for unified archit

What carries the argument

The asymmetric dual-stream unified tokenizer, in which a lightweight stream anchors discriminative semantic features while a Transformer-augmented stream recovers fine-grained pixel details, with both feeding one compact latent space.
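As a rough illustration of the asymmetry described above, the sketch below pairs a cheap single-projection "semantic" path with a deeper "pixel" path and fuses them into one latent per patch. Everything in it is an assumption made for illustration: the function names, the plain linear maps standing in for Transformer blocks, the fusion by addition, and all dimensions. The text available here does not specify the actual layers or fusion rule.

```python
import random

def linear(x, w):
    """Multiply a vector x (length d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def make_weights(d_in, d_out, rng):
    """Small random weight matrix; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]

def dual_stream_tokenize(patches, w_sem, w_pix_in, w_pix_out):
    """Hypothetical dual-stream encoder returning one compact latent per patch."""
    latents = []
    for p in patches:
        # Lightweight semantic stream: a single projection.
        sem = linear(p, w_sem)
        # Heavier pixel stream: hidden layer + ReLU (stand-in for Transformer blocks).
        hidden = [max(0.0, h) for h in linear(p, w_pix_in)]
        pix = linear(hidden, w_pix_out)
        # Fuse both streams into the shared latent (addition is an assumption).
        latents.append([s + q for s, q in zip(sem, pix)])
    return latents

rng = random.Random(0)
patch_dim, hidden_dim, latent_dim = 12, 32, 4
patches = [[rng.random() for _ in range(patch_dim)] for _ in range(6)]
latents = dual_stream_tokenize(
    patches,
    make_weights(patch_dim, latent_dim, rng),
    make_weights(patch_dim, hidden_dim, rng),
    make_weights(hidden_dim, latent_dim, rng),
)
print(len(latents), len(latents[0]))  # 6 patches, each mapped to a 4-dim latent
```

The point of the sketch is only the shape of the design: two unequal paths, one shared output space. Whether both capabilities survive in that space is exactly the empirical question the page raises.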

If this is right

  • Unified multimodal models can perform both high-fidelity image generation and accurate visual understanding without separate specialized components.
  • The diffusion model aligns to semantic spaces using only internal tokenizer feedback, removing the need for external teacher models.
  • Dynamic token routing allows flexible aggregation of features from different MLLM layers for each token according to its distinct demands.
  • The framework achieves state-of-the-art results among unified architectures while preserving core visual understanding capabilities.
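The routing idea in the list above can be sketched as a per-token softmax over layers. This is a minimal sketch under stated assumptions, not the paper's mechanism: `route_token`, the dot-product scoring, and the single router vector `router_w` are hypothetical, since the text here does not say how routing scores are computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(layer_feats, router_w):
    """Mix one token's features across layers.

    layer_feats: list of L feature vectors, one per MLLM layer, for a single token.
    router_w: hypothetical router vector; a layer's score is dot(router_w, feature).
    Returns (mixed_feature, routing_weights).
    """
    scores = [sum(w * f for w, f in zip(router_w, feat)) for feat in layer_feats]
    weights = softmax(scores)
    dim = len(layer_feats[0])
    mixed = [
        sum(weights[l] * layer_feats[l][d] for l in range(len(layer_feats)))
        for d in range(dim)
    ]
    return mixed, weights

# One token seen at three layers, feature dimension 2.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_u, weights_u = route_token(feats, [0.0, 0.0])  # zero router: uniform mix
mixed_r, weights_r = route_token(feats, [2.0, 0.0])  # router favors layers 0 and 2
```

Because the scores depend on the token's own features, two tokens in the same image can end up with different layer mixtures, which is the behavior the bullet describes.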

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The internal self-alignment could lower training costs by avoiding separate teacher networks for alignment.
  • The single latent space might support downstream tasks such as language-guided image editing within the same model without additional adapters.
  • Applying the dual-stream design to video or audio modalities could test whether the same unification pattern holds beyond static images.
  • Direct measurement of token routing decisions on held-out image categories would show whether the adaptive aggregation improves robustness on diverse inputs.
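One way the routing measurement suggested in the last bullet might be started is to summarize how concentrated each token's layer distribution is, grouped by image category. The sketch below uses Shannon entropy for that; the record format, category names, and function names are hypothetical. Entropy alone says nothing about robustness, so it would have to be paired with per-category accuracy or generation metrics.

```python
import math
from collections import defaultdict

def routing_entropy(weights):
    """Shannon entropy (in nats) of one token's layer-routing distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def mean_entropy_by_category(records):
    """Average routing entropy per category.

    records: iterable of (category, routing_weights) pairs (hypothetical format).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for category, weights in records:
        sums[category] += routing_entropy(weights)
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

records = [
    ("animals", [1.0, 0.0, 0.0]),                    # all mass on one layer: entropy 0
    ("animals", [0.5, 0.5, 0.0]),                    # two layers equally: ln 2
    ("text-heavy", [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]),  # uniform over three layers: ln 3
]
summary = mean_entropy_by_category(records)
```

Low entropy means a token leans on a single layer; entropy near ln(L) means it spreads evenly over L layers.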

Load-bearing premise

The asymmetric dual-stream unified tokenizer can anchor discriminative semantic features and recover fine-grained pixel details simultaneously in a single compact latent space without external supervision or loss of either capability.

What would settle it

A controlled comparison in which a model built with the SPAR tokenizer shows either substantially lower visual understanding accuracy than standard MLLMs or substantially lower generation and reconstruction quality than specialized diffusion models would disprove the central claim.

Figures

Figures reproduced from arXiv: 2606.23041 by Chenyang Zhu, Hongxiang Li, Hongxu Chen, Jiayin Cai, Long Chen, Xiaolong Jiang, Xiaoshuang Huang, Yao Hu.

Figure 1: (a) Image Reconstruction: When modeling directly within the semantic representation space, existing methods suffer from lossy compression and struggle to preserve high-frequency details. In contrast, our method effectively recovers these crucial pixel-level details. (b) Representation Alignment Paradigm: Unlike existing approaches that rely on external semantic encoders to guide the generative model, our …

Figure 2: Architecture of the semantic-pixel self-aligned unified tokenizer.

Figure 3: Overview of the unified multimodal model.

Figure 4: Qualitative results of image generation. Compared with the second-best OmniGen2 (3.44), SPAR-3B yields an absolute improvement of +0.58. Across specific editing categories, SPAR-3B achieves the best performance on various subtypes, with particularly pronounced advantages on semantically demanding tasks, indicating that the rich semantic representations within the SPAR tokenizer effectively enhance the edi…
Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.
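The abstract does not state the form of the self-alignment objective. One plausible reading, offered here only as a sketch, is a feature-alignment loss in which the diffusion model's intermediate features are pulled toward the tokenizer's features, with the tokenizer acting as a fixed teacher. The cosine form, the function names, and the per-token pairing below are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def self_alignment_loss(diffusion_feats, tokenizer_feats):
    """1 - mean cosine similarity between paired per-token features.

    tokenizer_feats play the internal-teacher role; in an autograd framework
    they would be treated as constants (no gradient flows into the teacher).
    """
    sims = [cosine(d, t) for d, t in zip(diffusion_feats, tokenizer_feats)]
    return 1.0 - sum(sims) / len(sims)

teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[2.0, 0.0], [0.0, 1.0]]     # same directions, different scale
orthogonal_student = [[0.0, 1.0], [1.0, 0.0]]  # perpendicular to the teacher

loss_aligned = self_alignment_loss(aligned_student, teacher)
loss_orthogonal = self_alignment_loss(orthogonal_student, teacher)
```

The loss is near 0 when student features point the same way as the teacher's and 1 when they are orthogonal. What distinguishes the proposed paradigm from external-teacher setups is where `tokenizer_feats` come from (the model's own tokenizer), not the particular distance used.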

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SPAR, a unified multimodal framework for MLLMs that addresses the discrepancy between semantic perception and pixel-level reconstruction. It proposes an asymmetric dual-stream unified tokenizer (lightweight semantic stream plus Transformer-augmented pixel stream) to create a compact latent space, a self-aligned generation paradigm that uses the tokenizer as an internal teacher for the diffusion model, and Dynamic Token Routing to aggregate multi-layer MLLM features adaptively. The abstract claims this yields SOTA performance on generation, reconstruction, and visual understanding without external supervision.

Significance. If the quantitative results and ablations hold, the work would be significant for enabling self-contained unified architectures that avoid external teachers while preserving both discriminative and generative capabilities, a key open challenge in multimodal modeling.

major comments (2)
  1. [Abstract] The central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.
  2. [Abstract] The assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments on the abstract below, clarifying where the supporting material appears in the full manuscript while noting that abstracts are necessarily concise.

  1. Referee: [Abstract] The central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.

    Authors: The loss formulation for the asymmetric dual-stream tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) and the self-aligned generation paradigm (using the tokenizer as internal teacher) is derived in Sections 3.1 and 3.2. The objective combines semantic anchoring losses with pixel reconstruction terms without external models. Ablations confirming preservation of both capabilities appear in Section 4.3 (Table 4), showing performance retention on understanding benchmarks when generation is enabled. We can add a parenthetical reference to these sections in a revised abstract if the editor prefers. revision: partial

  2. Referee: [Abstract] The assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.

    Authors: Quantitative results, metrics (FID, PSNR, CLIP score, VQA accuracy, etc.), baselines, and experimental setup are reported in Section 4, with direct comparisons in Tables 1 (generation), 2 (reconstruction), and 5 (understanding). The setup details (datasets, training protocol, evaluation protocols) are in Section 4.1. These tables support the SOTA claims for unified models. If the referee did not locate these sections, we can improve cross-referencing from the abstract. revision: no
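Of the metrics named in the response above, PSNR is the one that can be stated exactly in a few lines, so a minimal sketch is given here for readers checking reconstruction numbers. The standard definition is 10 · log10(MAX² / MSE). The evaluation protocol used in the paper (resolution, color space, data range) is not given on this page, so `max_value=255.0` and the flat pixel lists are assumptions.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [0.0, 128.0, 255.0, 64.0]
off_by_one = [1.0, 127.0, 254.0, 65.0]  # every pixel differs by exactly 1, so MSE = 1

score = psnr(reference, off_by_one)   # equals 20 * log10(255), about 48.13 dB
perfect = psnr(reference, reference)  # math.inf
```

Higher is better, and differences of a fraction of a dB are sensitive to the data range and resolution, which is one reason the referee's request for the experimental setup matters when comparing tokenizers.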

Circularity Check

0 steps flagged

No significant circularity identified


The paper proposes an asymmetric dual-stream tokenizer and self-aligned generation paradigm as novel architectural choices that reconcile semantic and pixel features without external supervision. No equations, fitted parameters, or self-citations are visible in the provided text that reduce any claimed prediction or result to an input by construction. The design elements (Dynamic Token Routing, internal alignment teacher) are presented as independent innovations rather than renamings or self-referential definitions. The derivation chain remains self-contained against external benchmarks, with the central assumption being an empirical design claim rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no equations, training details, or component specifications are given, so no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5790 in / 1064 out tokens · 13196 ms · 2026-07-03T23:13:33.192682+00:00 · methodology


Reference graph

Works this paper leans on

85 extracted references · 55 canonical work pages · 36 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., 16 H. Li et al. Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

  3. [3]

    Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

  4. [4]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

  5. [5]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 11

  6. [6]

    Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

    Chen, H., Li, H., Wang, Z., Chen, L.: Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint arXiv:2601.21542 (2026) 2

  7. [7]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 11, 12

  8. [8]

    Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.180954, 11

  9. [9]

    arXiv preprint arXiv:2410.10733 (2024) 10

    Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10, 11

  10. [10]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

  11. [11]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 5, 11, 12, 14

  12. [12]

    In: Forty-first international conference on machine learning (2024) 2, 12

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

  13. [13]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 12

  14. [14]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

  15. [15]

    Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

  16. [16]

    arXiv preprint arXiv:2506.18898 (2025) 11, 12

    Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

  17. [17]

    Illume+: Il- luminating unified mllm with dual visual tokenization and diffusion refinement.arXiv preprint arXiv:2504.01934,

    Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

  18. [18]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 17

  19. [19]

    Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

  20. [20]

    Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

  21. [21]

    Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

  22. [22]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 12

  24. [24]

    In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

    Li, H., Li, Y., Lin, B., Niu, Y., Yang, Y., Huang, X., Cai, J., Jiang, X., Hu, Y., Chen, L.: GIR-bench: Versatile benchmark for generating images with reasoning. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=4c1gAsVd9C4

  25. [25]

    In: International Conference on Learning Representations

    Li, H., Li, Y., Yang, Y., Cao, J., Zhu, Z., Cheng, X., Chen, L.: Dispose: Disen- tangling pose guidance for controllable human image animation. In: International Conference on Learning Representations. vol. 2025, pp. 72213–72231 (2025) 2

  26. [26]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 5, 14

  27. [27]

    arXiv preprint arXiv:2505.05422 (2025) 11

    Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

  28. [28]

    Advances in neural information processing systems36, 34892–34916 (2023) 1

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

  29. [29]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

  30. [30]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 12

  31. [31]

    Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

  32. [32]

    Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321,

    Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

  33. [33]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

  34. [34]

    OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

  35. [35]

    Li et al

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, 18 H. Li et al. G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual f...

  36. [36]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

  37. [37]

    In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

  38. [38]

    Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

  40. [40]

    Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014

  41. [41]

    DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

    Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

  42. [42]

    Advances in neural information processing systems36, 49659–49678 (2023) 11

    Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

  43. [43]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

  44. [44]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

  45. [45]

    Emu: Generative Pretraining in Multimodality

    Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

  46. [46]

    Unilip: Adapting clip for unified multimodal understanding, generation and editing

    Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 5, 10

  47. [47]

    Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

  48. [48]

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

  49. [49]

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11 Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Models 19

  50. [50]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 12

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 2

  52. [52]

    Advances in neural information processing systems30(2017) 4

    Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

  53. [53]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

  54. [54]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

  55. [55]

    LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

    Wang, Y., Chen, H., Liu, J., He, Z., Liu, R., Wang, Z., Chen, L.: Lisa: Likeli- hood score alignment for visual-condition controllable generation. arXiv preprint arXiv:2606.27192 (2026) 2

  56. [56]

    arXiv preprint arXiv:2603.12057 (2026) 2

    Wang, Y., Jiang, Z., Wang, Z., Chen, L.: Coarse-guided visual generation via weighted h-transform sampling. arXiv preprint arXiv:2603.12057 (2026) 2

  57. [57]

    arXiv preprint arXiv:2510.20212 (2025) 2

    Wang, Y., Wang, Z., Chen, L.: Target-aware image editing via cycle-consistent constraints. arXiv preprint arXiv:2510.20212 (2025) 2

  58. [58]

    arXiv preprint arXiv:2507.21033 (2025) 11

    Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025) 11

  59. [59]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) 2

  60. [60]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 14

  61. [61]

    arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

    Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025) 4, 12

  62. [62]

    Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration (2025),https://arxiv.org/abs/2503.2197911, 12

  63. [63]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

  64. [64]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025) 14

  65. [65]

    Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024),https://arxiv.org/abs/2410.1062910, 12

  66. [66]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024) 4, 11

  67. [67]

    Show-o2: Improved Native Unified Multimodal Models

    Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564 (2025)

  68. [68]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  69. [69]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025)

  70. [70]

    Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

  71. [71]

    Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025)

  72. [72]

    Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37, 128940–128966 (2024)

  73. [73]

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025)

  74. [74]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)

  75. [75]

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024)

  76. [76]

    Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: Uniflow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025)

  77. [77]

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)

  78. [78]

    Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025)

  79. [79]

    Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  80. [80]

    Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024)

Showing first 80 references.