pith. sign in

arxiv: 2606.23041 · v1 · pith:S2FMRM4Jnew · submitted 2026-06-22 · 💻 cs.CV

SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

Pith reviewed 2026-06-26 09:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal modelssemantic-pixel alignmentadaptive routingdiffusion modelsunified tokenizervisual generationself-alignmentMLLM
0
0 comments X

The pith

SPAR reconciles semantic perception and pixel reconstruction in one tokenizer to let multimodal models generate and understand without external teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle visual understanding well but lag in generation because semantic features clash with the needs of pixel-level reconstruction. The paper introduces SPAR to close this gap with an asymmetric dual-stream unified tokenizer that keeps a lightweight semantic stream for strong discrimination while a transformer-augmented pixel stream adds fine details into one compact latent space. This tokenizer then serves directly as an internal teacher to align a diffusion model for generation, removing any need for outside alignment systems. Dynamic token routing further lets each token pull the right multi-layer features based on its own semantic needs. The result is a single architecture that aims to match specialized systems on both understanding and generation tasks.

Core claim

The central claim is that an asymmetric dual-stream unified tokenizer can reconcile semantic perception with pixel-level reconstruction by anchoring discriminative features in a lightweight semantic stream and recovering fine-grained details in a Transformer-augmented pixel stream within a unified compact latent space, and that this tokenizer can natively serve as an internal alignment teacher for a diffusion model in a self-aligned generation paradigm, while Dynamic Token Routing enables flexible multimodal interaction, allowing the unified model to achieve state-of-the-art generation and reconstruction quality while preserving visual understanding.

What carries the argument

Asymmetric dual-stream unified tokenizer, which anchors discriminative features in a lightweight semantic stream and recovers fine-grained visual details in a Transformer-augmented pixel stream inside one compact latent space, then used as internal teacher for diffusion.

If this is right

  • Unified models reach state-of-the-art generation and reconstruction quality.
  • Visual understanding capabilities stay intact without separate training branches.
  • Dynamic token routing lets tokens adaptively aggregate multi-layer features according to their semantic demands.
  • Generation proceeds without external alignment teachers or added dependencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the internal teacher scales reliably, multimodal training pipelines could drop separate pre-alignment stages and reduce compute overhead.
  • The routing mechanism might extend naturally to video or audio by swapping the pixel stream while keeping the semantic anchor.
  • Token-level adaptation could improve efficiency at inference time by skipping unneeded layers for simple tokens.

Load-bearing premise

The dual-stream tokenizer can deliver both high discriminative power for understanding and high-fidelity reconstruction for generation in the same space without one weakening the other.

What would settle it

A controlled ablation where the internal self-alignment produces measurably lower generation quality than an external teacher baseline on standard metrics such as FID while the semantic stream is kept fixed.

Figures

Figures reproduced from arXiv: 2606.23041 by Chenyang Zhu, Hongxiang Li, Hongxu Chen, Jiayin Cai, Long Chen, Xiaolong Jiang, Xiaoshuang Huang, Yao Hu.

Figure 1
Figure 1. Figure 1: (a) Image Reconstruction: When modeling directly within the semantic representation space, existing methods suffer from lossy compression and struggle to preserve high-frequency details. In contrast, our method effectively recovers these cru￾cial pixel-level details. (b) Representation Alignment Paradigm: Unlike existing approaches that rely on external semantic encoders to guide the generative model, our … view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of the semantic-pixel self-aligned unified tokenizer. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the unified multimodal model. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results of image generation. Compared with the second-best OmniGen2 (3.44), SPAR-3B yields an absolute improvement of +0.58. Across specific editing categories, SPAR-3B achieves the best performance on various subtypes, with particularly pronounced advantages on semantically demanding tasks, indicating that the rich semantic representa￾tions within the SPAR tokenizer effectively enhance the edi… view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring \textbf{S}emantic-\textbf{P}ixel self-alignment and \textbf{A}daptive \textbf{R}outing (\textbf{SPAR}). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes SPAR, a unified multimodal framework to bridge semantic perception and pixel-level reconstruction in MLLMs. It introduces an asymmetric dual-stream unified tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) to create a compact latent space, a self-aligned diffusion paradigm that uses this tokenizer as an internal teacher without external dependencies, and Dynamic Token Routing to let tokens adaptively aggregate multi-layer MLLM features. The central claim is that SPAR achieves SOTA performance among unified architectures on generation and reconstruction while preserving visual understanding capabilities.

Significance. If the experimental validation holds, the self-aligned internal-teacher approach and asymmetric tokenizer design could meaningfully advance unified multimodal models by reducing reliance on external alignment mechanisms and addressing the semantic-pixel discrepancy directly in the architecture. The explicit motivation of design choices (lightweight semantic anchor, pixel augmentation) is a strength.

major comments (2)
  1. [Abstract] Abstract: The claim that SPAR 'establishes the state-of-the-art for unified architectures' and achieves 'exceptional generation and reconstruction quality' is presented without any metrics, baselines, ablation studies, tables, or figures. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.
  2. [Abstract] The description of the asymmetric dual-stream tokenizer and self-aligned paradigm (Abstract) asserts that the tokenizer serves as an internal alignment teacher 'without relying on external teachers' and preserves 'foundational visual understanding capabilities,' but no derivation, loss formulation, or verification is supplied to show that the lightweight semantic stream does not degrade discriminative power or that the pixel stream recovers details without introducing new discrepancies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point regarding the abstract below and outline revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that SPAR 'establishes the state-of-the-art for unified architectures' and achieves 'exceptional generation and reconstruction quality' is presented without any metrics, baselines, ablation studies, tables, or figures. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.

    Authors: We acknowledge the referee's point that the abstract's SOTA and quality claims would be easier to evaluate with supporting numbers. The full manuscript reports these results in Sections 4 and 5, with quantitative tables against unified baselines (e.g., Chameleon, Show-o) on FID, reconstruction metrics, and understanding benchmarks, plus ablations. To address the concern directly, we will revise the abstract to include a small number of key metrics supporting the claims. revision: yes

  2. Referee: [Abstract] The description of the asymmetric dual-stream tokenizer and self-aligned paradigm (Abstract) asserts that the tokenizer serves as an internal alignment teacher 'without relying on external teachers' and preserves 'foundational visual understanding capabilities,' but no derivation, loss formulation, or verification is supplied to show that the lightweight semantic stream does not degrade discriminative power or that the pixel stream recovers details without introducing new discrepancies.

    Authors: The abstract is a concise summary; the asymmetric dual-stream tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) and its motivation are derived in Section 3.1, the self-aligned diffusion paradigm with internal-teacher losses (no external models) is formulated in Section 3.2, and verification that semantic power is preserved (no drop on VQA/understanding tasks) while pixel details improve is shown via ablations and comparisons in Section 4.3. We will revise the abstract wording to avoid over-assertion and briefly note the empirical support. revision: partial

Circularity Check

0 steps flagged

No significant circularity; architecture claims are self-contained

full rationale

The abstract and method outline introduce an asymmetric dual-stream tokenizer, self-aligned diffusion using the tokenizer as internal teacher, and dynamic token routing as explicit design choices to address stated challenges of semantic-pixel discrepancy. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any central claim to its own inputs by construction. The derivation chain remains independent, with performance claims tied to asserted experimental validation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Only the abstract is available, so the ledger reflects high-level claims; full paper may introduce additional fitted parameters or domain assumptions not visible here.

axioms (2)
  • domain assumption Semantic perception and pixel-level reconstruction can be reconciled via an asymmetric dual-stream tokenizer in a unified compact latent space.
    Invoked as the first core challenge the framework is designed to solve.
  • domain assumption The optimized tokenizer can serve as an internal alignment teacher for the diffusion model without external dependencies.
    Central to the self-aligned generation paradigm described in the abstract.
invented entities (1)
  • Dynamic Token Routing no independent evidence
    purpose: Enables each token to adaptively aggregate multi-layer MLLM features based on distinct semantic demands.
    Introduced as a new component for flexible multimodal interaction within the unified space.

pith-pipeline@v0.9.1-grok · 5790 in / 1304 out tokens · 27376 ms · 2026-06-26T09:06:41.837401+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

76 extracted references · 33 linked inside Pith

  1. [1]

    arXiv preprint arXiv:2511.21631 (2025) 1

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    arXiv preprint arXiv:2502.13923 (2025) 11

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

  3. [3]

    Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025),https://bfl.ai/research/representation- comparison3

  4. [4]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 14

  5. [5]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 10

  6. [6]

    Li et al

    Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal 16 H. Li et al. models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 4, 10, 11, 12

  7. [7]

    Chen, J., Cai, Z., Chen, P., Chen, S., Ji, K., Wang, X., Yang, Y., Wang, B.: Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image genera- tion (2025),https://arxiv.org/abs/2506.1809511

  8. [8]

    arXiv preprint arXiv:2410.10733 (2024) 10

    Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., Han, S.: Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733 (2024) 10

  9. [9]

    arXiv preprint arXiv:2501.17811 (2025) 11, 12

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 11, 12

  10. [10]

    arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 11, 12, 14

  11. [11]

    In: Forty-first international conference on machine learning (2024) 2, 12

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 2, 12

  12. [12]

    arXiv preprint arXiv:2306.13394 (2023) 11

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 11

  13. [13]

    arXiv preprint arXiv:2404.14396 (2024) 11

    Ge,Y.,Zhao,S.,Zhu,J.,Ge,Y.,Yi,K.,Song,L.,Li,C.,Ding,X.,Shan,Y.:Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024) 11

  14. [14]

    Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 12

  15. [15]

    arXiv preprint arXiv:2506.18898 (2025) 11, 12

    Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. arXiv preprint arXiv:2506.18898 (2025) 11, 12

  16. [16]

    arXiv preprint arXiv:2504.01934 (2025) 10

    Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 10

  17. [17]

    arXiv preprint arXiv:1312.6114 (2013) 2

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 2

  18. [18]

    Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 12

  19. [19]

    Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2

  20. [20]

    Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

  21. [21]

    arXiv preprint arXiv:2408.03326 (2024) 11

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: Seed- bench: Benchmarking multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13299– 13308 (2024) 11 Abbreviated paper title 17

  23. [23]

    arXiv preprint arXiv:2506.03147 (2025) 4, 14

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025) 4, 14

  24. [24]

    arXiv preprint arXiv:2505.05422 (2025) 11

    Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 11

  25. [25]

    Advances in neural information processing systems36, 34892–34916 (2023) 1

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

  26. [26]

    arXiv preprint arXiv:2504.17761 (2025) 14

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 14

  27. [27]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 11

  28. [28]

    Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-source project toward democratizing auto-regressive visual generation (2025), https://arxiv.org/abs/2409.0441010

  29. [29]

    arXiv preprint arXiv:2502.20321 (2025) 4

    Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 4

  30. [30]

    arXiv preprint arXiv:2503.07265 (2025) 12

    Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Ning, K., Feng, C., Zhu, B., Yuan, L.: Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025) 12

  31. [31]

    OpenAI: Introducing 4o Image Generation (2025),https://openai.com/index/ introducing-4o-image-generation/14

  32. [32]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

  33. [33]

    arXiv preprint arXiv:2307.01952 (2023) 12

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 12

  34. [34]

    In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 2545–2555 (2025) 4, 10, 12

  35. [35]

    Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

  36. [36]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 10

  37. [37]

    Li et al

    Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder (2025),https: //arxiv.org/abs/2510.153014 18 H. Li et al

  38. [38]

    arXiv preprint arXiv:2503.14324 (2025) 10

    Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 10

  39. [39]

    Advances in neural information processing systems36, 49659–49678 (2023) 11

    Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems36, 49659–49678 (2023) 11

  40. [40]

    arXiv preprint arXiv:2406.06525 (2024) 10

    Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 10

  41. [41]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14398–14409 (2024) 4, 10

  42. [42]

    arXiv preprint arXiv:2307.05222 (2023) 4

    Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023) 4

  43. [43]

    arXiv preprint arXiv:2507.23278 (2025) 4, 10

    Tang, H., Xie, C., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 4, 10

  44. [44]

    Team,C.:Chameleon:Mixed-modalearly-fusionfoundationmodels.arXivpreprint arXiv:2405.09818 (2024) 4, 11

  45. [45]

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalableimagegenerationvianext-scaleprediction.Advancesinneuralinformation processing systems37, 84839–84865 (2024) 2, 10

  46. [46]

    arXiv preprint arXiv:2412.14164 (2024) 11

    Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024) 11

  47. [47]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9568–9578 (2024) 11

  48. [48]

    arXiv preprint arXiv:2302.13971 (2023) 1

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 1

  49. [49]

    Advances in neural information processing systems30(2017) 4

    Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems30(2017) 4

  50. [50]

    arXiv preprint arXiv:2508.18265 (2025) 1

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

  51. [51]

    arXiv preprint arXiv:2409.18869 (2024) 11, 12

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024) 11, 12

  52. [52]

    arXiv preprint arXiv:2507.21033 (2025) 11

    Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt- image-edit-1.5 m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025) 11

  53. [53]

    arXiv preprint arXiv:2506.18871 (2025) 14

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 14

  54. [54]

    arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

    Wu, S., Wu, Z., Gong, Z., Tao, Q., Jin, S., Li, Q., Li, W., Loy, C.C.: Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661 (2025) 4, 12 Abbreviated paper title 19

  55. [55]

    Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and gen- eration (2025),https://arxiv.org/abs/2503.2197911, 12

  56. [56]

    arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

    Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024) 4, 10, 11, 12

  57. [57]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025) 14

  58. [58]

    Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., Han, S.: Sana: Efficient high-resolution image synthesis with linear diffusion transformer (2024),https://arxiv.org/abs/2410.1062910, 12

  59. [59]

    arXiv preprint arXiv:2408.12528 (2024) 4, 11

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. arXiv preprint arXiv:2408.12528 (2024) 4, 11

  60. [60]

    arXiv preprint arXiv:2506.15564 (2025) 12

    Xie,J.,Yang,Z.,Shou,M.Z.:Show-o2:Improvednativeunifiedmultimodalmodels. arXiv preprint arXiv:2506.15564 (2025) 12

  61. [61]

    arXiv preprint arXiv:2505.09388 (2025) 1

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  62. [62]

    generation: Taming optimization dilemma in latent diffusion models

    Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025) 2, 3, 10

  63. [63]

    Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S., Ross, C., Polyak, A., Howes, R., Sharma, V., Xu, P., Tamoyan, H., Ashual, O., Singer, U., Li, S.W., Zhang, S., James, R., Ghosh, G., Taigman, Y., Fazel-Zarandi, M., Celikyilmaz, A., Zettlemoyer, L., Aghajanyan, A.: Scaling autoregressive multi-...

  64. [64]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 14

  65. [65]

    Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

    Yu, Q., Weber, M., Deng, X., Shen, X., Cremers, D., Chen, L.C.: An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems37, 128940–128966 (2024) 2

  66. [66]

    In: International Conference on Learning Representations (2025) 2, 3, 4

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 2, 3, 4

  67. [67]

    arXiv preprint arXiv:2308.02490 (2023) 11

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm- vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023) 11

  68. [68]

    Li et al

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal 20 H. Li et al. understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024) 11

  69. [69]

    arXiv preprint arXiv:2510.10575 (2025) 4

    Yue, Z., Zhang, H., Zeng, X., Chen, B., Wang, C., Zhuang, S., Dong, L., Du, K., Wang, Y., Wang, L., et al.: Uniflow: A unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575 (2025) 4

  70. [70]

    Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 14

  71. [71]

    arXiv preprint arXiv:2512.17909 (2025) 3, 4

    Zhang, S., Zhang, H., Zhang, Z., Ge, C., Xue, S., Liu, S., Ren, M., Kim, S.Y., Zhou, Y., Liu, Q., et al.: Both semantics and reconstruction matter: Making repre- sentation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909 (2025) 3, 4

  72. [72]

    Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: Enabling instructional image editing within-contextgenerationinlargescalediffusiontransformer.In:TheThirty-ninth Annual Conference on Neural Information Processing Systems (2025) 14

  73. [73]

    Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

    Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems37, 3058–3093 (2024) 14

  74. [74]

    arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

    Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025) 2, 4, 10

  75. [75]

    arXiv preprint arXiv:2408.11039 (2024) 4

    Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024) 4

  76. [76]

    arXiv preprint arXiv:2504.10479 (2025) 10, 11

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 10, 11