pith. sign in

arxiv: 2605.18115 · v1 · pith:3ASE6HNUnew · submitted 2026-05-18 · 💻 cs.CV

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Pith reviewed 2026-05-20 12:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual tokenizersemantic tokensimage reconstructionvisual understandingimage generationtoken distillationhybrid modelunified visual tokenizer
0
0 comments X

The pith

WinTok adds learnable semantic tokens alongside pixel tokens to separate high-level understanding from low-level reconstruction in one tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes WinTok as a hybrid visual tokenizer that keeps standard pixel tokens for reconstruction but adds a separate set of learnable semantic tokens to handle understanding tasks. This explicit split reduces the conflict that occurs when a single token space must serve both pixel-level detail and semantic abstraction. Guidance comes through an asymmetric distillation step that pulls the semantic tokens toward embeddings from any pretrained visual foundation model. The result is measurable gains in reconstruction, classification, and generation benchmarks while training on only 50 million open-source images.

Core claim

WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers, and achieves consistent improvements across reconstruction, understanding, and generation.

What carries the argument

Hybrid tokenizer that pairs pixel tokens with learnable semantic tokens under asymmetric token distillation from pretrained embeddings.

Load-bearing premise

Pretrained semantic embeddings can guide the learnable tokens to gain strong discriminative power without interfering with pixel-level reconstruction.

What would settle it

A controlled ablation in which removing the semantic tokens or the distillation step produces no drop in classification accuracy or generation quality on the same benchmarks.

Figures

Figures reproduced from arXiv: 2605.18115 by Canmiao Fu, Chen Li, Jing Lyu, Shaobin Zhuang, Yali Wang, Yiwei Guo, Zhipeng Huang.

Figure 1
Figure 1. Figure 1: Comparsion of different tokenization paradigms. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our WinTok. Our WinTok adopts a hybrid tokenization paradigm integrated with learnable tokens. The separate token sets are optimized for visual understanding and generation respectively, with asymmetric token distillation for semantic tokens and image reconstruction for pixel tokens. Therefore, we mitigate the representation conflict while maintaining a unified tokenizer. More specifically, we … view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results demonstrating the superior performance of our [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparisons of using different tokenization strategies. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablations on design choices.(a) As the learnable token number increases, both the reconstruction quality and semantic capability improve. (b) Our Win￾Tok demonstrates similar trends when adopting different semantic teachers, while SigLIP2 [83] benefits the most. (c) A larger decoder not only enhances reconstruc￾tion quality but also improves generation performance. Implementation details and more comprehen… view at source ↗
Figure 6
Figure 6. Figure 6: Discussions. (a) Reversing the roles of the two token types results in signif￾icant performance degradation, and WinTok is more rationale. (b) WinTok produces more discriminative clusters compared to other tokenizers, even with only 64 tokens representing the global semantics. in our setting. Ultimately, this ablation confirms the WinTok design, highlight￾ing the necessity of strictly disentangling semanti… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results of visual reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative results of multimodal understanding. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative results of visual generation. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative results of image classification. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
read the original abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes WinTok, a hybrid visual tokenizer that augments standard pixel tokens with a small set of learnable semantic tokens. An asymmetric distillation loss transfers discriminative knowledge from any pretrained visual foundation model into the semantic tokens while the pixel tokens continue to optimize reconstruction. The design is claimed to eliminate cross-task interference without the cost of separate tokenizers. Experiments on 10 benchmarks, trained on 50 M open-source images, report an 11.2 % gain in classification accuracy over UniTok and a competitive rFID of 0.41, with code released.

Significance. If the reported gains prove robust, the work supplies a lightweight, single-pass route to joint understanding and generation that avoids the parameter and compute overhead of dual-tokenizer architectures. The explicit use of transferable semantic tokens and the open release of code and training data constitute concrete strengths for reproducibility and downstream adoption in unified multimodal models.

major comments (2)
  1. [§4, Table 2] §4 (Experiments) and Table 2: the central claim of consistent cross-task improvement rests on comparisons to UniTok, yet the manuscript provides no statistical significance tests, no repeated runs with different seeds, and no explicit statement of the train/validation splits used for the 50 M image corpus. These omissions make it impossible to judge whether the 11.2 % classification lift is load-bearing or sensitive to implementation details.
  2. [§3.2] §3.2 (Asymmetric Token Distillation): the description of the distillation objective does not specify the weighting hyper-parameter between the semantic guidance loss and the pixel reconstruction loss, nor does it report an ablation on this trade-off. Because the paper’s main thesis is that the two objectives can be decoupled without interference, the absence of this control directly affects the credibility of the win-win claim.
minor comments (2)
  1. [Figure 3] Figure 3: the legend and axis labels are too small to read at print size; please enlarge or split into two panels.
  2. [§2] §2 (Related Work): the discussion of prior hybrid tokenizers omits the recent work on semantic-pixel disentanglement in [citation needed]; a brief comparison would help situate the contribution.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for providing detailed and insightful feedback. Below we respond to each major comment and indicate the changes we plan to make in the revised manuscript.

read point-by-point responses
  1. Referee: [§4, Table 2] §4 (Experiments) and Table 2: the central claim of consistent cross-task improvement rests on comparisons to UniTok, yet the manuscript provides no statistical significance tests, no repeated runs with different seeds, and no explicit statement of the train/validation splits used for the 50 M image corpus. These omissions make it impossible to judge whether the 11.2 % classification lift is load-bearing or sensitive to implementation details.

    Authors: We agree that additional statistical analysis would be beneficial. However, performing multiple full training runs on 50M images is computationally prohibitive at this scale. Instead, we have added a note in the revised manuscript emphasizing the consistency of improvements across the 10 diverse benchmarks as evidence of robustness. We have also explicitly described the composition and splits of the 50M image corpus in Section 4.1, clarifying that it uses the standard training partitions from the source datasets (ImageNet-21K, LAION subset, etc.) without a held-out validation set for the tokenizer pretraining stage. revision: partial

  2. Referee: [§3.2] §3.2 (Asymmetric Token Distillation): the description of the distillation objective does not specify the weighting hyper-parameter between the semantic guidance loss and the pixel reconstruction loss, nor does it report an ablation on this trade-off. Because the paper’s main thesis is that the two objectives can be decoupled without interference, the absence of this control directly affects the credibility of the win-win claim.

    Authors: We thank the referee for noting this omission. The distillation objective is formulated as L_total = L_pixel + λ * L_semantic, with λ set to 0.1 in our experiments. This weighting was determined through initial tuning to achieve the reported balance between reconstruction and understanding. We have now included this hyper-parameter value and its justification in Section 3.2. Furthermore, we have added an ablation study on the effect of λ in the supplementary material, demonstrating that the hybrid design yields improvements over a range of λ values, supporting the claim of minimal cross-task interference. revision: yes

standing simulated objections not resolved
  • Lack of multiple repeated runs with different seeds due to computational constraints of training on 50M images.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper proposes an empirical hybrid tokenizer architecture that decouples pixel reconstruction from semantic understanding via learnable semantic tokens and asymmetric distillation from an external foundation model. All performance claims (e.g., +11.2% classification accuracy over UniTok, rFID 0.41) are validated on held-out benchmarks using 50M open-source images rather than reducing to quantities defined by internal fitted parameters or self-referential equations. No load-bearing step equates a prediction to its own input by construction, and the method does not rely on self-citation chains for its core justification.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that foundation-model embeddings provide useful guidance and that adding a modest number of learnable semantic tokens does not introduce new optimization difficulties.

free parameters (1)
  • number of semantic tokens
    The cardinality of the added semantic token set is a design hyperparameter whose value must be chosen or tuned.
axioms (1)
  • domain assumption Pretrained visual foundation models supply reliable semantic embeddings suitable for distillation.
    The asymmetric distillation step depends on the quality and generality of external foundation models.

pith-pipeline@v0.9.0 · 5750 in / 1240 out tokens · 62183 ms · 2026-05-20T12:00:35.093772+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 28 internal anchors

  1. [1]

    Advances in neural information processing systems35, 23716– 23736 (2022) 4

    Alayrac,J.B.,Donahue,J.,Luc,P.,Miech,A.,Barr,I.,Hasson,Y.,Lenc,K.,Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 4

  2. [2]

    In: Forty-second International Conference on Machine Learning (2025) 4, 13

    Bachmann, R., Allardice, J., Mizrahi, D., Fini, E., Kar, O.F., Amirloo, E., El- Nouby, A., Zamir, A., Dehghan, A.: Flextok: Resampling images into 1d token sequences of flexible length. In: Forty-second International Conference on Machine Learning (2025) 4, 13

  3. [3]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023) 1

  4. [4]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 9

  5. [5]

    arXiv preprint arXiv:2411.16681 (2024) 4

    Bai, Z., Gao, J., Gao, Z., Wang, P., Zhang, Z., He, T., Shou, M.Z.: Factorized visual tokenization and generation. arXiv preprint arXiv:2411.16681 (2024) 4

  6. [6]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradi- ents through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013) 8

  7. [7]

    Computer Science

    Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023) 10

  8. [8]

    VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    Bi, T., Zhang, X., Lu, Y., Zheng, N.: Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457 (2025) 4, 7

  9. [9]

    Advances in neural information processing systems33, 1877–1901 (2020) 1

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 1

  10. [10]

    Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset(2022) 9, 1

  11. [11]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025) 9

  12. [12]

    Chang,H.,Zhang,H.,Jiang,L.,Liu,C.,Freeman,W.T.:Maskgit:Maskedgenera- tiveimagetransformer.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 11315–11325 (2022) 3

  13. [13]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 9, 1

  14. [14]

    arXiv preprint arXiv:2509.25162 (2025) 4

    Chen,B.,Bi,S.,Tan,H.,Zhang,H.,Zhang,T.,Li,Z.,Xiong,Y.,Zhang,J.,Zhang, K.: Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162 (2025) 4

  15. [15]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models- architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 9, 1 WinTok 9

  16. [16]

    In: European Conference on Computer Vision

    Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text- to-image generation. In: European Conference on Computer Vision. pp. 74–91. Springer (2024) 10

  17. [17]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023) 10

  18. [18]

    In: Eu- ropean Conference on Computer Vision

    Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. In: Eu- ropean Conference on Computer Vision. pp. 370–387. Springer (2024) 9

  19. [19]

    In: The Eleventh International Conference on Learning Representations (2023) 1

    Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023) 1

  20. [20]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 2, 4, 9

  21. [21]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024) 9

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024) 3

  23. [23]

    arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

    Chen, Z., Wang, C., Chen, X., Xu, H., Huang, R., Zhou, J., Han, J., Xu, H., Liang, X.: Semhitok: A unified image tokenizer via semantic-guided hier- archical codebook for multimodal understanding and generation. arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

  24. [24]

    Advances in neural information processing systems36, 49250–49267 (2023) 9

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruc- tion tuning. Advances in neural information processing systems36, 49250–49267 (2023) 9

  25. [25]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 12

  26. [26]

    In: 2009 IEEE conference on computer vision and pattern recognition

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009) 3, 5, 9, 1

  27. [27]

    In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

  28. [28]

    arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

    Du, S., Guo, J., Li, B., Cui, S., Xu, Z., Luo, Y., Wei, Y., Gai, K., Wang, X., Wu, K., et al.: Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

  29. [29]

    In: Forty-first international conference on machine learning (2024) 7, 10 10 Y

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz,D.,Sauer,A.,Boesel,F.,etal.:Scalingrectifiedflowtransformersforhigh- resolution image synthesis. In: Forty-first international conference on machine learning (2024) 7, 10 10 Y. Guo et al

  30. [30]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution im- age synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021) 2, 3, 4

  31. [31]

    arXiv preprint arXiv:2411.15296 , year=

    Fu, C., Zhang, Y.F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al.: Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296 (2024) 9

  32. [32]

    Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

    Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

  33. [33]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed- x: Multimodal models with unified multi-granularity comprehension and genera- tion. arXiv preprint arXiv:2404.14396 (2024) 9, 10

  34. [34]

    Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

  35. [35]

    IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

    Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., Tao, D.: A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

  36. [36]

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalablevisionlearners.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 16000–16009 (2022) 4

  37. [37]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion mod- els with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024) 9

  38. [38]

    In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 9, 12

  39. [39]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 1

  40. [40]

    arXiv preprint arXiv:2507.07997 (2025) 4, 7

    Jia, M., Yin, W., Hu, X., Guo, J., Guo, X., Zhang, Q., Long, X.X., Tan, P.: Mgvq: Could vq-vae beat vae? a generalizable tokenizer with multi-group quantization. arXiv preprint arXiv:2507.07997 (2025) 4, 7

  41. [41]

    In: The Twelfth International Conference on Learning Representations (2024) 9

    Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., CHEN, B., Song, C., ZHANG, D., Ou, W., et al.: Unified language-vision pretraining in llm with dy- namic discrete visual tokenization. In: The Twelfth International Conference on Learning Representations (2024) 9

  42. [42]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019) 7

  43. [43]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 3

  44. [44]

    Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

    Kingma, D.P., Welling, M., et al.: An introduction to variational autoencoders. Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

  45. [45]

    International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

    Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

  46. [46]

    Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 7

  47. [47]

    Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 9

  48. [48]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image genera- tion using residual quantization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11523–11532 (2022) 8

  49. [49]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 9

  50. [50]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 3

  51. [51]

    Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

    Li, X., Qiu, K., Chen, H., Kuen, J., Gu, J., Raj, B., Lin, Z.: Imagefolder: Autore- gressive image generation with folded tokens. arXiv preprint arXiv:2410.01756 (2024) 4, 13

  52. [52]

    arXiv preprint arXiv:2509.16197 (2025) 4

    Li, Y., Qian, R., Pan, B., Zhang, H., Huang, H., Zhang, B., Tong, J., You, H., Du, X., Gan, Z., et al.: Manzano: A simple and scalable unified multimodal model with a hybrid vision tokenizer. arXiv preprint arXiv:2509.16197 (2025) 4

  53. [53]

    In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing. pp. 292–305 (2023) 3, 9, 12

  54. [54]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Li,Z.,Zhang,J.,Lin,Q.,Xiong,J.,Long,Y.,Deng,X.,Zhang,Y.,Liu,X.,Huang, M., Xiao, Z., et al.: Hunyuan-dit: A powerful multi-resolution diffusion trans- former with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748 (2024) 10

  55. [55]

    arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

    Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

  56. [56]

    In: European confer- ence on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European confer- ence on computer vision. pp. 740–755. Springer (2014) 9

  57. [57]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024) 9

  58. [58]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu,H.,Li,C.,Li,Y.,Lee,Y.J.:Improvedbaselineswithvisualinstructiontuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024) 2, 3, 4, 9

  59. [59]

    Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

  60. [60]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 9, 12

  61. [61]

    Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-sourceprojecttowarddemocratizingauto-regressivevisualgeneration.arXiv preprint arXiv:2409.04410 (2024) 7, 8

  62. [62]

    Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

    Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14 12 Y. Guo et al

  63. [63]

    In: European Conference on Computer Vision

    Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable inter- polant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024) 3

  64. [64]

    Journal of machine learning research9(11) (2008) 14

    Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research9(11) (2008) 14

  65. [65]

    Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

  66. [66]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 3

  67. [67]

    In: The Twelfth International Conference on Learning Representations (2024) 7, 10

    Podell,D.,English,Z.,Lacey,K.,Blattmann,A.,Dockhorn,T.,Müller,J.,Penna, J.,Rombach,R.:Sdxl:Improvinglatentdiffusionmodelsforhigh-resolutionimage synthesis. In: The Twelfth International Conference on Learning Representations (2024) 7, 10

  68. [68]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025) 2, 4, 5, 7, 8, 9, 10, 11

  69. [69]

    In: International conference on machine learning

    Radford,A.,Kim,J.W.,Hallacy,C.,Ramesh,A.,Goh,G.,Agarwal,S.,Sastry,G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 2, 3, 4, 7, 10, 13, 1

  70. [70]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 3, 7, 10

  71. [71]

    Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

  72. [72]

    In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018) 9, 1

  73. [73]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317– 8326 (2019) 9, 12

  74. [74]

    arXiv preprint arXiv:2406.10328 (2024) 9

    Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328 (2024) 9

  75. [75]

    DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

    Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 2, 3, 4, 7

  76. [76]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 2, 3, 5, 7, 8, 10, 11, 12 WinTok 13

  77. [77]

    In: The Twelfth International Conference on Learning Representations (2024) 1

    Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. In: The Twelfth International Conference on Learning Representations (2024) 1

  78. [78]

    Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278,

    Tang,H.,Xie,C.,Bao,X.,Weng,T.,Li,P.,Zheng,Y.,Wang,L.:Unilip:Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 2, 3, 4, 7

  79. [79]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024) 1, 10, 11

  80. [80]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) 1

Showing first 80 references.