WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Canmiao Fu; Chen Li; Jing Lyu; Shaobin Zhuang; Yali Wang; Yiwei Guo; Zhipeng Huang

arxiv: 2605.18115 · v1 · pith:3ASE6HNUnew · submitted 2026-05-18 · 💻 cs.CV

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Yiwei Guo , Shaobin Zhuang , Zhipeng Huang , Canmiao Fu , Chen Li , Jing Lyu , Yali Wang This is my paper

Pith reviewed 2026-05-20 12:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual tokenizersemantic tokensimage reconstructionvisual understandingimage generationtoken distillationhybrid modelunified visual tokenizer

0 comments

The pith

WinTok adds learnable semantic tokens alongside pixel tokens to separate high-level understanding from low-level reconstruction in one tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes WinTok as a hybrid visual tokenizer that keeps standard pixel tokens for reconstruction but adds a separate set of learnable semantic tokens to handle understanding tasks. This explicit split reduces the conflict that occurs when a single token space must serve both pixel-level detail and semantic abstraction. Guidance comes through an asymmetric distillation step that pulls the semantic tokens toward embeddings from any pretrained visual foundation model. The result is measurable gains in reconstruction, classification, and generation benchmarks while training on only 50 million open-source images.

Core claim

WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers, and achieves consistent improvements across reconstruction, understanding, and generation.

What carries the argument

Hybrid tokenizer that pairs pixel tokens with learnable semantic tokens under asymmetric token distillation from pretrained embeddings.

Load-bearing premise

Pretrained semantic embeddings can guide the learnable tokens to gain strong discriminative power without interfering with pixel-level reconstruction.

What would settle it

A controlled ablation in which removing the semantic tokens or the distillation step produces no drop in classification accuracy or generation quality on the same benchmarks.

Figures

Figures reproduced from arXiv: 2605.18115 by Canmiao Fu, Chen Li, Jing Lyu, Shaobin Zhuang, Yali Wang, Yiwei Guo, Zhipeng Huang.

**Figure 2.** Figure 2: Overview of our WinTok. Our WinTok adopts a hybrid tokenization paradigm integrated with learnable tokens. The separate token sets are optimized for visual understanding and generation respectively, with asymmetric token distillation for semantic tokens and image reconstruction for pixel tokens. Therefore, we mitigate the representation conflict while maintaining a unified tokenizer. More specifically, we … view at source ↗

**Figure 3.** Figure 3: Qualitative results demonstrating the superior performance of our [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Comparisons of using different tokenization strategies. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Ablations on design choices.(a) As the learnable token number increases, both the reconstruction quality and semantic capability improve. (b) Our WinTok demonstrates similar trends when adopting different semantic teachers, while SigLIP2 [83] benefits the most. (c) A larger decoder not only enhances reconstruction quality but also improves generation performance. Implementation details and more comprehen… view at source ↗

**Figure 6.** Figure 6: Discussions. (a) Reversing the roles of the two token types results in significant performance degradation, and WinTok is more rationale. (b) WinTok produces more discriminative clusters compared to other tokenizers, even with only 64 tokens representing the global semantics. in our setting. Ultimately, this ablation confirms the WinTok design, highlighting the necessity of strictly disentangling semanti… view at source ↗

**Figure 7.** Figure 7: Qualitative results of visual reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative results of multimodal understanding. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative results of visual generation. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative results of image classification. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

read the original abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WinTok adds learnable semantic tokens plus asymmetric distillation to ease the conflict between pixel reconstruction and high-level understanding in a single tokenizer, and it reports gains over UniTok with open data and released code.

read the letter

WinTok's hybrid design with supplementary semantic tokens and asymmetric distillation from foundation models is the main new piece here, and it reports better performance than UniTok on understanding and reconstruction using less data. The work builds directly on prior hybrid tokenizer efforts by making the semantic part learnable and guided by external embeddings. This avoids the overhead of dual tokenizers while aiming for a win-win. The release of code and use of open data are positive for the field. On the soft side, the abstract is short on specifics about how the experiments were controlled or what ablations were run on the semantic token count and distillation process. Without seeing those, it's difficult to assess if the central claim about reduced interference holds up solidly or if other variables are at play. The assumption about the tokens inheriting power without interference seems reasonable but requires the full paper's evidence. This is the kind of paper for researchers focused on unified vision tokenizers and multimodal models. Someone looking for incremental but practical advances in balancing gen and understanding tasks could get something out of it, particularly with the open code. I'd recommend sending it to peer review so the community can evaluate the methods and results in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes WinTok, a hybrid visual tokenizer that augments standard pixel tokens with a small set of learnable semantic tokens. An asymmetric distillation loss transfers discriminative knowledge from any pretrained visual foundation model into the semantic tokens while the pixel tokens continue to optimize reconstruction. The design is claimed to eliminate cross-task interference without the cost of separate tokenizers. Experiments on 10 benchmarks, trained on 50 M open-source images, report an 11.2 % gain in classification accuracy over UniTok and a competitive rFID of 0.41, with code released.

Significance. If the reported gains prove robust, the work supplies a lightweight, single-pass route to joint understanding and generation that avoids the parameter and compute overhead of dual-tokenizer architectures. The explicit use of transferable semantic tokens and the open release of code and training data constitute concrete strengths for reproducibility and downstream adoption in unified multimodal models.

major comments (2)

[§4, Table 2] §4 (Experiments) and Table 2: the central claim of consistent cross-task improvement rests on comparisons to UniTok, yet the manuscript provides no statistical significance tests, no repeated runs with different seeds, and no explicit statement of the train/validation splits used for the 50 M image corpus. These omissions make it impossible to judge whether the 11.2 % classification lift is load-bearing or sensitive to implementation details.
[§3.2] §3.2 (Asymmetric Token Distillation): the description of the distillation objective does not specify the weighting hyper-parameter between the semantic guidance loss and the pixel reconstruction loss, nor does it report an ablation on this trade-off. Because the paper’s main thesis is that the two objectives can be decoupled without interference, the absence of this control directly affects the credibility of the win-win claim.

minor comments (2)

[Figure 3] Figure 3: the legend and axis labels are too small to read at print size; please enlarge or split into two panels.
[§2] §2 (Related Work): the discussion of prior hybrid tokenizers omits the recent work on semantic-pixel disentanglement in [citation needed]; a brief comparison would help situate the contribution.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for providing detailed and insightful feedback. Below we respond to each major comment and indicate the changes we plan to make in the revised manuscript.

read point-by-point responses

Referee: [§4, Table 2] §4 (Experiments) and Table 2: the central claim of consistent cross-task improvement rests on comparisons to UniTok, yet the manuscript provides no statistical significance tests, no repeated runs with different seeds, and no explicit statement of the train/validation splits used for the 50 M image corpus. These omissions make it impossible to judge whether the 11.2 % classification lift is load-bearing or sensitive to implementation details.

Authors: We agree that additional statistical analysis would be beneficial. However, performing multiple full training runs on 50M images is computationally prohibitive at this scale. Instead, we have added a note in the revised manuscript emphasizing the consistency of improvements across the 10 diverse benchmarks as evidence of robustness. We have also explicitly described the composition and splits of the 50M image corpus in Section 4.1, clarifying that it uses the standard training partitions from the source datasets (ImageNet-21K, LAION subset, etc.) without a held-out validation set for the tokenizer pretraining stage. revision: partial
Referee: [§3.2] §3.2 (Asymmetric Token Distillation): the description of the distillation objective does not specify the weighting hyper-parameter between the semantic guidance loss and the pixel reconstruction loss, nor does it report an ablation on this trade-off. Because the paper’s main thesis is that the two objectives can be decoupled without interference, the absence of this control directly affects the credibility of the win-win claim.

Authors: We thank the referee for noting this omission. The distillation objective is formulated as L_total = L_pixel + λ * L_semantic, with λ set to 0.1 in our experiments. This weighting was determined through initial tuning to achieve the reported balance between reconstruction and understanding. We have now included this hyper-parameter value and its justification in Section 3.2. Furthermore, we have added an ablation study on the effect of λ in the supplementary material, demonstrating that the hybrid design yields improvements over a range of λ values, supporting the claim of minimal cross-task interference. revision: yes

standing simulated objections not resolved

Lack of multiple repeated runs with different seeds due to computational constraints of training on 50M images.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper proposes an empirical hybrid tokenizer architecture that decouples pixel reconstruction from semantic understanding via learnable semantic tokens and asymmetric distillation from an external foundation model. All performance claims (e.g., +11.2% classification accuracy over UniTok, rFID 0.41) are validated on held-out benchmarks using 50M open-source images rather than reducing to quantities defined by internal fitted parameters or self-referential equations. No load-bearing step equates a prediction to its own input by construction, and the method does not rely on self-citation chains for its core justification.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that foundation-model embeddings provide useful guidance and that adding a modest number of learnable semantic tokens does not introduce new optimization difficulties.

free parameters (1)

number of semantic tokens
The cardinality of the added semantic token set is a design hyperparameter whose value must be chosen or tuned.

axioms (1)

domain assumption Pretrained visual foundation models supply reliable semantic embeddings suitable for distillation.
The asymmetric distillation step depends on the quality and generality of external foundation models.

pith-pipeline@v0.9.0 · 5750 in / 1240 out tokens · 62183 ms · 2026-05-20T12:00:35.093772+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

WinTok supplements pixel tokens with a set of learnable semantic tokens... asymmetric token distillation mechanism... guided by pretrained semantic embeddings from any visual foundation model
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unified encoder... pixel tokens for visual generation... learnable tokens for visual understanding

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 28 internal anchors

[1]

Advances in neural information processing systems35, 23716– 23736 (2022) 4

Alayrac,J.B.,Donahue,J.,Luc,P.,Miech,A.,Barr,I.,Hasson,Y.,Lenc,K.,Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 4

work page 2022
[2]

In: Forty-second International Conference on Machine Learning (2025) 4, 13

Bachmann, R., Allardice, J., Mizrahi, D., Fini, E., Kar, O.F., Amirloo, E., El- Nouby, A., Zamir, A., Dehghan, A.: Flextok: Resampling images into 1d token sequences of flexible length. In: Forty-second International Conference on Machine Learning (2025) 4, 13

work page 2025
[3]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

arXiv preprint arXiv:2411.16681 (2024) 4

Bai, Z., Gao, J., Gao, Z., Wang, P., Zhang, Z., He, T., Shou, M.Z.: Factorized visual tokenization and generation. arXiv preprint arXiv:2411.16681 (2024) 4

work page arXiv 2024
[6]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradi- ents through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013) 8

work page internal anchor Pith review Pith/arXiv arXiv 2013
[7]

Computer Science

Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023) 10

work page 2023
[8]

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Bi, T., Zhang, X., Lu, Y., Zheng, N.: Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457 (2025) 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Advances in neural information processing systems33, 1877–1901 (2020) 1

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 1

work page 1901
[10]

Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset(2022) 9, 1

work page 2022
[11]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025) 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Chang,H.,Zhang,H.,Jiang,L.,Liu,C.,Freeman,W.T.:Maskgit:Maskedgenera- tiveimagetransformer.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 11315–11325 (2022) 3

work page 2022
[13]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 9, 1

work page 2021
[14]

arXiv preprint arXiv:2509.25162 (2025) 4

Chen,B.,Bi,S.,Tan,H.,Zhang,H.,Zhang,T.,Li,Z.,Xiong,Y.,Zhang,J.,Zhang, K.: Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162 (2025) 4

work page arXiv 2025
[15]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models- architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 9, 1 WinTok 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

In: European Conference on Computer Vision

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text- to-image generation. In: European Conference on Computer Vision. pp. 74–91. Springer (2024) 10

work page 2024
[17]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023) 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

In: Eu- ropean Conference on Computer Vision

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. In: Eu- ropean Conference on Computer Vision. pp. 370–387. Springer (2024) 9

work page 2024
[19]

In: The Eleventh International Conference on Learning Representations (2023) 1

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023) 1

work page 2023
[20]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 2, 4, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024) 3

work page 2024
[23]

arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

Chen, Z., Wang, C., Chen, X., Xu, H., Huang, R., Zhou, J., Han, J., Xu, H., Liang, X.: Semhitok: A unified image tokenizer via semantic-guided hier- archical codebook for multimodal understanding and generation. arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

work page arXiv 2025
[24]

Advances in neural information processing systems36, 49250–49267 (2023) 9

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruc- tion tuning. Advances in neural information processing systems36, 49250–49267 (2023) 9

work page 2023
[25]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

In: 2009 IEEE conference on computer vision and pattern recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009) 3, 5, 9, 1

work page 2009
[27]

In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

work page 2021
[28]

arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

Du, S., Guo, J., Li, B., Cui, S., Xu, Z., Luo, Y., Wei, Y., Gai, K., Wang, X., Wu, K., et al.: Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

work page arXiv 2025
[29]

In: Forty-first international conference on machine learning (2024) 7, 10 10 Y

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz,D.,Sauer,A.,Boesel,F.,etal.:Scalingrectifiedflowtransformersforhigh- resolution image synthesis. In: Forty-first international conference on machine learning (2024) 7, 10 10 Y. Guo et al

work page 2024
[30]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution im- age synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021) 2, 3, 4

work page 2021
[31]

arXiv preprint arXiv:2411.15296 , year=

Fu, C., Zhang, Y.F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al.: Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296 (2024) 9

work page arXiv 2024
[32]

Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

work page 2023
[33]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed- x: Multimodal models with unified multi-granularity comprehension and genera- tion. arXiv preprint arXiv:2404.14396 (2024) 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

work page 2023
[35]

IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., Tao, D.: A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

work page 2024
[36]

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalablevisionlearners.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 16000–16009 (2022) 4

work page 2022
[37]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion mod- els with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 9, 12

work page 2019
[39]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

arXiv preprint arXiv:2507.07997 (2025) 4, 7

Jia, M., Yin, W., Hu, X., Guo, J., Guo, X., Zhang, Q., Long, X.X., Tan, P.: Mgvq: Could vq-vae beat vae? a generalizable tokenizer with multi-group quantization. arXiv preprint arXiv:2507.07997 (2025) 4, 7

work page arXiv 2025
[41]

In: The Twelfth International Conference on Learning Representations (2024) 9

Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., CHEN, B., Song, C., ZHANG, D., Ou, W., et al.: Unified language-vision pretraining in llm with dy- namic discrete visual tokenization. In: The Twelfth International Conference on Learning Representations (2024) 9

work page 2024
[42]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019) 7

work page 2019
[43]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 3

work page internal anchor Pith review Pith/arXiv arXiv 2013
[44]

Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

Kingma, D.P., Welling, M., et al.: An introduction to variational autoencoders. Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

work page 2019
[45]

International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

work page 1956
[46]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 7

work page 2024
[47]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 9

work page 2025
[48]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image genera- tion using residual quantization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11523–11532 (2022) 8

work page 2022
[49]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 3

work page 2023
[51]

Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

Li, X., Qiu, K., Chen, H., Kuen, J., Gu, J., Raj, B., Lin, Z.: Imagefolder: Autore- gressive image generation with folded tokens. arXiv preprint arXiv:2410.01756 (2024) 4, 13

work page arXiv 2024
[52]

arXiv preprint arXiv:2509.16197 (2025) 4

Li, Y., Qian, R., Pan, B., Zhang, H., Huang, H., Zhang, B., Tong, J., You, H., Du, X., Gan, Z., et al.: Manzano: A simple and scalable unified multimodal model with a hybrid vision tokenizer. arXiv preprint arXiv:2509.16197 (2025) 4

work page arXiv 2025
[53]

In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing. pp. 292–305 (2023) 3, 9, 12

work page 2023
[54]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Li,Z.,Zhang,J.,Lin,Q.,Xiong,J.,Long,Y.,Deng,X.,Zhang,Y.,Liu,X.,Huang, M., Xiao, Z., et al.: Hunyuan-dit: A powerful multi-resolution diffusion trans- former with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748 (2024) 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

work page arXiv 2025
[56]

In: European confer- ence on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European confer- ence on computer vision. pp. 740–755. Springer (2014) 9

work page 2014
[57]

World Model on Million-Length Video And Language With Blockwise RingAttention

Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu,H.,Li,C.,Li,Y.,Lee,Y.J.:Improvedbaselineswithvisualinstructiontuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024) 2, 3, 4, 9

work page 2024
[59]

Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

work page 2023
[60]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 9, 12

work page 2024
[61]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-sourceprojecttowarddemocratizingauto-regressivevisualgeneration.arXiv preprint arXiv:2409.04410 (2024) 7, 8

work page arXiv 2024
[62]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14 12 Y. Guo et al

work page arXiv 2025
[63]

In: European Conference on Computer Vision

Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable inter- polant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024) 3

work page 2024
[64]

Journal of machine learning research9(11) (2008) 14

Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research9(11) (2008) 14

work page 2008
[65]

Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

work page 2023
[66]

In: Proceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 3

work page 2023
[67]

In: The Twelfth International Conference on Learning Representations (2024) 7, 10

Podell,D.,English,Z.,Lacey,K.,Blattmann,A.,Dockhorn,T.,Müller,J.,Penna, J.,Rombach,R.:Sdxl:Improvinglatentdiffusionmodelsforhigh-resolutionimage synthesis. In: The Twelfth International Conference on Learning Representations (2024) 7, 10

work page 2024
[68]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025) 2, 4, 5, 7, 8, 9, 10, 11

work page 2025
[69]

In: International conference on machine learning

Radford,A.,Kim,J.W.,Hallacy,C.,Ramesh,A.,Goh,G.,Agarwal,S.,Sastry,G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 2, 3, 4, 7, 10, 13, 1

work page 2021
[70]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 3, 7, 10

work page 2022
[71]

Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

work page 2022
[72]

In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018) 9, 1

work page 2018
[73]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317– 8326 (2019) 9, 12

work page 2019
[74]

arXiv preprint arXiv:2406.10328 (2024) 9

Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328 (2024) 9

work page arXiv 2024
[75]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 2, 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 2, 3, 5, 7, 8, 10, 11, 12 WinTok 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[77]

In: The Twelfth International Conference on Learning Representations (2024) 1

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. In: The Twelfth International Conference on Learning Representations (2024) 1

work page 2024
[78]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278,

Tang,H.,Xie,C.,Bao,X.,Weng,T.,Li,P.,Zheng,Y.,Wang,L.:Unilip:Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 2, 3, 4, 7

work page arXiv 2025
[79]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024) 1, 10, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

Showing first 80 references.

[1] [1]

Advances in neural information processing systems35, 23716– 23736 (2022) 4

Alayrac,J.B.,Donahue,J.,Luc,P.,Miech,A.,Barr,I.,Hasson,Y.,Lenc,K.,Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 4

work page 2022

[2] [2]

In: Forty-second International Conference on Machine Learning (2025) 4, 13

Bachmann, R., Allardice, J., Mizrahi, D., Fini, E., Kar, O.F., Amirloo, E., El- Nouby, A., Zamir, A., Dehghan, A.: Flextok: Resampling images into 1d token sequences of flexible length. In: Forty-second International Conference on Machine Learning (2025) 4, 13

work page 2025

[3] [3]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

arXiv preprint arXiv:2411.16681 (2024) 4

Bai, Z., Gao, J., Gao, Z., Wang, P., Zhang, Z., He, T., Shou, M.Z.: Factorized visual tokenization and generation. arXiv preprint arXiv:2411.16681 (2024) 4

work page arXiv 2024

[6] [6]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradi- ents through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013) 8

work page internal anchor Pith review Pith/arXiv arXiv 2013

[7] [7]

Computer Science

Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023) 10

work page 2023

[8] [8]

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Bi, T., Zhang, X., Lu, Y., Zheng, N.: Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457 (2025) 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Advances in neural information processing systems33, 1877–1901 (2020) 1

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 1

work page 1901

[10] [10]

Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset(2022) 9, 1

work page 2022

[11] [11]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025) 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Chang,H.,Zhang,H.,Jiang,L.,Liu,C.,Freeman,W.T.:Maskgit:Maskedgenera- tiveimagetransformer.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 11315–11325 (2022) 3

work page 2022

[13] [13]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021) 9, 1

work page 2021

[14] [14]

arXiv preprint arXiv:2509.25162 (2025) 4

Chen,B.,Bi,S.,Tan,H.,Zhang,H.,Zhang,T.,Li,Z.,Xiong,Y.,Zhang,J.,Zhang, K.: Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162 (2025) 4

work page arXiv 2025

[15] [15]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models- architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 9, 1 WinTok 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

In: European Conference on Computer Vision

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text- to-image generation. In: European Conference on Computer Vision. pp. 74–91. Springer (2024) 10

work page 2024

[17] [17]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023) 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

In: Eu- ropean Conference on Computer Vision

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. In: Eu- ropean Conference on Computer Vision. pp. 370–387. Springer (2024) 9

work page 2024

[19] [19]

In: The Eleventh International Conference on Learning Representations (2023) 1

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023) 1

work page 2023

[20] [20]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 2, 4, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024) 3

work page 2024

[23] [23]

arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

Chen, Z., Wang, C., Chen, X., Xu, H., Huang, R., Zhou, J., Han, J., Xu, H., Liang, X.: Semhitok: A unified image tokenizer via semantic-guided hier- archical codebook for multimodal understanding and generation. arXiv preprint arXiv:2503.06764 (2025) 4, 7, 9, 10, 1

work page arXiv 2025

[24] [24]

Advances in neural information processing systems36, 49250–49267 (2023) 9

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruc- tion tuning. Advances in neural information processing systems36, 49250–49267 (2023) 9

work page 2023

[25] [25]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 4, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

In: 2009 IEEE conference on computer vision and pattern recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009) 3, 5, 9, 1

work page 2009

[27] [27]

In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Ninth Interna- tional Conference on Learning Representations (2021) 4, 5

work page 2021

[28] [28]

arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

Du, S., Guo, J., Li, B., Cui, S., Xu, Z., Luo, Y., Wei, Y., Gai, K., Wang, X., Wu, K., et al.: Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv preprint arXiv:2511.23386 (2025) 4, 7, 9

work page arXiv 2025

[29] [29]

In: Forty-first international conference on machine learning (2024) 7, 10 10 Y

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz,D.,Sauer,A.,Boesel,F.,etal.:Scalingrectifiedflowtransformersforhigh- resolution image synthesis. In: Forty-first international conference on machine learning (2024) 7, 10 10 Y. Guo et al

work page 2024

[30] [30]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution im- age synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021) 2, 3, 4

work page 2021

[31] [31]

arXiv preprint arXiv:2411.15296 , year=

Fu, C., Zhang, Y.F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al.: Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296 (2024) 9

work page arXiv 2024

[32] [32]

Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023) 9, 1

work page 2023

[33] [33]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed- x: Multimodal models with unified multi-granularity comprehension and genera- tion. arXiv preprint arXiv:2404.14396 (2024) 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023) 9, 12

work page 2023

[35] [35]

IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., Tao, D.: A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans- actions on Pattern Analysis and Machine Intelligence46(12), 9052–9071 (2024) 4

work page 2024

[36] [36]

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalablevisionlearners.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 16000–16009 (2022) 4

work page 2022

[37] [37]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion mod- els with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 9, 12

work page 2019

[39] [39]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

arXiv preprint arXiv:2507.07997 (2025) 4, 7

Jia, M., Yin, W., Hu, X., Guo, J., Guo, X., Zhang, Q., Long, X.X., Tan, P.: Mgvq: Could vq-vae beat vae? a generalizable tokenizer with multi-group quantization. arXiv preprint arXiv:2507.07997 (2025) 4, 7

work page arXiv 2025

[41] [41]

In: The Twelfth International Conference on Learning Representations (2024) 9

Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., CHEN, B., Song, C., ZHANG, D., Ou, W., et al.: Unified language-vision pretraining in llm with dy- namic discrete visual tokenization. In: The Twelfth International Conference on Learning Representations (2024) 9

work page 2024

[42] [42]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019) 7

work page 2019

[43] [43]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 3

work page internal anchor Pith review Pith/arXiv arXiv 2013

[44] [44]

Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

Kingma, D.P., Welling, M., et al.: An introduction to variational autoencoders. Foundations and Trends®in Machine Learning12(4), 307–392 (2019) 3

work page 2019

[45] [45]

International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 1 WinTok 11

work page 1956

[46] [46]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 7

work page 2024

[47] [47]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 9

work page 2025

[48] [48]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image genera- tion using residual quantization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11523–11532 (2022) 8

work page 2022

[49] [49]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 3

work page 2023

[51] [51]

Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

Li, X., Qiu, K., Chen, H., Kuen, J., Gu, J., Raj, B., Lin, Z.: Imagefolder: Autore- gressive image generation with folded tokens. arXiv preprint arXiv:2410.01756 (2024) 4, 13

work page arXiv 2024

[52] [52]

arXiv preprint arXiv:2509.16197 (2025) 4

Li, Y., Qian, R., Pan, B., Zhang, H., Huang, H., Zhang, B., Tong, J., You, H., Du, X., Gan, Z., et al.: Manzano: A simple and scalable unified multimodal model with a hybrid vision tokenizer. arXiv preprint arXiv:2509.16197 (2025) 4

work page arXiv 2025

[53] [53]

In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing. pp. 292–305 (2023) 3, 9, 12

work page 2023

[54] [54]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Li,Z.,Zhang,J.,Lin,Q.,Xiong,J.,Long,Y.,Deng,X.,Zhang,Y.,Liu,X.,Huang, M., Xiao, Z., et al.: Hunyuan-dit: A powerful multi-resolution diffusion trans- former with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748 (2024) 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025) 2, 4, 7, 9, 1

work page arXiv 2025

[56] [56]

In: European confer- ence on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European confer- ence on computer vision. pp. 740–755. Springer (2014) 9

work page 2014

[57] [57]

World Model on Million-Length Video And Language With Blockwise RingAttention

Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024) 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu,H.,Li,C.,Li,Y.,Lee,Y.J.:Improvedbaselineswithvisualinstructiontuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024) 2, 3, 4, 9

work page 2024

[59] [59]

Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 2, 3, 4

work page 2023

[60] [60]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024) 9, 12

work page 2024

[61] [61]

Luo, Z., Shi, F., Ge, Y., Yang, Y., Wang, L., Shan, Y.: Open-magvit2: An open-sourceprojecttowarddemocratizingauto-regressivevisualgeneration.arXiv preprint arXiv:2409.04410 (2024) 7, 8

work page arXiv 2024

[62] [62]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Uni- tok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025) 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14 12 Y. Guo et al

work page arXiv 2025

[63] [63]

In: European Conference on Computer Vision

Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable inter- polant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024) 3

work page 2024

[64] [64]

Journal of machine learning research9(11) (2008) 14

Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research9(11) (2008) 14

work page 2008

[65] [65]

Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2023) 4, 7, 13, 1, 2

work page 2023

[66] [66]

In: Proceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 3

work page 2023

[67] [67]

In: The Twelfth International Conference on Learning Representations (2024) 7, 10

Podell,D.,English,Z.,Lacey,K.,Blattmann,A.,Dockhorn,T.,Müller,J.,Penna, J.,Rombach,R.:Sdxl:Improvinglatentdiffusionmodelsforhigh-resolutionimage synthesis. In: The Twelfth International Conference on Learning Representations (2024) 7, 10

work page 2024

[68] [68]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2545–2555 (2025) 2, 4, 5, 7, 8, 9, 10, 11

work page 2025

[69] [69]

In: International conference on machine learning

Radford,A.,Kim,J.W.,Hallacy,C.,Ramesh,A.,Goh,G.,Agarwal,S.,Sastry,G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 2, 3, 4, 7, 10, 13, 1

work page 2021

[70] [70]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 3, 7, 10

work page 2022

[71] [71]

Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022) 9, 1

work page 2022

[72] [72]

In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018) 9, 1

work page 2018

[73] [73]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317– 8326 (2019) 9, 12

work page 2019

[74] [74]

arXiv preprint arXiv:2406.10328 (2024) 9

Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328 (2024) 9

work page arXiv 2024

[75] [75]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, W., Wang, Y., Song, Z., Li, Y., Sun, H., Chen, W., Zhou, Z., Xu, J., Wang, J., Yu, K.: Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324 (2025) 2, 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[76] [76]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregres- sive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024) 2, 3, 5, 7, 8, 10, 11, 12 WinTok 13

work page internal anchor Pith review Pith/arXiv arXiv 2024

[77] [77]

In: The Twelfth International Conference on Learning Representations (2024) 1

Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. In: The Twelfth International Conference on Learning Representations (2024) 1

work page 2024

[78] [78]

Unilip: Adapting clip for unified multimodal understanding, generation and editing.arXiv preprint arXiv:2507.23278,

Tang,H.,Xie,C.,Bao,X.,Weng,T.,Li,P.,Zheng,Y.,Wang,L.:Unilip:Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278 (2025) 2, 3, 4, 7

work page arXiv 2025

[79] [79]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024) 1, 10, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[80] [80]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023