VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
Pith reviewed 2026-05-08 04:09 UTC · model grok-4.3
The pith
VibeToken encodes images into 32-256 dynamic tokens, enabling autoregressive generation at any resolution with fixed low compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VibeToken is a resolution-agnostic 1D Transformer-based image tokenizer that converts images of any size into a controllable sequence of 32 to 256 tokens. VibeToken-Gen, the class-conditioned autoregressive model built on it, generates images at arbitrary resolutions and aspect ratios. It reaches 3.94 gFID on 1024x1024 outputs using only 64 tokens, compared with a diffusion baseline that requires 1024 tokens for 5.87 gFID, while holding inference cost fixed at 179G FLOPs regardless of resolution.
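The constant-cost claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a conventional 2D tokenizer emits one token per 16x16 patch (as in LlamaGen-style pipelines); the helper name is ours, and the numbers illustrate scaling behavior rather than the paper's exact FLOPs accounting.

```python
# Illustrative arithmetic, not the paper's FLOPs model. Assumption: a
# fixed-resolution 2D tokenizer produces one token per 16x16 patch, so the
# token count (and hence attention-dominated AR inference cost) grows
# quadratically with image side length. A fixed 64-token sequence does not.

def token_count_fixed_patch(height, width, patch=16):
    """Token count for a conventional 2D patch tokenizer: one token per patch."""
    return (height // patch) * (width // patch)

for side in (256, 512, 1024):
    print(side, token_count_fixed_patch(side, side))
# 256 -> 256 tokens, 512 -> 1024 tokens, 1024 -> 4096 tokens
```

Since the 2D token count quadruples each time the side length doubles, AR decoding cost grows at least quadratically with resolution; a tokenizer whose output length is pinned at 64 tokens keeps per-image inference cost flat, which is the mechanism behind the reported constant 179G FLOPs.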
What carries the argument
VibeToken, the 1D Transformer tokenizer that produces a dynamic, user-controllable sequence of 32-256 tokens from images of varying resolution and aspect ratio.
If this is right
- Autoregressive generators can now operate at any resolution without quadratic growth in inference FLOPs.
- High-resolution synthesis requires far fewer tokens than current diffusion pipelines while matching or exceeding their quality.
- Constant compute cost across scales makes production deployment of autoregressive visual models feasible.
- Class-conditioned generation works directly for varying aspect ratios without retraining or padding tricks.
Where Pith is reading between the lines
- The same dynamic tokenization idea could be tested on video sequences where frame count and spatial size both vary.
- Training the tokenizer on a broad mix of resolutions might improve robustness beyond the resolutions evaluated here.
- The efficiency advantage could allow autoregressive models to reach resolutions where diffusion methods become computationally prohibitive.
Load-bearing premise
The 1D tokenizer can encode and decode images at any resolution and aspect ratio using only a short dynamic token sequence without losing fine details required for high-quality autoregressive generation.
What would settle it
Observing that high-resolution images produced from 64 tokens exhibit visible artifacts or higher perceptual error rates than diffusion models that use many more tokens on the same benchmarks would falsify this premise.
Figures
read the original abstract
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VibeToken, a resolution-agnostic 1D Transformer-based image tokenizer that encodes arbitrary-resolution and aspect-ratio images into a dynamic, user-controllable sequence of 32-256 tokens. It then presents VibeToken-Gen, a class-conditioned autoregressive generator built on this tokenizer that supports out-of-the-box arbitrary resolutions, achieves 3.94 gFID on 1024x1024 images using only 64 tokens (versus 5.87 gFID for a diffusion baseline using 1024 tokens), and maintains constant 179G FLOPs independent of resolution (63.4x more efficient than LlamaGen at 1024x1024).
Significance. If the central claims hold after verification, the work would meaningfully advance autoregressive image synthesis by removing the fixed-resolution and quadratic-complexity barriers that have kept AR models behind diffusion approaches at high resolutions. The dynamic tokenization and constant-FLOPs property could enable practical high-resolution generation in production settings where compute budgets are constrained.
Major comments (2)
- [Abstract] Abstract: The headline efficiency and quality claims (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs) rest on the unverified premise that the 1D Transformer tokenizer faithfully reconstructs high-frequency detail across resolutions without artifacts; no PSNR, LPIPS, or token-count-vs-resolution ablation is reported to support this.
- [VibeToken] VibeToken architecture section: The constant-FLOPs claim for VibeToken-Gen requires that the tokenizer's patch embedding and positional scheme produce a fixed-length sequence independent of input resolution and aspect ratio; if any component (e.g., adaptive pooling or resolution-dependent downsampling) scales with image size, the reported 63.4x efficiency gain over LlamaGen would not hold at 1024x1024.
Minor comments (1)
- [Abstract] The abstract would be clearer if it briefly stated the training dataset size, model parameter counts for both tokenizer and generator, and the exact diffusion baseline used for the gFID comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of our claims that warrant clarification and additional support. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline efficiency and quality claims (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs) rest on the unverified premise that the 1D Transformer tokenizer faithfully reconstructs high-frequency detail across resolutions without artifacts; no PSNR, LPIPS, or token-count-vs-resolution ablation is reported to support this.
Authors: We acknowledge that the abstract emphasizes generation results and does not include explicit reconstruction metrics for VibeToken. To address this, the revised manuscript will add a dedicated subsection on tokenizer reconstruction quality. This will report PSNR and LPIPS scores across multiple resolutions and token counts (32-256), along with an ablation studying the trade-off between token count, reconstruction fidelity, and downstream generation performance. These additions will directly substantiate the tokenizer's ability to preserve high-frequency details. revision: yes
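Of the metrics the authors commit to reporting, PSNR is the simplest to pin down. The following is a minimal stand-alone sketch of how it would be computed per image; the function name and toy inputs are ours, and real evaluation would operate on full image arrays (with LPIPS requiring a learned perceptual network).

```python
# Minimal PSNR sketch over flat pixel sequences (assumption: 8-bit range,
# max_val = 255). A real tokenizer evaluation would average this over a
# validation set at each resolution and token count.
import math

def psnr(original, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    assert len(original) == len(reconstruction)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy check: a reconstruction off by one gray level everywhere gives MSE = 1.
print(round(psnr([10, 20, 30], [11, 21, 31]), 2))  # 48.13
```

Higher PSNR at a given token count would directly support the claim that the short dynamic sequence preserves fine detail; the proposed ablation would plot this against token counts from 32 to 256.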
-
Referee: [VibeToken] VibeToken architecture section: The constant-FLOPs claim for VibeToken-Gen requires that the tokenizer's patch embedding and positional scheme produce a fixed-length sequence independent of input resolution and aspect ratio; if any component (e.g., adaptive pooling or resolution-dependent downsampling) scales with image size, the reported 63.4x efficiency gain over LlamaGen would not hold at 1024x1024.
Authors: VibeToken is designed such that the output sequence length is user-specified and independent of input resolution or aspect ratio. The architecture processes a variable number of patches from the input image but employs a fixed-capacity 1D Transformer bottleneck that always produces exactly K tokens (K chosen in [32, 256]), with resolution-agnostic positional encodings. This ensures VibeToken-Gen always operates on a fixed token sequence (e.g., 64 tokens), yielding constant 179G FLOPs regardless of resolution. We will expand the architecture section with a clearer description, pseudocode, and a diagram illustrating the fixed-output mechanism to eliminate any ambiguity. revision: yes
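The fixed-output mechanism the authors describe can be sketched at the shape level. The toy below uses Perceiver-style cross-attention resampling (K learned query vectors attending over a variable-length patch sequence) as one plausible instance of a fixed-capacity bottleneck; it is our illustrative guess, not VibeToken's actual architecture, and the function name is hypothetical.

```python
# Shape-level sketch: K learned queries cross-attend over N patches and always
# emit exactly K vectors, so output length is independent of input resolution.
# Single-head, pure-Python toy; a real model would use multi-head attention
# with learned projections.
import math
import random

def cross_attention_pool(patches, queries):
    """Return one attention-pooled output vector per query (always len(queries))."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(dim)
                  for p in patches]
        m = max(scores)  # subtract max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        outputs.append([sum(w * p[d] for w, p in zip(weights, patches))
                        for d in range(dim)])
    return outputs

random.seed(0)
K, dim = 64, 8
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(K)]
for n_patches in (256, 1024, 4096):  # 256px, 512px, 1024px at 16x16 patches
    patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
    tokens = cross_attention_pool(patches, queries)
    print(n_patches, len(tokens))  # token count stays at K = 64
```

Whatever the actual mechanism, the load-bearing property is the one this toy exhibits: the encoder reads a resolution-dependent number of patches but hands the AR generator a sequence whose length is set only by the user-chosen K.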
Circularity Check
No circularity; efficiency and performance claims are empirical design outcomes, not reductions by construction.
full rationale
The abstract and provided text introduce VibeToken as a 1D Transformer tokenizer producing dynamic 32-256 token sequences for arbitrary resolutions, followed by VibeToken-Gen as a class-conditioned AR model. Reported results (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs vs. LlamaGen's 11T) are presented as measured comparisons to baselines, not as outputs of any derivation or equation chain. Constant FLOPs follow directly from the fixed-length token sequence design choice, which is the stated innovation rather than a fitted or renamed input. No equations, self-citations, uniqueness theorems, or ansatzes appear that would make any claim equivalent to its inputs by construction. The chain is self-contained via architecture and external empirical validation.