InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3
The pith
InsightTok uses localized content-aware perceptual losses to improve text and face fidelity in discrete image tokenizers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InsightTok improves text and face fidelity in discrete visual tokenization by incorporating localized, content-aware perceptual losses into the training objective. This allows a tokenizer with a compact 16k codebook and a 16x downsampling rate to outperform prior tokenizers on text legibility and facial reconstruction while preserving general reconstruction quality, with the improvements carrying over to autoregressive image generation models such as InsightAR.
What carries the argument
Localized, content-aware perceptual losses that focus supervision on text and face regions during tokenizer training.
Load-bearing premise
Localized content-aware perceptual losses will reliably capture fine-grained text legibility and facial fidelity across diverse images without introducing new artifacts.
What would settle it
Reconstruct a benchmark set of images containing fine text and detailed faces using the tokenizer, then measure legibility scores and identity preservation against prior tokenizers to check whether gains hold or if general quality drops.
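A minimal sketch of that check in Python, under stated assumptions: `pytesseract` (a real OCR wrapper) stands in for the legibility scorer, and the face embeddings are assumed to come from a hypothetical `embed_face` helper wrapping any off-the-shelf recognizer such as an ArcFace checkpoint; none of this is the paper's actual protocol.

```python
# Sketch of the settling experiment: OCR-based text legibility plus
# face-identity cosine similarity on tokenizer reconstructions.
from difflib import SequenceMatcher

import numpy as np
from PIL import Image
import pytesseract  # real OCR wrapper; stands in for a legibility metric


def text_legibility(recon: Image.Image, ground_truth: str) -> float:
    """Normalized similarity in [0, 1] between OCR output and the known text."""
    predicted = pytesseract.image_to_string(recon).strip()
    return SequenceMatcher(None, predicted, ground_truth).ratio()


def identity_preservation(orig_emb: np.ndarray, recon_emb: np.ndarray) -> float:
    """Cosine similarity between face embeddings of original and reconstruction;
    embeddings would come from a hypothetical embed_face() helper."""
    denom = np.linalg.norm(orig_emb) * np.linalg.norm(recon_emb)
    return float(orig_emb @ recon_emb / denom)
```

Averaging both scores over the benchmark set for InsightTok and each prior tokenizer, alongside standard reconstruction metrics on the same images, would show whether the targeted gains hold and whether general quality drops.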
Original abstract
Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces InsightTok, a discrete visual tokenization framework that applies localized, content-aware perceptual losses to improve text legibility and facial fidelity under aggressive 16x downsampling with a compact 16k codebook. It claims these targeted losses yield better reconstruction of text and faces than prior tokenizers while preserving general image quality, with the improvements transferring to autoregressive generation in the proposed InsightAR model to produce clearer text and more faithful facial details.
Significance. If the reported gains hold under rigorous evaluation, the work provides a practical, low-overhead way to mitigate a known weakness in discrete tokenizers for autoregressive image models. By aligning supervision more closely with perceptually critical content rather than uniform reconstruction, it could influence tokenizer design for applications where text and faces are salient, without requiring larger codebooks or higher resolution.
major comments (2)
- [Abstract] The central outperformance claims for text/face reconstruction and transfer to InsightAR are stated without any quantitative metrics, baselines, or ablation results. The experiments section must supply concrete numbers (e.g., text OCR accuracy, face landmark error, or region-specific LPIPS) alongside standard reconstruction metrics to allow assessment of whether the localized losses deliver the claimed gains without trade-offs; a sketch of one such region-specific metric follows this list.
- [§3] The description of the localized content-aware perceptual losses does not specify the region detection mechanism or any additional hyperparameters introduced for content awareness. This detail is load-bearing for the claim that the losses reliably boost fidelity at 16x downsampling without introducing artifacts or requiring per-domain tuning, as noted in the weakest assumption.
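As one concrete reading of the requested region-specific LPIPS, a minimal sketch using the real `lpips` package in its spatial mode; averaging the per-pixel distance map under a binary region mask is our assumption about the metric's definition, not the paper's recipe.

```python
# Sketch: region-specific LPIPS over a text or face mask.
import torch
import lpips  # real package; spatial=True returns a per-pixel distance map

loss_fn = lpips.LPIPS(net="vgg", spatial=True)


def region_lpips(orig: torch.Tensor, recon: torch.Tensor,
                 mask: torch.Tensor) -> float:
    """orig/recon: (1, 3, H, W) tensors scaled to [-1, 1];
    mask: (1, 1, H, W) binary mask for the text or face region."""
    with torch.no_grad():
        dist_map = loss_fn(orig, recon)  # (1, 1, H, W) spatial LPIPS
    denom = mask.sum().clamp(min=1.0)    # avoid division by zero on empty masks
    return ((dist_map * mask).sum() / denom).item()
```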
minor comments (2)
- [Experiments] Table captions and figure legends should explicitly state the evaluation datasets and number of samples used for text and face metrics to improve reproducibility.
- [§5] Ensure all qualitative generation examples in InsightAR include side-by-side comparisons with the strongest baseline tokenizer at the same codebook size and downsampling rate.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation. We address each major comment below and will incorporate the suggested clarifications and enhancements into the revised manuscript.
Point-by-point responses
Referee: [Abstract] The central outperformance claims for text/face reconstruction and transfer to InsightAR are stated without any quantitative metrics, baselines, or ablation results. The experiments section must supply concrete numbers (e.g., text OCR accuracy, face landmark error, or region-specific LPIPS) alongside standard reconstruction metrics to allow assessment of whether the localized losses deliver the claimed gains without trade-offs.
Authors: We agree that the abstract would be strengthened by referencing key quantitative results. The experiments section already reports standard metrics (PSNR, SSIM, LPIPS, FID) with direct comparisons to prior tokenizers such as VQGAN and LlamaGen, plus region-specific LPIPS on text and face areas and ablation studies on the content-aware loss weights. In the revision we will add a concise sentence to the abstract citing the main gains (e.g., lower region-specific LPIPS and improved downstream generation quality) while keeping the abstract within length limits. No new experiments are required. revision: yes
Referee: [§3] The description of the localized content-aware perceptual losses does not specify the region detection mechanism or any additional hyperparameters introduced for content awareness. This detail is load-bearing for the claim that the losses reliably boost fidelity at 16x downsampling without introducing artifacts or requiring per-domain tuning, as noted in the weakest assumption.
Authors: We thank the referee for noting this gap in clarity. The region detection is performed once per training image using fixed, off-the-shelf detectors (EAST for text and MTCNN for faces) to produce binary masks; the perceptual loss is then re-weighted by a constant factor of 2.0 inside text masks and 1.5 inside face masks, with all other loss terms unchanged. These detectors and weights are applied uniformly across the training distribution with no per-image or per-domain tuning. We will expand the method section with an explicit paragraph describing the detection pipeline, the mask generation, and the exact hyperparameter values, together with a short ablation confirming stability across the reported 16k codebook setting. revision: yes
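A minimal sketch of the re-weighting described in this response, assuming a per-pixel perceptual-loss map and precomputed binary masks are available; the 2.0/1.5 factors follow the rebuttal, while the tensor shapes and function names are illustrative rather than the paper's implementation.

```python
import torch

# Re-weight a spatial perceptual-loss map by 2.0 inside text masks and
# 1.5 inside face masks, then average (face weight wins where masks overlap).
TEXT_WEIGHT, FACE_WEIGHT = 2.0, 1.5


def weighted_perceptual_loss(loss_map: torch.Tensor,
                             text_mask: torch.Tensor,
                             face_mask: torch.Tensor) -> torch.Tensor:
    """loss_map: (N, 1, H, W) per-pixel perceptual distance; masks are
    (N, 1, H, W) binary, produced once per image by fixed detectors
    (EAST for text, MTCNN for faces, per the rebuttal)."""
    weights = torch.ones_like(loss_map)
    weights = torch.where(text_mask.bool(), TEXT_WEIGHT * weights, weights)
    weights = torch.where(face_mask.bool(), FACE_WEIGHT * weights, weights)
    return (weights * loss_map).mean()
```

Because the masks are fixed per image and the only new hyperparameters are the two constant weights, this reading is consistent with the authors' claim of no per-image or per-domain tuning.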
Circularity Check
No significant circularity; derivation relies on standard losses without reduction to inputs
Full rationale
The paper introduces InsightTok by applying localized content-aware perceptual losses to improve text and face fidelity at 16x downsampling with a 16k codebook. No equations, derivations, or self-citations are shown that reduce the central claims to fitted parameters by construction or to prior self-referential results. The method uses targeted application of existing perceptual losses rather than any self-definitional loop, fitted-input prediction, or uniqueness theorem imported from the authors' prior work. Empirical transfer to InsightAR is presented as an observed outcome, not a mathematical necessity derived from the tokenizer inputs themselves. This is a standard non-circular proposal of specialized supervision.
Axiom & Free-Parameter Ledger
free parameters (2)
- codebook size
- downsampling rate
axioms (1)
- domain assumption: Localized perceptual losses can be applied to improve fidelity for specific content types without degrading overall reconstruction.