pith. machine review for the scientific record.

arxiv: 2604.07340 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: unknown

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords deep compression autoencoders · vision transformers · token scaling · latent collapse · self-supervised learning · image reconstruction · generative models

The pith

Decomposing token-to-latent compression into two stages plus joint self-supervised training allows effective token scaling in ViT autoencoders without latent collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TC-AE to address latent representation collapse in deep compression autoencoders. Instead of increasing the number of latent channels, which harms generative quality, it treats the token space as the bridge between pixels and latents. By decomposing aggressive token-to-latent compression into two stages under a fixed latent budget, and by enhancing tokens with joint self-supervised training, it enables effective token scaling and stronger semantic structure, which in turn improves reconstruction and generative performance. The work aims to advance ViT-based tokenizers for visual generation.

Core claim

TC-AE identifies that aggressive token-to-latent compression limits scaling and causes collapse. It decomposes this compression into two stages to reduce structural information loss, allowing effective token number scaling. Additionally, joint self-supervised training enhances the semantic structure of image tokens, producing more generative-friendly latents. As a result, the model achieves substantially better reconstruction and generative performance under deep compression ratios.

What carries the argument

Two-stage decomposition of token-to-latent compression combined with joint self-supervised training to enhance token semantics.
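
The abstract leaves the interface between the two stages unspecified (the referee notes this below), so the following is only a minimal PyTorch-style sketch of the general pattern: a first stage that merges neighboring tokens while keeping channel width high, followed by a second stage that projects down to the latent budget. The module names, dimensions, and merge factor are all hypothetical, not the paper's.

    import torch
    import torch.nn as nn

    class TwoStageCompressor(nn.Module):
        """Sketch: split token-to-latent compression into (1) spatial
        token merging at full channel width and (2) channel projection,
        instead of one aggressive single-stage projection."""

        def __init__(self, dim=768, latent_dim=32, merge=4):
            super().__init__()
            self.merge = merge
            # Stage 1: fuse groups of `merge` neighboring tokens while
            # keeping the channel width high so structure survives.
            self.stage1 = nn.Linear(dim * merge, dim)
            # Stage 2: project the fused tokens down to the latent budget.
            self.stage2 = nn.Linear(dim, latent_dim)

        def forward(self, tokens):  # tokens: (B, N, dim)
            B, N, D = tokens.shape
            assert N % self.merge == 0
            x = tokens.reshape(B, N // self.merge, D * self.merge)
            x = self.stage1(x)      # (B, N/merge, dim): fewer tokens, still wide
            return self.stage2(x)   # (B, N/merge, latent_dim): narrow

The intent the abstract describes is that structural information is discarded gradually rather than in one step; the paper's actual split may differ.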

Load-bearing premise

That the primary bottleneck lies in the token-to-latent compression stage, and that splitting that stage in two, combined with self-supervision, will reliably avoid collapse without introducing new failure modes.

What would settle it

Observing whether models using the two-stage approach show higher structural information retention or better FID scores in generation compared to single-stage baselines at the same compression ratio.
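
Concretely, such a head-to-head could be run with an off-the-shelf FID implementation while holding the pixel-to-latent compression ratio fixed across variants. A sketch assuming torchmetrics, with hypothetical decoder and latent-sampler callables that return uint8 image batches:

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def fid_at_matched_ratio(decode_single, decode_two_stage,
                             real_loader, sample_latents):
        """Compare single-stage vs. two-stage variants trained at the
        same compression ratio. All callables are hypothetical."""
        scores = {}
        for name, decode in [("single_stage", decode_single),
                             ("two_stage", decode_two_stage)]:
            fid = FrechetInceptionDistance(feature=2048)
            for real in real_loader:                  # uint8 (B, 3, H, W)
                fid.update(real, real=True)
                fake = decode(sample_latents(real.size(0)))
                fid.update(fake, real=False)          # uint8 (B, 3, H, W)
            scores[name] = fid.compute().item()
        return scores

A matched-ratio comparison of structural metrics on reconstructions (e.g. SSIM) would address the information-retention half of the question the same way.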

Original abstract

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
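
The token-scaling study described in the abstract rests on simple arithmetic: shrinking the ViT patch size multiplies the token count while the latent budget stays fixed, so each token must squeeze through an ever narrower latent interface. A toy illustration (the numbers are hypothetical, not the paper's):

    # Token count vs. patch size for a 256x256 input, with the
    # "fixed latent budget" modeled as a constant total float count.
    image_hw = 256
    latent_budget = 1024           # total latent floats, held fixed
    for patch in (32, 16, 8):      # smaller patches -> more tokens
        n_tokens = (image_hw // patch) ** 2
        floats_per_token = latent_budget / n_tokens
        print(f"P={patch:2d}: {n_tokens:4d} tokens, "
              f"{floats_per_token:5.2f} latent floats per token")

At P=8, each of the 1024 tokens maps to a single float under this budget; that is the aggressive token-to-latent compression the abstract identifies as the factor limiting effective scaling.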

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes TC-AE, a ViT-based deep compression autoencoder that targets latent representation collapse by shifting focus to the token space. It studies token-number scaling via patch-size adjustment under a fixed latent budget, identifies aggressive single-stage token-to-latent compression as the bottleneck, and introduces a two-stage decomposition to reduce structural information loss while enabling scaling. A second innovation adds joint self-supervised training to strengthen the semantic structure of image tokens, yielding more generative-friendly latents. The central claim is that these changes deliver substantially better reconstruction and generative performance under deep compression.

Significance. If the empirical gains hold, the work provides a practical route to scale token capacity in ViT autoencoders without channel inflation or multi-stage training, which could improve tokenizers used in downstream visual generation pipelines and reduce the efficiency penalty of aggressive compression.

minor comments (2)
  1. Abstract: the claim of 'substantially improved reconstruction and generative performance' is stated without numerical results, baseline comparisons, or the compression ratios tested; even if the full manuscript contains the supporting experiments, the abstract would read better with at least one headline number.
  2. Method section (two-stage decomposition): the high-level rationale is clear, but the interface between the two stages, the loss weighting between the reconstruction and SSL objectives, and the exact patch-size schedules used in the token-scaling study should be stated with equations or a diagram for reproducibility (a generic form of the joint objective is sketched below).
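
On the loss-weighting point: the abstract and the internal excerpt at [16] say only that reconstruction, self-supervised, and adversarial objectives are trained jointly, with a reduced learning rate for stability. A generic form of the combined objective, with placeholder weights and hypothetical module names (not the paper's), might look like:

    def joint_loss(x, model, ssl_head, disc,
                   lambda_ssl=0.1, lambda_adv=0.05):  # placeholder weights
        """The shape of the objective the referee asks to see spelled
        out; every name and weight here is an assumption."""
        tokens = model.encode_tokens(x)       # ViT image tokens
        latents = model.compress(tokens)      # two-stage token-to-latent
        recon = model.decode(latents)
        loss_rec = (recon - x).pow(2).mean()  # pixel reconstruction
        loss_ssl = ssl_head(tokens, x)        # e.g. a DINO/iBOT-style term
        loss_adv = -disc(recon).mean()        # generator side of a GAN loss
        return loss_rec + lambda_ssl * loss_ssl + lambda_adv * loss_adv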

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee accurately captures the core contributions of TC-AE in addressing token-to-latent compression bottlenecks and latent collapse via two-stage decomposition and joint self-supervised training. We will prepare a revised manuscript addressing both minor comments.

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's central claims rest on two explicit architectural proposals (two-stage token-to-latent decomposition under a fixed latent budget, plus joint self-supervised training on tokens) whose benefits are asserted via comparative experiments rather than by re-deriving or fitting the same quantities that motivated them. No equation, parameter, or uniqueness claim reduces to its own inputs by construction, and no load-bearing premise is justified solely by self-citation. The derivation chain is therefore checked against external benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 1020 out tokens · 50505 ms · 2026-05-10T17:39:34.515645+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1] Shane Barratt and Rishi Sharma. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.
  2. [2] Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, and Chunhua Shen. HieraTok: Multi-scale visual tokenizer improves image reconstruction and generation. arXiv preprint arXiv:2509.23736, 2025.
  3. [3] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  4. [4] Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025.
  5. [5] Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, et al. DC-Gen: Post-training diffusion acceleration with deeply compressed latent space. arXiv preprint arXiv:2509.25180, 2025.
  6. [6] Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, et al. Ming-UniVision: Joint image understanding and generation with a unified continuous tokenizer. arXiv preprint arXiv:2510.06590, 2025.
  7. [7] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  8. [8] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.
  9. [9] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.
  10. [10] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
  11. [11] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
  12. [12] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. GigaTok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025.
  13. [13] Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation. arXiv preprint arXiv:2512.13687, 2025.
  14. [14] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.
  15. [15] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
  16. [16] Internal anchor (paper excerpt): "Self-supervision training. We adopt the same data augmentation pipelines as in the corresponding self-supervised learning methods (Caron et al., 2021; Zhou et al., 2021; Oquab et al., 2023). To ensure stable joint optimization with the reconstruction objective, we reduce the learning rate. Adversarial training. We adopt the adversarial training setup from R..."