pith. machine review for the scientific record.

arxiv: 2604.07340 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: unknown

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords deep compression autoencoders · vision transformers · token scaling · latent collapse · self-supervised learning · image reconstruction · generative models

The pith

Decomposing token-to-latent compression into two stages plus joint self-supervised training allows effective token scaling in ViT autoencoders without latent collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TC-AE to address latent representation collapse in deep compression autoencoders. Instead of increasing the number of latent channels, which harms generative quality, it treats the token space as the bridge between pixels and latents. By decomposing aggressive token-to-latent compression into two stages under a fixed latent budget, and by enhancing tokens with joint self-supervised training, it enables effective token scaling and stronger semantic structure, which in turn improves reconstruction and generative performance. The work aims to advance ViT-based tokenizers for visual generation.

Core claim

TC-AE identifies that aggressive token-to-latent compression limits scaling and causes collapse. It decomposes this compression into two stages to reduce structural information loss, allowing effective token number scaling. Additionally, joint self-supervised training enhances the semantic structure of image tokens, producing more generative-friendly latents. As a result, the model achieves substantially better reconstruction and generative performance under deep compression ratios.

What carries the argument

Two-stage decomposition of token-to-latent compression combined with joint self-supervised training to enhance token semantics.
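
The abstract leaves the interface between the two stages unspecified (the referee notes this below), so the following is only a minimal PyTorch-style sketch of the general pattern: a first stage that merges neighboring tokens while keeping channel width high, followed by a second stage that projects down to the latent budget. The module names, dimensions, and merge factor are all hypothetical, not the paper's.

    import torch
    import torch.nn as nn

    class TwoStageCompressor(nn.Module):
        """Sketch: split token-to-latent compression into (1) spatial
        token merging at full channel width and (2) channel projection,
        instead of one aggressive single-stage projection."""

        def __init__(self, dim=768, latent_dim=32, merge=4):
            super().__init__()
            self.merge = merge
            # Stage 1: fuse groups of `merge` neighboring tokens while
            # keeping the channel width high so structure survives.
            self.stage1 = nn.Linear(dim * merge, dim)
            # Stage 2: project the fused tokens down to the latent budget.
            self.stage2 = nn.Linear(dim, latent_dim)

        def forward(self, tokens):  # tokens: (B, N, dim)
            B, N, D = tokens.shape
            assert N % self.merge == 0
            x = tokens.reshape(B, N // self.merge, D * self.merge)
            x = self.stage1(x)      # (B, N/merge, dim): fewer tokens, still wide
            return self.stage2(x)   # (B, N/merge, latent_dim): narrow

The intent the abstract describes is that structural information is discarded gradually rather than in one step; the paper's actual split may differ.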

Load-bearing premise

That the primary bottleneck lies in the token-to-latent compression stage, and that splitting that stage in two, combined with self-supervision, will reliably avoid collapse without introducing new failure modes.

What would settle it

Observing whether models using the two-stage approach show higher structural information retention or better FID scores in generation compared to single-stage baselines at the same compression ratio.
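
Concretely, such a head-to-head could be run with an off-the-shelf FID implementation while holding the pixel-to-latent compression ratio fixed across variants. A sketch assuming torchmetrics, with hypothetical decoder and latent-sampler callables that return uint8 image batches:

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def fid_at_matched_ratio(decode_single, decode_two_stage,
                             real_loader, sample_latents):
        """Compare single-stage vs. two-stage variants trained at the
        same compression ratio. All callables are hypothetical."""
        scores = {}
        for name, decode in [("single_stage", decode_single),
                             ("two_stage", decode_two_stage)]:
            fid = FrechetInceptionDistance(feature=2048)
            for real in real_loader:                  # uint8 (B, 3, H, W)
                fid.update(real, real=True)
                fake = decode(sample_latents(real.size(0)))
                fid.update(fake, real=False)          # uint8 (B, 3, H, W)
            scores[name] = fid.compute().item()
        return scores

A matched-ratio comparison of structural metrics on reconstructions (e.g. SSIM) would address the information-retention half of the question the same way.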

Original abstract

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
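
The token-scaling study described in the abstract rests on simple arithmetic: shrinking the ViT patch size multiplies the token count while the latent budget stays fixed, so each token must squeeze through an ever narrower latent interface. A toy illustration (the numbers are hypothetical, not the paper's):

    # Token count vs. patch size for a 256x256 input, with the
    # "fixed latent budget" modeled as a constant total float count.
    image_hw = 256
    latent_budget = 1024           # total latent floats, held fixed
    for patch in (32, 16, 8):      # smaller patches -> more tokens
        n_tokens = (image_hw // patch) ** 2
        floats_per_token = latent_budget / n_tokens
        print(f"P={patch:2d}: {n_tokens:4d} tokens, "
              f"{floats_per_token:5.2f} latent floats per token")

At P=8, each of the 1024 tokens maps to a single float under this budget; that is the aggressive token-to-latent compression the abstract identifies as the factor limiting effective scaling.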

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes TC-AE, a ViT-based deep compression autoencoder that targets latent representation collapse by shifting focus to the token space. It studies token-number scaling via patch-size adjustment under a fixed latent budget, identifies aggressive single-stage token-to-latent compression as the bottleneck, and introduces a two-stage decomposition to reduce structural information loss while enabling scaling. A second innovation adds joint self-supervised training to strengthen the semantic structure of image tokens, yielding more generative-friendly latents. The central claim is that these changes deliver substantially better reconstruction and generative performance under deep compression.

Significance. If the empirical gains hold, the work provides a practical route to scale token capacity in ViT autoencoders without channel inflation or multi-stage training, which could improve tokenizers used in downstream visual generation pipelines and reduce the efficiency penalty of aggressive compression.

minor comments (2)
  1. Abstract: the claim of 'substantially improved reconstruction and generative performance' is stated without numerical results, baseline comparisons, or the compression ratios tested; even if the full manuscript contains the supporting experiments, the abstract would read better with at least one headline number.
  2. Method section (two-stage decomposition): the high-level rationale is clear, but the interface between the two stages, the loss weighting between the reconstruction and SSL objectives, and the exact patch-size schedules used in the token-scaling study should be stated with equations or a diagram for reproducibility (a generic form of the joint objective is sketched below).
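
On the loss-weighting point: the abstract and the internal excerpt at [16] say only that reconstruction, self-supervised, and adversarial objectives are trained jointly, with a reduced learning rate for stability. A generic form of the combined objective, with placeholder weights and hypothetical module names (not the paper's), might look like:

    def joint_loss(x, model, ssl_head, disc,
                   lambda_ssl=0.1, lambda_adv=0.05):  # placeholder weights
        """The shape of the objective the referee asks to see spelled
        out; every name and weight here is an assumption."""
        tokens = model.encode_tokens(x)       # ViT image tokens
        latents = model.compress(tokens)      # two-stage token-to-latent
        recon = model.decode(latents)
        loss_rec = (recon - x).pow(2).mean()  # pixel reconstruction
        loss_ssl = ssl_head(tokens, x)        # e.g. a DINO/iBOT-style term
        loss_adv = -disc(recon).mean()        # generator side of a GAN loss
        return loss_rec + lambda_ssl * loss_ssl + lambda_adv * loss_adv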

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee accurately captures the core contributions of TC-AE in addressing token-to-latent compression bottlenecks and latent collapse via two-stage decomposition and joint self-supervised training. We will prepare a revised manuscript addressing both minor comments.

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's central claims rest on two explicit architectural proposals (two-stage token-to-latent decomposition under a fixed latent budget, plus joint self-supervised training on tokens) whose benefits are asserted via comparative experiments rather than by re-deriving or fitting the same quantities that motivated them. No equation, parameter, or uniqueness claim reduces to its own inputs by construction, and no load-bearing premise is justified solely by self-citation. The derivation chain is therefore checked against external benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 1020 out tokens · 50505 ms · 2026-05-10T17:39:34.515645+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1] Shane Barratt and Rishi Sharma. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.
  2. [2] Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, and Chunhua Shen. HieraTok: Multi-scale visual tokenizer improves image reconstruction and generation. arXiv preprint arXiv:2509.23736, 2025.
  3. [3] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  4. [4] Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025.
  5. [5] Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, et al. DC-Gen: Post-training diffusion acceleration with deeply compressed latent space. arXiv preprint arXiv:2509.25180, 2025.
  6. [6] Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, et al. Ming-UniVision: Joint image understanding and generation with a unified continuous tokenizer. arXiv preprint arXiv:2510.06590, 2025.
  7. [7] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  8. [8] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.
  9. [9] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.
  10. [10] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
  11. [11] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
  12. [12] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. GigaTok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025.
  13. [13] Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation. arXiv preprint arXiv:2512.13687, 2025.
  14. [14] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.
  15. [15] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
  16. [16] Internal anchor (paper excerpt): "Self-supervision training. We adopt the same data augmentation pipelines as in the corresponding self-supervised learning methods (Caron et al., 2021; Zhou et al., 2021; Oquab et al., 2023). To ensure stable joint optimization with the reconstruction objective, we reduce the learning rate. Adversarial training. We adopt the adversarial training setup from R..."