ChannelTok: Efficient Flexible-Length Vision Tokenization

Arpit Bansal; Sukriti Paul; Tom Goldstein

arxiv: 2606.04461 · v1 · pith:ZUDYG4KHnew · submitted 2026-06-03 · 💻 cs.CV

Pith reviewed 2026-06-28 07:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords: channel-wise tokenization, flexible vision tokens, stochastic tail-dropping, image tokenization, perceptual quality, autoregressive generation, efficient autoencoders

The pith

Treating each latent channel as a token yields flexible-length vision representations that maintain high perceptual quality with a lightweight model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a tokenizer that treats every channel in the latent representation as an individual visual token rather than using spatial patches. This choice supports a compact CNN-Transformer hybrid backbone instead of heavy generative decoders. Stochastic tail-dropping applied during training causes the channels to sort themselves by semantic importance, so that inference simply keeps the first k channels to achieve any desired length or compression rate. The same ordering supports variable-length autoregressive image generation without extra machinery. A reader would care because the approach replaces complex, slow decoders with a direct, scalable way to trade token count for quality on standard image benchmarks.

Core claim

By representing an image as an ordered set of latent channels and training with stochastic tail-dropping, the model produces a representation whose prefix of any length already constitutes a valid, high-quality encoding; this single mechanism simultaneously delivers flexible compression, variable-length generation, and competitive perceptual fidelity on ImageNet while using fewer parameters and faster decoding than prior flexible tokenizers.

What carries the argument

Stochastic tail-dropping during training, which forces latent channels to self-organize by semantic importance so that the prefix of k channels suffices for any budget.
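A minimal sketch of one tail-dropping training step may make the mechanism concrete. Everything specific here is an assumption for illustration: the helper names `sample_budget` and `drop_tail`, the uniform choice of the cut point, the `p_mask` default, and the use of plain Python lists in place of tensors. It shows only the forward-pass masking, not the gradient handling the paper describes.

```python
import random


def sample_budget(num_channels, p_mask=0.5, rng=random):
    """Choose how many leading channels stay active for one training step.

    Assumption: with probability p_mask the tail is cut at a uniformly
    random position; otherwise every channel is kept.
    """
    if rng.random() < p_mask:
        return rng.randint(1, num_channels)  # randint is inclusive at both ends
    return num_channels


def drop_tail(channels, k):
    """Zero every channel after the first k (channels: list of flat lists)."""
    return [ch[:] if i < k else [0.0] * len(ch) for i, ch in enumerate(channels)]


rng = random.Random(0)                            # seeded for repeatability
latent = [[float(c + 1)] * 4 for c in range(8)]   # 8 toy channels, 4 values each
k = sample_budget(len(latent), p_mask=0.5, rng=rng)
masked = drop_tail(latent, k)
```

Because the cut always removes a suffix, channel 1 survives every step while the last channel survives only the least aggressive cuts, which is the pressure that would push important content toward early channels.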

If this is right

Quality remains consistent when the same model is evaluated at many different token budgets on ImageNet.
Variable-length autoregressive image generation becomes possible by emitting channels sequentially without architectural changes.
Decoding speed increases because the decoder operates on a simple ordered channel list rather than iterative spatial refinement.
Model size stays small because the backbone is a lightweight CNN-Transformer hybrid rather than a parameter-heavy spatial tokenizer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prefix-ordering property could be tested for progressive image transmission where partial channel sets are sent first.
If the importance ordering generalizes across datasets, the method might reduce retraining costs when adapting to new domains.
The channel view may simplify integration with existing channel-wise compression pipelines used in video codecs.

Load-bearing premise

Stochastic tail-dropping during training will reliably sort channels by semantic importance so that keeping only the earliest channels preserves quality at every length.

What would settle it

An experiment that shows higher perceptual quality when any later channel is retained instead of the corresponding prefix channel at the same total count, or that quality collapses abruptly at particular k values.
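The two failure conditions above can be phrased as simple checks. The sketch below is hypothetical: `score` stands for any quality function over a set of retained channel indices (higher is better, so a metric such as rFID would be negated first), and `toy_score` with its `weights` is made-up data, not a result from the paper.

```python
def prefix_beats_swaps(score, num_channels, k):
    """True if no single swap of a prefix channel for a later one scores higher."""
    prefix = set(range(k))
    base = score(tuple(sorted(prefix)))
    for dropped in range(k):
        for later in range(k, num_channels):
            candidate = (prefix - {dropped}) | {later}
            if score(tuple(sorted(candidate))) > base:
                return False
    return True


def abrupt_drops(quality_by_k, max_step):
    """Budgets whose quality falls more than max_step below the next larger budget."""
    ks = sorted(quality_by_k)
    return [lo for lo, hi in zip(ks, ks[1:])
            if quality_by_k[hi] - quality_by_k[lo] > max_step]


# Toy importance weights that decrease with channel index (illustrative only).
weights = [8, 7, 6, 5, 4, 3, 2, 1]


def toy_score(indices):
    return sum(weights[i] for i in indices)


ordered_ok = prefix_beats_swaps(toy_score, len(weights), k=3)
flagged = abrupt_drops({32: 1.0, 64: 1.1, 128: 5.0, 256: 5.2}, max_step=2.0)
```

A `False` from the first check, or a non-empty list from the second, would correspond to the kinds of evidence described above as counting against the ordering claim.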

Figures

Figures reproduced from arXiv: 2606.04461 by Arpit Bansal, Sukriti Paul, Tom Goldstein.

**Figure 1.** Quality-efficiency comparison. Reconstruction fidelity (rFID), decoding throughput, and model size across recent tokenizers. Our method achieves state-of-the-art rFID while being the smallest and among the fastest decoders.

…tokenization that adjusts representation length based on visual complexity. This need has become increasingly urgent in the era of large-scale vision models, where compute budgets constr…

**Figure 2.** Overview of our channel-wise flexible tokenizer. The encoder compresses the input image into a latent representation $z \in \mathbb{R}^{C \times h \times w}$. During training, we adaptively mask channels by retaining only the first $k$ active channels (shown in teal) while stopping gradients through inactive channels (shown in gray). Each active channel is independently quantized using Binary Spherical Quantization (BSQ). The decoder re…
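For readers unfamiliar with Binary Spherical Quantization, the sketch below shows the core operation as described in the BSQ paper [31]: project a vector onto the unit sphere, then snap each coordinate to $\pm 1/\sqrt{d}$. How ChannelTok turns one channel's $h \times w$ map into the vector being quantized is not stated in the caption, so the flat input vector here is an assumption, and the function name `bsq_quantize` is hypothetical. Training-time gradient handling is omitted.

```python
import math


def bsq_quantize(vec):
    """Return (bits, code): the sign bits and the quantized unit-norm vector."""
    d = len(vec)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0   # guard the all-zero vector
    unit = [v / norm for v in vec]                      # point on the unit sphere
    scale = 1.0 / math.sqrt(d)
    bits = [1 if u >= 0 else 0 for u in unit]           # zero is mapped to the positive side
    code = [scale if b else -scale for b in bits]       # each coordinate is +-1/sqrt(d)
    return bits, code


bits, code = bsq_quantize([0.3, -1.2, 0.0, 2.5])
```

The quantized code always has unit norm, so every channel token lies on the same sphere regardless of the magnitude of the encoder output.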

**Figure 3.** Performance across token budgets. Our method demonstrates consistent quality improvement across reconstruction metrics across token budgets while being computationally efficient.

We construct binary mask $M \in \{0, 1\}^{C \times h \times w}$ as: $M_c = \begin{cases} 1 & \text{if } c \le k \\ 0 & \text{otherwise} \end{cases}$ (1) The mask will be used to stochastically drop the tail of the feature tensor, promoting hierarchical organization where critical information concentrat…
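The mask in Eq. (1) is simple enough to write out directly. This is a sketch with nested Python lists standing in for a $C \times h \times w$ tensor; the helper names `build_mask` and `apply_mask` are illustrative, and the elementwise product is an assumed way of applying the mask.

```python
def build_mask(num_channels, height, width, k):
    """M[c][i][j] is 1 for the first k channels and 0 for the rest.

    Eq. (1) uses 1-based channel indices (c <= k); with 0-based Python
    indices the same condition reads c < k.
    """
    return [[[1 if c < k else 0 for _ in range(width)]
             for _ in range(height)]
            for c in range(num_channels)]


def apply_mask(z, mask):
    """Elementwise product of a latent z with a mask of the same shape."""
    return [[[zv * mv for zv, mv in zip(z_row, m_row)]
             for z_row, m_row in zip(z_ch, m_ch)]
            for z_ch, m_ch in zip(z, mask)]


C, h, w, k = 4, 2, 2, 2
z = [[[3.0] * w for _ in range(h)] for _ in range(C)]   # toy latent, all values 3.0
mask = build_mask(C, h, w, k)
masked = apply_mask(z, mask)
```

The mask is constant across spatial positions within a channel, so dropping the tail removes whole channels rather than individual locations.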

**Figure 4.** Qualitative comparison across token budgets. We show reconstructions from our model and prior flexible tokenizers at 32–256 tokens, along with the original image. Differences in sharpness, color consistency, and structural preservation can be observed as the token budget increases. Additional results are provided in the Appendix.

…a VGG-based perceptual loss $\mathcal{L}_{\text{perc}} = \sum_l \lVert \Phi_l(x) - \Phi_l(\hat{x}) \rVert_2^2$, (3) where $l$ in…
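The perceptual loss in Eq. (3) sums squared L2 distances between feature maps of the input and the reconstruction across layers. In the sketch below, the per-layer feature vectors are hand-written stand-ins for the VGG activations $\Phi_l$; no real network is involved, and the function name is illustrative.

```python
def perceptual_loss(feats_x, feats_xhat):
    """Sum over layers l of the squared L2 distance between feature vectors."""
    total = 0.0
    for fx, fy in zip(feats_x, feats_xhat):
        total += sum((a - b) ** 2 for a, b in zip(fx, fy))
    return total


# Two hypothetical layers with flattened features of different sizes.
feats_x = [[1.0, 2.0], [0.5, 0.5, 0.5]]
feats_xhat = [[1.0, 1.0], [0.5, 0.0, 0.5]]
loss = perceptual_loss(feats_x, feats_xhat)   # (0 + 1) + (0 + 0.25 + 0) = 1.25
```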

**Figure 5.** [Image only; no caption was extracted.]

**Figure 6.** Semantic organization in early channels. Channel swapping experiments demonstrate hierarchical semantic encoding. Each pair of rows shows two images whose first $t$ channels are progressively swapped. When channels from one image are replaced with another, the image progressively transforms from the source to the target.

…and disk I/O from timing measurements. All baselines were benchmarked under these ident…
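The channel-swapping experiment in this caption amounts to exchanging a prefix between two latents before decoding. A minimal sketch, assuming each latent is a list of channels and using string labels in place of real channel data; `swap_prefix` is a hypothetical helper name.

```python
def swap_prefix(z_a, z_b, t):
    """Return copies of z_a and z_b with their first t channels exchanged."""
    if not 0 <= t <= min(len(z_a), len(z_b)):
        raise ValueError("t must lie between 0 and the number of channels")
    return z_b[:t] + z_a[t:], z_a[:t] + z_b[t:]


z_a = ["a0", "a1", "a2", "a3"]
z_b = ["b0", "b1", "b2", "b3"]
mixed_a, mixed_b = swap_prefix(z_a, z_b, t=2)
# mixed_a == ["b0", "b1", "a2", "a3"]; mixed_b == ["a0", "a1", "b2", "b3"]
```

Sweeping `t` from 0 to the channel count and decoding each mixed latent would reproduce the progressive source-to-target transition the caption describes.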

**Figure 7.** Autoregressive image generation across token budgets. Images are generated by sampling from the LlamaGen [22] GPT-L transformer trained on discrete channel tokens. Generation begins from a randomly sampled first token and proceeds autoregressively, with remaining channels zero-filled at truncation. Even at 32 tokens (7.9× speedup), generated samples show coherent global structure, with quality improving pr…
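The zero-filling step mentioned in this caption can be sketched as padding a generated prefix up to the full channel count before decoding. The representation is an assumption: each channel token is shown as a short list of values, and a zero-valued list is taken to be what the decoder receives for a missing channel. The name `zero_fill` is illustrative.

```python
def zero_fill(generated, num_channels, channel_size):
    """Pad a generated prefix of channels with all-zero channels up to num_channels."""
    if len(generated) > num_channels:
        raise ValueError("more channels generated than the decoder expects")
    padding = [[0] * channel_size for _ in range(num_channels - len(generated))]
    return [list(ch) for ch in generated] + padding


generated = [[1, 0, 1], [0, 1, 1]]            # two toy channel tokens
full = zero_fill(generated, num_channels=4, channel_size=3)
```

Stopping generation early therefore costs nothing extra at decode time, since the decoder input keeps a fixed shape at every budget.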

**Figure 8.** Architectural ablations. (a) Effect of masking probability $p_{\text{mask}}$. (b) Effect of sampling bias on retention ratio $t$. (c) Effect of model scale. rFID consistently improves with more balanced masking, uniform sampling, and larger model capacity.

…adding progressive refinement. Additional training details on our baseline model and qualitative examples are in the supplementary material. 4.2. Semantic Transferabi…

**Figure 9.** DINOv2 classification accuracy across token budgets. Higher token budgets preserve more discriminative structure, mirroring the rFID trends.

B. Downstream Analysis. B.1. Autoregressive Image Generation (LlamaGen). To evaluate the efficacy of our flexible-length visual tokens, we train an autoregressive (AR) generation model following the LlamaGen framework. We adopt a GPT-L (Large) architecture with 343M p…

**Figure 10.** Semantic clustering of early channels. K-means clustering on the first 32 channels produces semantically coherent groups organized by scene characteristics (black, marine blue, greenery), hinting that early channels encode meaningful semantic structure. Beyond global scene attributes, clusters also align with object-level semantics, grouping marine life and birds into distinct regions. Crucially, this organis…

**Figure 11.** Token allocation across ImageNet-1K validation classes. Left: Rows 1–2 show complex classes that require high token counts: Coral fungus (498, 490), Toyshop (494), Rotisserie (491, 459, 442, 432), and Jinrikisha (422), all featuring intricate textures and fine-grained details. Rows 3–4 show visually simple classes that need far fewer tokens: Airship (5, 6, 9, 12), Parachute (9), and Nematode (12, 12, 13),…

**Figure 12.** Reconstruction with and without prefix masking. Each image pair shows channel progression across increasing token budgets. The first row is our flexible tokenizer and the second row is the baseline, which is architecturally identical but trained without channel-wise adaptive masking. Without masking, the baseline produces no meaningful reconstruction at low token counts, with recognisable structure emergi…

**Figure 13.** Autoregressive generation across token budgets. LlamaGen [22] GPT-L generations across diverse ImageNet-100 categories (birds, insects, annelids, and marine life) using discrete channel tokens with truncated channels zero-filled. Even at 32 tokens, outputs maintain coherent global structure, with fidelity improving progressively at higher budgets. Generation at such low token counts is made possible by ou…

**Figure 14.** Qualitative comparison on images with contrasting tones. Top: A jellyfish against a dark background, where our method preserves color fidelity even at lower token budgets. Bottom: A butterfly on a flower, where subtle wing textures and fine details emerge progressively with increasing tokens. Our method maintains perceptual coherence and colour consistency across all budgets.

**Figure 15.** Qualitative comparison on images with varied textures. Top: A red mushroom with white spots against a mossy background, where our method preserves fine surface detail and color fidelity even at low token budgets. Bottom: A dark round fruit, where competing methods introduce color artifacts and lose surface sheen at low tokens, while ours maintains perceptual consistency across all budgets.

**Figure 16.** Reconstructions on cases with text and vibrant colours. Top: Christmas stocking with text, where legibility remains difficult at low token counts but improves by 128 tokens. Bottom: A geyser eruption scene, where our method recovers landscape structure.

Original abstract

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChannelTok's channel-wise tokens plus tail-dropping is a real shift from spatial tokenizers, but the claim that this naturally produces ordered importance lacks direct support.

The main point is that this paper moves away from spatial tokens and heavy generative decoders by treating each latent channel as a token in a CNN-Transformer hybrid. Stochastic tail-dropping during training is meant to push important information into the early channels so that inference can just truncate to the first k for different budgets.

The efficiency side is the clearest strength. The model is reported at 159M parameters with 8.6x faster decoding and rFID 2.92 on ImageNet, which would matter for anyone running large vision or multimodal pipelines if the numbers hold.

The ordering mechanism is the soft spot. Nothing in the abstract or described setup shows per-channel importance metrics or an ablation confirming that truncation at test time matches the training distribution. It is possible the decoder simply learns to ignore later channels without the early ones being monotonically ranked by semantic content. If that is true, the flexible-length claim rests on an unverified assumption.

The experiments claim consistent quality across budgets, but the abstract gives no detail on exact baselines, controls, or statistical checks, so the SOTA positioning is hard to assess from what is shown.

This is for researchers focused on practical tokenization efficiency rather than pure theory. A reader who needs lighter backbones and variable-length generation could extract useful ideas even if the importance-ranking story needs more proof.

It deserves peer review. The paradigm change is concrete and the efficiency angle is worth referee scrutiny, even with the gaps around the ordering evidence.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ChannelTok, a channel-wise flexible-length vision tokenizer based on a lightweight CNN-Transformer hybrid backbone. It employs stochastic tail-dropping during training to induce semantic ordering of latent channels, enabling inference-time flexible compression by retaining the first k channels and supporting variable-length autoregressive generation. On ImageNet, it reports state-of-the-art perceptual quality (rFID 2.92) together with 8.6× faster decoding and a 2.1× smaller model (159M parameters) relative to the next-best flexible tokenizer.

Significance. If the tail-dropping mechanism reliably produces monotonically ordered channels and the reported efficiency/quality numbers hold under controlled comparison, the work would establish a practical alternative to spatial-token paradigms for efficient, variable-budget visual tokenization, with direct implications for autoregressive image models and compression pipelines.

major comments (2)

[Abstract and method description] The central claim that stochastic tail-dropping 'naturally forces channels to organize by semantic importance' (Abstract) is load-bearing for the flexible-length inference procedure, yet no per-channel ablation, importance ranking, or monotonicity test is presented to distinguish this from the decoder simply learning to ignore dropped channels. Without such evidence, the equivalence between training distribution and test-time truncation remains unverified.
[Abstract and experimental section] The SOTA claims (rFID 2.92, 8.6× decoding speedup, 159M parameters) rest on empirical results whose experimental setup, baseline implementations, controls, and statistical significance are not detailed in the abstract or summary of results, making it impossible to assess whether the reported frontier is robust.

minor comments (1)

[Method] Notation for the channel dimension and the tail-dropping probability schedule should be introduced with explicit equations rather than prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract and method description] The central claim that stochastic tail-dropping 'naturally forces channels to organize by semantic importance' (Abstract) is load-bearing for the flexible-length inference procedure, yet no per-channel ablation, importance ranking, or monotonicity test is presented to distinguish this from the decoder simply learning to ignore dropped channels. Without such evidence, the equivalence between training distribution and test-time truncation remains unverified.

Authors: We agree that providing explicit evidence for the semantic ordering induced by stochastic tail-dropping would strengthen the paper. The training procedure is intended to enforce this by exposing the model to random truncations, encouraging earlier channels to capture more critical information. To address this, we will include additional experiments in the revision, such as per-channel ablation studies measuring reconstruction quality when dropping channels in different orders, and visualizations of channel importance rankings to demonstrate monotonicity.

revision: yes

Referee: [Abstract and experimental section] The SOTA claims (rFID 2.92, 8.6× decoding speedup, 159M parameters) rest on empirical results whose experimental setup, baseline implementations, controls, and statistical significance are not detailed in the abstract or summary of results, making it impossible to assess whether the reported frontier is robust.

Authors: The full experimental details, including dataset splits, training procedures, baseline re-implementations, and evaluation metrics, are described in Section 4 (Experiments) of the manuscript. We acknowledge that the abstract and result summary could be more self-contained. In the revision, we will augment the abstract with a concise description of the evaluation protocol and add a dedicated paragraph in the results section summarizing the controls and statistical significance of the reported improvements.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation rather than self-referential derivations

The paper presents a channel-wise tokenizer using stochastic tail-dropping to enable flexible compression, with performance claims (rFID 2.92, speed/size gains) backed by ImageNet experiments. No equations, fitted-parameter predictions, or self-citation chains are shown that reduce the ordering claim or results to inputs by construction. The 'naturally forces' assertion is an empirical hypothesis tested via experiments, not a definitional or fitted reduction. This is self-contained against external benchmarks, warranting a low score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters or axioms; the ordering-by-importance property is presented as an emergent outcome of training rather than an explicitly postulated entity.

Reference graph

Works this paper leans on

32 extracted references · 16 canonical work pages · 4 internal anchors

[1] Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Resampling images into 1d token sequences of flexible length. arXiv preprint arXiv:2502.13967, 2025.

[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. 2013.

[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.

[4] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods, 3(1):1–27, 1974.

[5] Shivam Duggal, Sanghyun Byun, William T Freeman, Antonio Torralba, and Phillip Isola. Adaptive length image tokenization via recurrent allocation. arXiv preprint arXiv:2411.02393, 2024.

[6] Shivam Duggal, Sanghyun Byun, William T Freeman, Antonio Torralba, and Phillip Isola. Single-pass adaptive image tokenization for minimum program search. arXiv preprint arXiv:2507.07995, 2025.

[7] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.

[8] Stephanie Fu, Netanel Y Ramesh, Vongani H Xie, Yue Luo, Philip HS Torr, Joshua B Tenenbaum, Olga Russakovsky, William T Freeman, and Stephanie Wong. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In Advances in Neural Information Processing Systems, 2023.

[9] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.

[11] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9307–9315, 2024.

[12] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019.

[13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[14] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham M. Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

[15] Lingjun Mao, Zikang Jin, Haokui Wang, Xiaodan Zhang, and Xin Li. Images are worth variable length of representations. arXiv preprint arXiv:2506.03643, 2025.

[16] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023.

[17] Kazuki Miwa, Go Irie, Yuki Nakashima, and Rin-ichiro Taniguchi. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025.

[18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal,... 2023

[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S Bernstein, Alexander C Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2014.

[21] Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, and Chunting Zhou. Cat: Content-adaptive image tokenization. arXiv preprint arXiv:2501.03120, 2025.

[22] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

[23] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Detailflow: 1d coarse-to-fine autoregressive image generation via next-detail prediction. arXiv preprint arXiv:2505.21473, 2024.

[24] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.

[25] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[26] Wentao Wu, Libin Huang, Wenyi Xu, Qi Chen, Yue Zhang, and Weiwei Zhou. AToken: Adaptive tokenization for vision transformers. arXiv preprint arXiv:2509.14476, 2024.

[27] Wilson Yan, Matei Zaharia, Volodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:2410.08368, 2024.

[28] Jingfeng Yao and Xinggang Wang. Quantize-then-rectify: Efficient vq-vae training. arXiv preprint arXiv:2507.10547, 2025.

[29] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.

[30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[31] Yue Zhao, Yuanjun Panda, Zhengzhong Xu, Zhenzhong Wang, Gaurav Kumar, Yu Zhang, Jinshuo Zhou, Yan Chen, Guan Wang, Jiaqi Zhang, et al. Image and video tokenization with binary spherical quantization. arXiv preprint arXiv:2406.07548, 2024.

[32] Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Fangyikang Wang, Ying Zhang, Chen Li, and Yali Wang. Wetok: Powerful discrete tokenization for high-fidelity visual reconstruction. arXiv preprint arXiv:2508.05599, 2025.

[1] [1]

arXiv preprint arXiv:2502.13967 , year=

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O ˘guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El- Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Re- sampling images into 1d token sequences of flexible length. arXiv preprint arXiv:2502.13967, 2025

work page arXiv 2025

[2] [2]

Es- timating or propagating gradients through stochastic neurons for conditional computation

Yoshua Bengio, Nicholas L´eonard, and Aaron Courville. Es- timating or propagating gradients through stochastic neurons for conditional computation. 2013

2013

[3] [3]

Large scale GAN training for high fidelity natural image synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. InInt. Conf. Learn. Represent., 2019

2019

[4] [4]

A dendrite method for cluster analysis.Communications in Statistics-Theory and Methods, 3(1):1–27, 1974

Tadeusz Cali´nski and Jerzy Harabasz. A dendrite method for cluster analysis.Communications in Statistics-Theory and Methods, 3(1):1–27, 1974

1974

[5] [5]

Adaptive length im- age tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393, 2024

Shivam Duggal, Sanghyun Byun, William T Freeman, An- tonio Torralba, and Phillip Isola. Adaptive length im- age tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393, 2024

work page arXiv 2024

[6] [6]

Freeman, Antonio Torralba, and Phillip Isola

Shivam Duggal, Sanghyun Byun, William T Freeman, Anto- nio Torralba, and Phillip Isola. Single-pass adaptive image tokenization for minimum program search.arXiv preprint arXiv:2507.07995, 2025

work page arXiv 2025

[7] [7]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bj¨orn Ommer. Taming transformers for high-resolution image synthesis. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021

2021

[8] [8]

Dreamsim: Learn- ing new dimensions of human visual similarity using synthetic data

Stephanie Fu, Netanel Y Ramesh, V ongani H Xie, Yue Luo, Philip HS Torr, Joshua B Tenenbaum, Olga Russakovsky, William T Freeman, and Stephanie Wong. Dreamsim: Learn- ing new dimensions of human visual similarity using synthetic data. InAdvances in Neural Information Processing Systems, 2023

2023

[9] [9]

Reducing the dimensionality of data with neural networks.Science, 313 (5786):504–507, 2006

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks.Science, 313 (5786):504–507, 2006

2006

[10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017

[11] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9307–9315, 2024

[12] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019

[13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

[14] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham M. Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

[15] Lingjun Mao, Zikang Jin, Haokui Wang, Xiaodan Zhang, and Xin Li. Images are worth variable length of representations. arXiv preprint arXiv:2506.03643, 2025

[16] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

[17] Kazuki Miwa, Go Irie, Yuki Nakashima, and Rin-ichiro Taniguchi. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025

[18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018

[19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal,... 2023

[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S Bernstein, Alexander C Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2014

[21] Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, and Chunting Zhou. Cat: Content-adaptive image tokenization. arXiv preprint arXiv:2501.03120, 2025

[22] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

[23] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Detailflow: 1d coarse-to-fine autoregressive image generation via next-detail prediction. arXiv preprint arXiv:2505.21473, 2024

[24] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017

[25] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

[26] Wentao Wu, Libin Huang, Wenyi Xu, Qi Chen, Yue Zhang, and Weiwei Zhou. AToken: Adaptive tokenization for vision transformers. arXiv preprint arXiv:2509.14476, 2024

[27] Wilson Yan, Matei Zaharia, Volodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:2410.08368, 2024

[28] Jingfeng Yao and Xinggang Wang. Quantize-then-rectify: Efficient vq-vae training. arXiv preprint arXiv:2507.10547, 2025

[29] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion – tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

[30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018

[31] Yue Zhao, Yuanjun Panda, Zhengzhong Xu, Zhenzhong Wang, Gaurav Kumar, Yu Zhang, Jinshuo Zhou, Yan Chen, Guan Wang, Jiaqi Zhang, et al. Image and video tokenization with binary spherical quantization. arXiv preprint arXiv:2406.07548, 2024

[32] Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Fangyikang Wang, Ying Zhang, Chen Li, and Yali Wang. Wetok: Powerful discrete tokenization for high-fidelity visual reconstruction. arXiv preprint arXiv:2508.05599, 2025

Supplementary Material A. Training and Evaluation Details A....