arxiv: 2605.21171 · v1 · pith:BXFVV6HDnew · submitted 2026-05-20 · 💻 cs.CV

FTerViT: Fully Ternary Vision Transformer

Pith reviewed 2026-05-21 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords ternary quantization, vision transformer, model compression, microcontroller deployment, edge AI, ImageNet classification, quantization-aware training

The pith

Vision Transformers can be made fully ternary by ternarizing every weight matrix and normalization parameter, which yields roughly 15x compression and enables microcontroller deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a Vision Transformer can have all its weight matrices and LayerNorm parameters forced into ternary representation without leaving any full-precision weight matrix or normalization parameter behind. Two new operators make this possible: one performs the patch-embedding convolution with ternary weights, the other applies normalization with ternary parameters. Training proceeds by knowledge distillation followed by a short recovery stage. A reader would care because earlier ternary ViT approaches left patch embeddings, norms, and heads in floating point; in compact models those leftovers dominate the memory footprint, which caps the savings and severely limits deployment on tiny chips. If the claim holds, a major obstacle to running competitive vision models on the smallest embedded processors is removed.

Core claim

FTerViT fully ternarizes every weight matrix and normalization parameter inside a Vision Transformer. The method introduces TernaryBitConv2d with per-channel scaling for the patch embedding and TernaryLayerNorm, then trains via knowledge distillation plus a lightweight quantization-aware recovery phase. The resulting W2A8 (2-bit ternary weights, 8-bit activations) DeiT-III-S model at 384x384 resolution reaches 82.43% ImageNet-1K top-1 accuracy at 6.09 MB, a roughly 15x compression from FP32 with a 2.42 percentage point drop, and outperforms earlier ternary ViT methods by up to 8 points. The same approach yields the first reported ternary Vision Transformer running on the dual-core XTensa LX7 inside the ESP32-S3, delivering 79.64% ImageNet-1K top-1 accuracy with the 224x224 model at 5.81 MB.
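
One way to picture the weight side of this recipe is per-channel ternarization: each output channel keeps a single floating-point scale and its weights collapse to {-1, 0, +1}. The sketch below is illustrative only. It assumes an absmean scale (the rule popularized by 1.58-bit language models), which this summary does not confirm as the rule used inside TernaryBitConv2d, and the function names are hypothetical.

```python
def ternarize_per_channel(weights, eps=1e-8):
    """Ternarize a weight matrix row by row (one row per output channel).

    Assumption: the scale is the mean absolute value of the row (absmean).
    Returns (codes, scales) where every code is -1, 0, or +1.
    """
    codes, scales = [], []
    for row in weights:
        scale = max(sum(abs(w) for w in row) / len(row), eps)
        # Round to the nearest integer, then clip into the ternary set.
        codes.append([max(-1, min(1, round(w / scale))) for w in row])
        scales.append(scale)
    return codes, scales


def dequantize(codes, scales):
    """Rebuild an approximate float matrix from ternary codes and scales."""
    return [[s * c for c in row] for row, s in zip(codes, scales)]


weights = [[0.9, -0.05, -1.1, 0.4],
           [0.01, 0.02, -0.03, 0.0]]
codes, scales = ternarize_per_channel(weights)
approx = dequantize(codes, scales)
```

Because each row has its own scale, the second row (whose weights are tiny) still uses all three ternary levels instead of collapsing to zeros, which is the usual motivation for per-channel rather than per-tensor scaling.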

What carries the argument

The TernaryBitConv2d operator (with per-channel scaling) for the patch embedding, together with the TernaryLayerNorm operator. These replace the last full-precision blocks, so the entire model runs with ternary weights and 8-bit activations (W2A8).

If this is right

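  • Ternarizing the patch embedding, normalization parameters, and head shrinks the model to 5.81-6.09 MB and cuts the FP32 share of stored bytes to about 4% for DeiT-III-Small (Figure 1), making storage on microcontroller-class devices feasible.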
  • Ternary ViTs can now execute inference directly on the ESP32-S3 without external memory for floating-point values.
  • Accuracy stays within 2.42 points of the floating-point baseline at 384x384 (3.44 points at 224x224, per Figure 2), while beating previous partial-ternary ViT methods by up to 8 points.
  • The same full-ternarization recipe can be applied to other compact ViT backbones to produce variants suitable for real-time edge vision.
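
The storage numbers can be sanity-checked with a few lines of arithmetic. The sketch below uses the DeiT-III-Small figures quoted in the Figure 1 caption on this page (88.3 MB in FP32, 5.81 MB fully ternary). It assumes 4 bytes per FP32 parameter and 2 bits per ternary weight; the split into "ideal" and "overhead" bytes is an illustration, not a breakdown reported by the paper.

```python
FP32_MB = 88.3       # DeiT-III-Small in FP32, from the Figure 1 caption
TERNARY_MB = 5.81    # fully ternary model at 224x224, same caption

# Compression ratio implied by the two reported sizes.
ratio = FP32_MB / TERNARY_MB

# Assumption: every FP32 parameter takes 4 bytes, so MB / 4 = millions of params.
params_millions = FP32_MB / 4

# Assumption: a ternary weight is stored in 2 bits (the "W2" in W2A8).
ideal_2bit_mb = params_millions * 2 / 8

# Whatever is left over must hold scales and any other non-2-bit data.
overhead_mb = TERNARY_MB - ideal_2bit_mb

print(f"compression ratio : {ratio:.1f}x")
print(f"implied parameters: {params_millions:.1f} M")
print(f"ideal 2-bit size  : {ideal_2bit_mb:.2f} MB")
print(f"overhead          : {overhead_mb:.2f} MB")
```

Under these assumptions the ratio comes out near 15x, and roughly 0.3 MB remains beyond the ideal 2-bit payload, which is in the same range as the caption's statement that a small share of bytes stays at FP32.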

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could extend to other transformer families, allowing similar memory reductions for language models on embedded hardware.
  • Combining full ternarization with further optimizations such as structured pruning might push sizes below 5 MB while preserving accuracy.
  • Successful microcontroller deployment suggests that ternary ViTs could support always-on camera applications in battery-powered sensors without cloud offload.

Load-bearing premise

That knowledge distillation followed by a short quantization-aware recovery phase can restore usable accuracy after every weight and normalization parameter has been forced into ternary form without any full-precision components remaining.
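
For readers unfamiliar with the first training stage, a distillation objective can be sketched as a temperature-softened KL divergence between teacher and student logits. This is the generic textbook formulation, offered as an assumption: the page does not state which distillation loss, temperature, or teacher the paper uses, and the function names below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: classic soft-label distillation; the temperature value
    is illustrative and not taken from the paper.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return temperature ** 2 * kl


teacher = [2.0, 0.0, -1.0]
matching_student = [2.0, 0.0, -1.0]
wrong_student = [0.0, 2.0, -1.0]
loss_match = kd_loss(matching_student, teacher)
loss_wrong = kd_loss(wrong_student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is the signal a ternary student would be trained against in a setup of this kind.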

What would settle it

Train the identical architecture with only the distillation stage and no recovery phase, then measure whether ImageNet top-1 accuracy falls below 79 percent when the model runs on the ESP32-S3.

Figures

Figures reproduced from arXiv: 2605.21171 by Engin Türetken, Michele Magno, Nadim Maamari, Pietro Bonazzi, Simon Narduzzi, Szymon Ruciński.

Figure 1: (a) DeiT-III-Small size vs. accuracy; FTerViT (W2A8): 82.43% at 6.09 MB (384²) / 79.64% at 5.81 MB (224²). FTerViT based on DeiT-S224 reaches 77.47% under ternary quantization. (b) DeiT-Tiny storage; partial-W2 leaves 38% of bytes at FP32, fully ternary drops the share to 10%. DeiT-III-Small follows the same trend (24% partial, 4% fully ternary; 88.3 MB → 5.81 MB). The resulting model substantially narro…
Figure 2: (a) Phase 1 training of DeiT-III-S224 on ImageNet-1K for 250 epochs. Validation top-1 saturates near 78% and never bridges the gap to FP32 (83.08%). (b) Phase 2 fine-tuning from five P1 checkpoints (epochs 30–400). P1@250 converges to ∼79.64% top-1 (−3.44 pp vs. FP32) in 10 epochs. P1@400 reaches 79.61%, while early checkpoints (P1@30, P1@60) recover far less. To better understand the source of performance…
Figure 3: Component-wise fidelity of fully ternary ViTs. (a) Global distribution of weights constrained to {−1, 0, +1}. (b,c,d) FP32–ternary alignment across key components of DeiT-III-S224 shows strong preservation of representational structure despite ternary quantization. cosine). The shift parameter is reproduced less precisely (0.71±0.09), suggesting that scale dominates. Finally, output behavior remains consi…
Figure 4: Attention rollout on 10 additional ImageNet-1K classes. FTerViT-DeiT-III-S
Figure 5: ESP32-S3-EYE board running on-device DeiT-III-S
Figure 6: PPK2 power measurements on ESP32-S3-EYE (Nordic PPK2 probe, VOUT rail).
Original abstract

Ternary Vision Transformers offer substantial model compression; however, state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which all weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators: TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384×384 resolution achieves 82.43% ImageNet-1K top-1 at 6.09 MB (~15× compression, −2.42 pp vs. FP32), outperforming prior ternary ViT methods by up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual-core XTensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224×224 resolution, 5.81 MB), we achieve 79.64% ImageNet-1K top-1 accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FTerViT, a fully ternarized Vision Transformer in which all weight matrices and normalization parameters (including patch embeddings, LayerNorm, and classifier heads) are forced into ternary representation, unlike prior methods that retain full-precision components for these elements. Novel operators TernaryBitConv2d (with per-channel scaling) and TernaryLayerNorm are proposed. Training uses knowledge distillation followed by a lightweight quantization-aware recovery phase. The central empirical result is a W2A8 DeiT-III-S model at 384×384 resolution achieving 82.43% ImageNet-1K top-1 accuracy at 6.09 MB (~15× compression, −2.42 pp vs. FP32), outperforming prior ternary ViT methods by up to 8 pp, plus the first reported deployment of a ternary ViT on an ESP32-S3 microcontroller (79.64% for the 224×224 Small variant at 5.81 MB).

Significance. If the accuracy numbers and hardware demonstration hold under scrutiny, this work would be significant for enabling Vision Transformers on severely resource-constrained microcontrollers by eliminating the memory footprint of retained full-precision components. The full-ternarization approach directly targets a practical deployment bottleneck and includes a concrete on-device implementation, which strengthens its engineering contribution if the training procedure proves robust.
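
To make the W2A8 setting concrete, the sketch below quantizes activations to signed 8-bit integers and computes a dot product against ternary weights using only additions and subtractions, which is the property that makes ternary models attractive on microcontrollers without fast floating-point multipliers. It is a minimal illustration under stated assumptions: per-tensor absmax activation scaling and the function names are choices made here, not details taken from the paper.

```python
def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (assumed absmax scaling)."""
    amax = max(abs(v) for v in x) or 1.0   # avoid a zero scale for all-zero input
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def ternary_dot(q_act, act_scale, codes, weight_scale):
    """Dot product of int8 activations with ternary codes.

    The inner loop needs no multiplications: +1 adds, -1 subtracts, 0 skips.
    A single float rescale is applied once at the end.
    """
    acc = 0
    for a, c in zip(q_act, codes):
        if c == 1:
            acc += a
        elif c == -1:
            acc -= a
    return acc * act_scale * weight_scale


activations = [0.5, -1.0, 0.25, 2.0]
codes = [1, -1, 0, 1]          # ternary weights for one output channel
weight_scale = 1.0             # per-channel scale, set to 1 for readability

q_act, act_scale = quantize_int8(activations)
result = ternary_dot(q_act, act_scale, codes, weight_scale)
exact = sum(a * c * weight_scale for a, c in zip(activations, codes))
```

Here `exact` is the floating-point reference (3.5) and `result` differs from it only by int8 rounding error.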

major comments (3)
  1. [Training Procedure / Experimental Setup] The central accuracy claim (82.43% top-1 for the 384×384 W2A8 model) rests on the two-stage procedure (knowledge distillation followed by lightweight quantization-aware recovery) being sufficient to offset the drop from complete ternarization of weights, embeddings, and all LayerNorm parameters. No ablations are shown on recovery-phase length, learning rates, or accuracy immediately before versus after the recovery stage; this is load-bearing because ternarizing LayerNorm and patch embeddings is more aggressive than in prior partial-ternary ViT works.
  2. [Method / TernaryLayerNorm definition] TernaryLayerNorm is introduced as a novel operator to handle normalization parameters in ternary form, yet the manuscript provides insufficient detail on its forward/backward pass, how per-channel scaling factors interact with it, and whether it introduces additional free parameters beyond those already listed. This directly affects both reproducibility and the strength of the 'fully ternary' claim.
  3. [Hardware Implementation / Results] Hardware demonstration on the dual-core XTensa LX7 inside the ESP32-S3 is presented as a first, but the manuscript lacks quantitative details on inference latency, power, or memory usage during on-device execution, which are necessary to substantiate the deployment feasibility claim alongside the accuracy numbers.
minor comments (2)
  1. [Abstract and §4] The abstract and method sections use 'lightweight quantization-aware recovery phase' without specifying its duration or hyper-parameters; adding these would improve clarity without altering the core claims.
  2. [Table of results] Ensure all tables comparing against prior ternary ViT methods explicitly state the bit-widths for weights, activations, and any retained full-precision components so that the 'up to 8 pp' improvement is directly verifiable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the manuscript's clarity, reproducibility, and completeness.

  1. Referee: [Training Procedure / Experimental Setup] The central accuracy claim (82.43% top-1 for the 384×384 W2A8 model) rests on the two-stage procedure (knowledge distillation followed by lightweight quantization-aware recovery) being sufficient to offset the drop from complete ternarization of weights, embeddings, and all LayerNorm parameters. No ablations are shown on recovery-phase length, learning rates, or accuracy immediately before versus after the recovery stage; this is load-bearing because ternarizing LayerNorm and patch embeddings is more aggressive than in prior partial-ternary ViT works.

    Authors: We agree that additional ablations would strengthen the validation of the two-stage training procedure. In the revised manuscript, we will include new experiments ablating recovery-phase length and learning rates, along with accuracy metrics immediately before and after the recovery stage. These additions will better illustrate how the lightweight quantization-aware recovery offsets performance drops from full ternarization of LayerNorm and embeddings. revision: yes

  2. Referee: [Method / TernaryLayerNorm definition] TernaryLayerNorm is introduced as a novel operator to handle normalization parameters in ternary form, yet the manuscript provides insufficient detail on its forward/backward pass, how per-channel scaling factors interact with it, and whether it introduces additional free parameters beyond those already listed. This directly affects both reproducibility and the strength of the 'fully ternary' claim.

    Authors: We appreciate this observation on the need for greater methodological detail. We will revise the manuscript to provide explicit mathematical formulations for the forward and backward passes of TernaryLayerNorm, clarify how per-channel scaling factors from TernaryBitConv2d interact with it during normalization, and explicitly confirm that no additional free parameters are introduced beyond the ternary weights and scales already described. revision: yes

  3. Referee: [Hardware Implementation / Results] Hardware demonstration on the dual-core XTensa LX7 inside the ESP32-S3 is presented as a first, but the manuscript lacks quantitative details on inference latency, power, or memory usage during on-device execution, which are necessary to substantiate the deployment feasibility claim alongside the accuracy numbers.

    Authors: We acknowledge the value of these metrics for fully substantiating on-device feasibility. The model size (5.81 MB for the Small variant) is already reported and directly addresses the memory footprint, which is the primary constraint on microcontrollers. We will expand the hardware section in revision to include any available runtime memory measurements and implementation notes. Comprehensive latency and power figures were not obtained in our initial deployment experiments due to hardware profiling limitations; we will note this limitation and identify it as future work while highlighting the achieved on-device accuracy. revision: partial
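
The link between "ternary" and the reported megabyte figures depends on how weights are packed. A common choice, assumed here and not confirmed by the page, is to store each ternary code in 2 bits so that four weights share one byte. The helper names and the bit layout (code + 1, least significant pair first) are hypothetical.

```python
def pack_ternary(codes):
    """Pack ternary codes (-1, 0, +1) into bytes, four codes per byte.

    Each code is stored as code + 1 (0, 1, or 2) in a 2-bit field.
    """
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary codes from packed bytes."""
    codes = []
    for byte in packed:
        for j in range(4):
            codes.append(((byte >> (2 * j)) & 0b11) - 1)
    return codes[:count]   # drop padding fields in the last byte


original = [1, 0, -1, 1, -1, 0]
packed = pack_ternary(original)
restored = unpack_ternary(packed, len(original))
```

Six weights fit in two bytes here, against 24 bytes in FP32. A denser base-3 packing is possible in principle, but 2-bit fields keep decoding to shifts and masks.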

Circularity Check

0 steps flagged

No circularity: empirical results from measured accuracies

This is an empirical engineering paper whose central claims consist of measured ImageNet-1K top-1 accuracies (e.g., 82.43% for W2A8 DeiT-III-S at 384×384) obtained after applying the proposed TernaryBitConv2d and TernaryLayerNorm operators plus a two-stage training procedure. No mathematical derivation, first-principles prediction, or uniqueness theorem is presented that reduces by construction to fitted inputs, self-citations, or renamed empirical patterns. The reported compression ratios and hardware deployment results are direct experimental outcomes rather than self-referential definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work rests on standard assumptions of knowledge-distillation-based quantization recovery and the effectiveness of per-channel scaling for ternary convolutions; no new physical entities are postulated.

free parameters (1)
  • per-channel scaling factors
    Introduced in TernaryBitConv2d to compensate for the limited dynamic range of ternary weights in the patch embedding layer.
axioms (1)
  • domain assumption: Knowledge distillation followed by quantization-aware recovery can restore accuracy after full ternarization of all parameters
    Invoked to justify the training procedure that enables the reported accuracy numbers.
invented entities (1)
  • TernaryLayerNorm (no independent evidence)
    purpose: To allow LayerNorm parameters to be represented in ternary format while preserving normalization behavior
    New operator required to achieve full ternarization of normalization layers.
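
Since the page gives no formula for TernaryLayerNorm, the following is a purely hypothetical sketch of what a layer normalization with ternary affine parameters could look like: the learned gain and shift vectors are each replaced by ternary codes plus one scalar scale. The absmean rule, the single scale per vector, and all names are assumptions made for illustration; the paper's actual operator may differ.

```python
import math


def ternarize(values, eps=1e-8):
    """Map a vector to ternary codes plus one scalar scale (assumed absmean)."""
    scale = max(sum(abs(v) for v in values) / len(values), eps)
    codes = [max(-1, min(1, round(v / scale))) for v in values]
    return codes, scale


def ternary_layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm whose gain (gamma) and shift (beta) are ternarized.

    Hypothetical: normalization statistics are computed as usual, and only
    the stored affine parameters are constrained to ternary codes.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    g_codes, g_scale = ternarize(gamma)
    b_codes, b_scale = ternarize(beta)
    return [g_scale * g * (v - mean) * inv_std + b_scale * b
            for v, g, b in zip(x, g_codes, b_codes)]


x = [1.0, 2.0, 3.0, 4.0]
out = ternary_layer_norm(x, gamma=[1.0, 1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0, 0.0])
```

With unit gain and zero shift the output matches ordinary layer normalization, so any accuracy cost in this sketch would come only from how coarsely ternary codes approximate non-trivial gain and shift vectors.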

pith-pipeline@v0.9.0 · 5829 in / 1446 out tokens · 39781 ms · 2026-05-21T05:56:40.906656+00:00 · methodology



A Experiments Appendix

A.1 Benchmark Results on CIFAR-10 and CIFAR-100

As shown in Table 6, our ternary model achieves 97.43% top-1 accuracy on CIFAR-10 and 86.01% on CIFAR-100. These results are within 0.09...