arxiv: 2605.16423 · v1 · pith:ZDAX3U4Cnew · submitted 2026-05-14 · 💻 cs.CV

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

Pith reviewed 2026-05-20 20:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords post-training quantization · outlier handling · nonlinear compensation · bipolar logarithmic transformation · model compression · neural network quantization · efficient inference

The pith

Nonlinear compensation via logarithmic mapping reduces outlier damage in post-training quantization while keeping computation light.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a practical weakness in compensation-based quantization: outliers in weights or activations still cause large accuracy drops even after lightweight linear corrections are added. It claims that a nonlinear Bipolar Logarithmic Transformation (BLT), applied to both the quantized input and the quantization error, moves those outliers into a range where a plain linear layer can correct them effectively. The method stays efficient because the compensation itself is only a single linear layer in the transformed space. If the claim holds, it matters because post-training quantization is already the cheapest way to shrink models for deployment; an accuracy gain without retraining or significant extra cost would make the technique more reliable across real networks.

Core claim

NBC introduces nonlinear compensation to reduce the effect of outliers, and BLT maps both the quantized input and the quantization error into a transformed space where a simple linear layer performs compensation while preserving efficiency.

What carries the argument

Bipolar Logarithmic Transformation (BLT), a mapping applied jointly to the quantized input and the quantization error that compresses outliers so a subsequent linear layer can perform compensation.
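
The page does not give the functional form of BLT, so the following is only a minimal sketch under an assumed form: a sign-preserving logarithm, sign(x) * log(1 + |x|), with its exact inverse. The names `blt` and `inverse_blt` are hypothetical and the paper's actual transform may include scaling or offset parameters not shown here. The sketch illustrates the property the argument relies on: large magnitudes are compressed, zero and sign are preserved, and the mapping can be undone.

```python
import numpy as np

def blt(x):
    """ASSUMED form of a bipolar log transform: sign(x) * log(1 + |x|)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(np.abs(x))

def inverse_blt(y):
    """Inverse of the assumed transform: sign(y) * (exp(|y|) - 1)."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * np.expm1(np.abs(y))

# Two outliers of magnitude 1000 next to ordinary values.
values = np.array([-1000.0, -1.0, 0.0, 0.5, 1000.0])
mapped = blt(values)            # outliers land near +/- log(1001)
restored = inverse_blt(mapped)  # round-trip back to the original domain
```

Under this assumed form the transform is odd-symmetric, which is one plausible reading of "bipolar": positive and negative outliers are compressed identically.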

If this is right

  • Quantized networks achieve higher accuracy than prior linear-compensation methods on the same bit-widths.
  • The added layer remains cheap enough that overall inference speed stays comparable to standard post-training quantization.
  • The approach works across multiple quantization algorithms and network architectures without retraining.
  • Outlier sensitivity drops, allowing lower bit-widths to remain usable on tasks where they previously failed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transformed-space idea might be tested on other post-training compression steps such as pruning or low-rank approximation.
  • If the log mapping proves stable, it could be applied once per layer rather than per tensor to further reduce overhead.
  • A natural next measurement is whether the recovered accuracy holds when the quantized model is fine-tuned for only a few epochs.

Load-bearing premise

Mapping both input and error through the bipolar log transform will compress outliers enough that the linear compensation layer recovers accuracy without leaving model-specific or bit-width-specific distortions unaddressed.
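
One way to probe this premise on toy data is sketched below. Everything in it is an assumption made for illustration, not the authors' procedure: the transform is taken to be sign(x) * log(1 + |x|), the compensation layer is fitted by ordinary least squares, the "block" is a single matrix multiply, and the quantizers are plain uniform rounding. The helper names (`fit_linear`, `fit_nbc`, `apply_nbc`) are hypothetical.

```python
import numpy as np

def blt(x):
    # ASSUMED transform; the source does not state the functional form.
    return np.sign(x) * np.log1p(np.abs(x))

def inverse_blt(y):
    return np.sign(y) * np.expm1(np.abs(y))

def fit_linear(inputs, targets):
    """Least-squares fit of targets ~= inputs @ W + b."""
    design = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]

def fit_nbc(x_q, err):
    """Fit the linear layer in the transformed space (assumed procedure)."""
    return fit_linear(blt(x_q), blt(err))

def apply_nbc(x_q, W, b):
    """Predict the error in the transformed space, then map it back."""
    return inverse_blt(blt(x_q) @ W + b)

# Toy data: Gaussian activations with about 1% scaled-up outliers.
rng = np.random.default_rng(0)
n, d = 512, 8
x_fp = rng.standard_normal((n, d))
x_fp[rng.random((n, d)) < 0.01] *= 50.0

# Toy block: one matrix multiply with uniformly quantized inputs and weights.
M = rng.standard_normal((d, d)) / np.sqrt(d)
x_q = np.round(x_fp / 0.5) * 0.5
M_q = np.round(M / 0.1) * 0.1
err = x_fp @ M - x_q @ M_q      # full-precision output minus quantized output

W, b = fit_nbc(x_q, err)
err_hat = apply_nbc(x_q, W, b)  # compensation added to the quantized output
residual_mse = float(np.mean((err - err_hat) ** 2))
```

Comparing `residual_mse` with the residual of `fit_linear(x_q, err)` applied directly in the original domain would show whether the transformed-space fit helps on a given distribution. A toy result in either direction says nothing about the paper's reported accuracy.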

What would settle it

Run the method on a held-out model and bit-width combination; if top-1 accuracy remains more than a few points below the unquantized baseline while the same linear layer without BLT performs no worse, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.16423 by Jianxin Wu, Peilin Sun.

Figure 1: The left figure (a) shows the ImageNet Top-1 accuracy of 4-bit quantized models, comparing …
Figure 2: The left figure (a) shows an NBC illustration on a single block. NBC utilizes BLT to …
Figure 3: Plots for BLT and Inverse BLT
Abstract

Network quantization has emerged as one of the most practical model compression techniques, which significantly reduces a model's memory and compute consumption by mapping floating-point numbers to low-bit representations. However, existing quantization methods typically suffer from the speed-accuracy tradeoff and limited generalization. To address these issues, recent compensation-based methods offer an efficient yet general solution by introducing additional lightweight linear layers into the quantized network. However, the accuracy of these methods suffers from their limited compensation capability and high sensitivity to outliers. In this paper, we propose Nonlinear Bipolar Compensation (NBC), a post-training quantization approach that introduces nonlinear compensation to reduce the effect of outliers. We further design Bipolar Logarithmic Transformation (BLT), which compresses outliers by mapping both the quantized input and the quantization error into a transformed space. A simple linear layer is then applied for compensation in the transformed space, preserving the efficiency of our method. Extensive experiments across various tasks, models, and quantization methods confirm the effectiveness, efficiency, robustness, and generality of our NBC approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Nonlinear Bipolar Compensation (NBC) as a post-training quantization technique to mitigate the impact of outliers on accuracy. It introduces the Bipolar Logarithmic Transformation (BLT) that maps both the quantized input activations and the quantization error into a transformed space; a lightweight linear layer then performs compensation in that space before inversion back to the original domain. The authors assert that this nonlinear compensation improves upon prior linear compensation methods while preserving efficiency, with claims of effectiveness, robustness, and generality backed by extensive experiments across tasks, models, and quantization schemes.

Significance. If the central construction proves sound, NBC would supply a practical, low-overhead route to outlier-robust PTQ that retains the efficiency advantages of linear compensation layers. The explicit use of a simple linear layer inside the transformed space is a clear engineering strength. However, the absence of any derivation or bound on the residual error after the nonlinear round-trip limits the ability to assess whether the method systematically reduces error or merely redistributes it across the distribution.

major comments (2)
  1. [Method / BLT construction] The construction applies a nonlinear BLT to both input and error, followed by a linear layer and inversion. Because BLT is nonlinear, the net operator in the original domain is a magnitude-dependent nonlinear correction. No derivation of this effective operator or bound on the residual error (especially for the non-outlier mass of the distribution) is supplied, leaving the claim that the method reliably reduces rather than redistributes quantization error unanalyzed. This analysis is load-bearing for the robustness and generality assertions.
  2. [Abstract / Experiments] The abstract states that extensive experiments confirm effectiveness, efficiency, robustness, and generality, yet the provided text contains no quantitative accuracy deltas, error bars, dataset specifications, or ablation results. Without these concrete numbers it is impossible to verify whether the claimed improvements hold across bit-widths and models or whether they are driven by the nonlinear compensation itself.
minor comments (2)
  1. Clarify the precise functional form of the Bipolar Logarithmic Transformation (including any scaling or offset parameters) and the exact inversion step so that readers can reproduce the nonlinear composition.
  2. Add a short complexity analysis (FLOPs or latency overhead of the added linear layer) to substantiate the efficiency claim relative to prior compensation methods.
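
The kind of estimate that minor comment 2 asks for can be roughed out as follows. This is a back-of-the-envelope sketch, not a figure from the paper: it assumes one dim-by-dim compensation layer per transformer block (the page does not state the layer's shape or placement), counts a multiply-add as 2 FLOPs, and ignores the elementwise log/exp of the transform as well as softmax and normalization. The ViT-B-like sizes (197 tokens, width 768) and all function names are illustrative assumptions.

```python
def linear_flops(tokens: int, d_in: int, d_out: int) -> int:
    """FLOPs of a dense layer applied to every token (multiply-add = 2)."""
    return 2 * tokens * d_in * d_out

def transformer_block_flops(tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Matmul FLOPs of one standard transformer block."""
    qkv_and_proj = 4 * linear_flops(tokens, dim, dim)
    attention = 2 * (2 * tokens * tokens * dim)  # QK^T and attention @ V
    mlp = 2 * linear_flops(tokens, dim, mlp_ratio * dim)
    return qkv_and_proj + attention + mlp

def compensation_overhead(tokens: int, dim: int) -> float:
    """Relative cost of one ASSUMED dim x dim compensation layer per block."""
    extra = linear_flops(tokens, dim, dim)
    return extra / transformer_block_flops(tokens, dim)

ratio = compensation_overhead(tokens=197, dim=768)
print(f"assumed per-block overhead: {ratio:.1%}")
```

A FLOP ratio is only a proxy. The compensation layer may run at a different precision than the quantized block, so measured latency would still be needed to substantiate the efficiency claim.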

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate revisions to be incorporated in the next version of the manuscript.

  1. Referee: [Method / BLT construction] The construction applies a nonlinear BLT to both input and error, followed by a linear layer and inversion. Because BLT is nonlinear, the net operator in the original domain is a magnitude-dependent nonlinear correction. No derivation of this effective operator or bound on the residual error (especially for the non-outlier mass of the distribution) is supplied, leaving the claim that the method reliably reduces rather than redistributes quantization error unanalyzed. This analysis is load-bearing for the robustness and generality assertions.

    Authors: We acknowledge that the current manuscript does not supply a closed-form derivation of the composed nonlinear operator in the original domain or theoretical bounds on the residual error after the round-trip transformation. The method was developed from the empirical observation that logarithmic compression allows a linear compensator to more effectively attenuate large-magnitude outliers while leaving the bulk distribution largely unaffected. In the revised manuscript we will add a dedicated analysis subsection that (i) derives the effective correction operator obtained by composing BLT, the linear layer, and the inverse BLT, and (ii) reports the empirical distribution of residual quantization error on both outlier and non-outlier activations across representative layers, thereby providing quantitative support for the claim that error is reduced rather than merely redistributed. revision: yes

  2. Referee: [Abstract / Experiments] The abstract states that extensive experiments confirm effectiveness, efficiency, robustness, and generality, yet the provided text contains no quantitative accuracy deltas, error bars, dataset specifications, or ablation results. Without these concrete numbers it is impossible to verify whether the claimed improvements hold across bit-widths and models or whether they are driven by the nonlinear compensation itself.

    Authors: We agree that the abstract would be more informative if it contained concrete performance numbers. In the revised version we will shorten the general claims and insert a concise statement of the principal empirical results, for example the average top-1 accuracy gain on ImageNet for ResNet-50 and ViT-B/16 under W4A4 quantization relative to the strongest linear-compensation baseline, together with a brief reference to the evaluation protocol. revision: yes
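
The effective operator discussed in response 1 can be previewed in the scalar case. The derivation below is a sketch under an assumed transform, φ(x) = sign(x) log(1 + |x|), with a single weight w and no bias; the paper's actual BLT and layer may differ, so this illustrates the referee's point rather than reproducing the authors' analysis.

```latex
\begin{align*}
\phi(x) &= \operatorname{sign}(x)\,\log\bigl(1+|x|\bigr),
\qquad
\phi^{-1}(y) = \operatorname{sign}(y)\,\bigl(e^{|y|}-1\bigr), \\[4pt]
g_w(x) &= \phi^{-1}\bigl(w\,\phi(x)\bigr)
        = \operatorname{sign}(w)\,\operatorname{sign}(x)\,
          \Bigl(\bigl(1+|x|\bigr)^{|w|}-1\Bigr), \\[4pt]
g_w(x) &\approx w\,x \quad \text{for } |x| \ll 1,
\qquad
g_w(x) \approx \operatorname{sign}(w)\,\operatorname{sign}(x)\,|x|^{|w|}
\quad \text{for } |x| \gg 1.
\end{align*}
```

Under these assumptions the composed map is linear near zero and a signed power law for large magnitudes, so with |w| < 1 outliers receive a sublinear correction. This is the magnitude-dependent behavior the referee describes; it does not by itself bound the residual error on the non-outlier mass.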

Circularity Check

0 steps flagged

No significant circularity in the NBC/BLT construction

The paper presents NBC and BLT as new algorithmic constructions for post-training quantization compensation. No equations, derivations, or self-citations are exhibited that reduce any claimed prediction or result to a fitted parameter, self-definition, or prior author work by construction. The approach is framed as an empirical method whose effectiveness is demonstrated through experiments across models and bit-widths, leaving the central claims independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method introduces new algorithmic components (NBC and BLT) whose correctness rests on empirical validation rather than stated mathematical assumptions.

pith-pipeline@v0.9.0 · 5700 in / 1045 out tokens · 42020 ms · 2026-05-20T20:50:57.831767+00:00 · methodology
