Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

Jianxin Wu; Peilin Sun

arxiv: 2605.16423 · v1 · pith:ZDAX3U4Cnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-20 20:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords: post-training quantization, outlier handling, nonlinear compensation, bipolar logarithmic transformation, model compression, neural network quantization, efficient inference

The pith

Nonlinear compensation via logarithmic mapping reduces outlier damage in post-training quantization while keeping computation light.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix a practical weakness in compensation-based quantization: outliers in weights or activations still cause large accuracy drops even after lightweight linear corrections are added. It claims that a nonlinear Bipolar Logarithmic Transformation applied to both the quantized input and the quantization error moves those outliers into a range where a plain linear layer can correct them effectively. The resulting method stays efficient because the extra work is only a single linear layer in the transformed space. A reader should care if the claim holds because post-training quantization is already the cheapest way to shrink models for deployment; any gain in accuracy without extra cost or retraining would make the technique more reliable across real networks.

Core claim

NBC introduces nonlinear compensation to reduce the effect of outliers, and BLT maps both the quantized input and the quantization error into a transformed space where a simple linear layer performs compensation while preserving efficiency.

What carries the argument

Bipolar Logarithmic Transformation (BLT), a mapping applied jointly to the quantized input and the quantization error that compresses outliers so a subsequent linear layer can perform compensation.
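
The mapping can be sketched in a few lines of NumPy. The sketch assumes the piecewise form of BLT quoted further down this page (in the Lean-theorem section) and derives the inverse by solving each branch for x. The function names `blt` and `inverse_blt` and the default `n=4` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def blt(x, n=4):
    """Bipolar Logarithmic Transformation, piecewise form as quoted on this page.

    n plays the role of N in the quoted formula; n=4 is an arbitrary placeholder.
    """
    x = np.asarray(x, dtype=np.float64)
    threshold = 2.0 ** (-n)
    # Clamp the magnitude so log2 never sees 0 on entries that use the linear branch.
    mag = np.maximum(np.abs(x), threshold)
    log_branch = np.sign(x) * (np.log2(mag) + n + 1)   # |x| > 2^-N
    lin_branch = x * 2.0 ** n                          # |x| <= 2^-N
    return np.where(np.abs(x) <= threshold, lin_branch, log_branch)

def inverse_blt(y, n=4):
    """Inverse of blt, derived branch by branch (not quoted from the paper)."""
    y = np.asarray(y, dtype=np.float64)
    # Clamp the magnitude so the unused exponential branch stays small.
    mag = np.maximum(np.abs(y), 1.0)
    exp_branch = np.sign(y) * 2.0 ** (mag - n - 1)     # |y| > 1
    lin_branch = y / 2.0 ** n                          # |y| <= 1
    return np.where(np.abs(y) <= 1.0, lin_branch, exp_branch)
```

With `n=4`, an input of 1024 maps to 15 while an input of 1 maps to 5, so a 1024x ratio in the original space becomes a 3x ratio in the transformed space. The two branches meet at the threshold, where both give a transformed value of 1.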

If this is right

Quantized networks achieve higher accuracy than prior linear-compensation methods on the same bit-widths.
The added layer remains cheap enough that overall inference speed stays comparable to standard post-training quantization.
The approach works across multiple quantization algorithms and network architectures without retraining.
Outlier sensitivity drops, allowing lower bit-widths to remain usable on tasks where they previously failed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same transformed-space idea might be tested on other post-training compression steps such as pruning or low-rank approximation.
If the log mapping proves stable, it could be applied once per layer rather than per tensor to further reduce overhead.
A natural next measurement is whether the recovered accuracy holds when the quantized model is fine-tuned for only a few epochs.

Load-bearing premise

Mapping both input and error through the bipolar log transform will compress outliers enough that the linear compensation layer recovers accuracy without leaving model-specific or bit-width-specific distortions unaddressed.

What would settle it

Run the method on a held-out model and bit-width combination; if top-1 accuracy remains more than a few points below the unquantized baseline while the same linear layer without BLT performs no worse, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.16423 by Jianxin Wu, Peilin Sun.

**Figure 2.** The left figure (a) shows an NBC illustration on a single block. NBC utilizes BLT to …

**Figure 3.** Plots for BLT and Inverse BLT

Original abstract

Network quantization has emerged as one of the most practical model compression techniques, which significantly reduces a model's memory and compute consumption by mapping floating-point numbers to low-bit representations. However, existing quantization methods typically suffer from the speed-accuracy tradeoff and limited generalization. To address these issues, recent compensation-based methods offer an efficient yet general solution by introducing additional lightweight linear layers into the quantized network. However, the accuracy of these methods suffers from their limited compensation capability and high sensitivity to outliers. In this paper, we propose Nonlinear Bipolar Compensation (NBC), a post-training quantization approach that introduces nonlinear compensation to reduce the effect of outliers. We further design Bipolar Logarithmic Transformation (BLT), which compresses outliers by mapping both the quantized input and the quantization error into a transformed space. A simple linear layer is then applied for compensation in the transformed space, preserving the efficiency of our method. Extensive experiments across various tasks, models, and quantization methods confirm the effectiveness, efficiency, robustness, and generality of our NBC approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NBC pairs a bipolar log transform with linear compensation to compress outliers in PTQ, but the paper gives no derivation showing the composite operator actually lowers residual error in the original space.

The core idea here is straightforward: instead of applying linear compensation directly to quantized values and errors, they first push both through a bipolar logarithmic transform that squeezes large outliers, run a cheap linear layer in that space, and invert. This is presented as fixing the outlier sensitivity that hurts earlier compensation-based PTQ methods while keeping the added cost low. That pairing of the specific transform with the compensation step looks like the main technical move, and it is not just a relabeling of prior linear tricks.

If the experiments are clean, the approach could give a modest accuracy lift on models where activation tails are the main quantization headache. The construction stays practical and does not require retraining, which is a plus for post-training settings. The experiments are described as covering multiple tasks, models, and base quantizers, which is the right scope for this kind of work.

The soft spot is the lack of any analysis of what the overall map does to the error distribution once you account for the nonlinearity. Because the transform is nonlinear, the effective correction in the original domain is magnitude-dependent; nothing shown guarantees that the net effect reduces total squared error rather than trading outlier error for distortion elsewhere. The abstract claims robustness and generality, but without reported breakdowns on the non-outlier mass or bounds on the residual after inversion, it is hard to know how reliable the gain is across bit widths.

This is the kind of paper that matters to people shipping quantized vision or language models on edge hardware. A practitioner looking for incremental PTQ improvements would find the method and the reported numbers useful even if the theory stays light. It is coherent on its own terms and engages the recent compensation literature directly, so it is worth sending out for review rather than desk-rejecting. I would ask referees to check the effective operator and ask for ablations that separate outlier versus bulk error reduction.
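
To make the magnitude dependence concrete, here is a scalar sketch. It assumes the piecewise BLT form and the compensation rule quoted in the Lean-theorem section of this page, with scalars $w$ and $b$ standing in for $W$ and $b$. The symbol $z$, the inverse, and the case split are introduced here for illustration and are not taken from the paper.

```latex
% Inverse of the quoted BLT, obtained by solving y = f(x) on each branch:
f^{-1}(y) =
\begin{cases}
  2^{\,y - N - 1},    & y > 1 \\
  2^{-N} y,           & |y| \le 1 \\
  -2^{\,-y - N - 1},  & y < -1
\end{cases}

% Let z = w f(x_q) + b.

% Case 1: x_q > 2^{-N} and z > 1 (log branch in, exponential branch out)
f^{-1}(z) = 2^{\,w(\log_2 x_q + N + 1) + b - N - 1}
          = 2^{\,(w - 1)(N + 1) + b}\; x_q^{\,w}

% Case 2: |x_q| \le 2^{-N} and |z| \le 1 (linear branch in and out)
f^{-1}(z) = 2^{-N}\,\bigl(w\,2^{N} x_q + b\bigr)
          = w\,x_q + 2^{-N} b
```

Under these assumptions the correction is a power law in $x_q$ for large inputs and an ordinary affine map near zero, so the composite cannot be summarized by a single linear operator. Mixed cases (for example a large input with $|z| \le 1$) follow from the same substitution and are omitted here.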

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Nonlinear Bipolar Compensation (NBC) as a post-training quantization technique to mitigate the impact of outliers on accuracy. It introduces the Bipolar Logarithmic Transformation (BLT) that maps both the quantized input activations and the quantization error into a transformed space; a lightweight linear layer then performs compensation in that space before inversion back to the original domain. The authors assert that this nonlinear compensation improves upon prior linear compensation methods while preserving efficiency, with claims of effectiveness, robustness, and generality backed by extensive experiments across tasks, models, and quantization schemes.

Significance. If the central construction proves sound, NBC would supply a practical, low-overhead route to outlier-robust PTQ that retains the efficiency advantages of linear compensation layers. The explicit use of a simple linear layer inside the transformed space is a clear engineering strength. However, the absence of any derivation or bound on the residual error after the nonlinear round-trip limits the ability to assess whether the method systematically reduces error or merely redistributes it across the distribution.

major comments (2)

[Method / BLT construction] The construction applies a nonlinear BLT to both input and error, followed by a linear layer and inversion. Because BLT is nonlinear, the net operator in the original domain is a magnitude-dependent nonlinear correction. No derivation of this effective operator or bound on the residual error (especially for the non-outlier mass of the distribution) is supplied, leaving the claim that the method reliably reduces rather than redistributes quantization error unanalyzed. This analysis is load-bearing for the robustness and generality assertions.
[Abstract / Experiments] The abstract states that extensive experiments confirm effectiveness, efficiency, robustness, and generality, yet the provided text contains no quantitative accuracy deltas, error bars, dataset specifications, or ablation results. Without these concrete numbers it is impossible to verify whether the claimed improvements hold across bit-widths and models or whether they are driven by the nonlinear compensation itself.

minor comments (2)

Clarify the precise functional form of the Bipolar Logarithmic Transformation (including any scaling or offset parameters) and the exact inversion step so that readers can reproduce the nonlinear composition.
Add a short complexity analysis (FLOPs or latency overhead of the added linear layer) to substantiate the efficiency claim relative to prior compensation methods.
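
A back-of-the-envelope version of such an analysis might look like the sketch below. It is a rough count under stated assumptions, not the paper's accounting: one dense d-by-d compensation matrix per Transformer block (the page does not confirm the layer's shape or placement), matrix multiplications only, and the elementwise log2/exp2 work of BLT and its inverse ignored. All function names are hypothetical.

```python
def linear_macs(d_in, d_out):
    """Multiply-accumulates per token for one dense linear layer."""
    return d_in * d_out

def transformer_block_macs(d, tokens, mlp_ratio=4):
    """Per-token MACs of a standard Transformer block, counting matmuls only."""
    qkv_and_proj = 4 * linear_macs(d, d)            # Q, K, V and output projection
    mlp = 2 * linear_macs(d, mlp_ratio * d)         # expand and contract
    attention = 2 * tokens * d                      # QK^T and attention-weighted V
    return qkv_and_proj + mlp + attention

def compensation_overhead(d, tokens, mlp_ratio=4):
    """Ratio of assumed compensation-layer MACs to the block's own MACs."""
    return linear_macs(d, d) / transformer_block_macs(d, tokens, mlp_ratio)

if __name__ == "__main__":
    # Example shape: width 768 and 197 tokens (ViT-B/16 at 224x224 resolution).
    print(f"overhead ratio: {compensation_overhead(768, 197):.3f}")
```

A MAC ratio is only a proxy. The quantized layers run at low precision while a compensation layer may run at higher precision, so measured latency could differ noticeably from this count, which is why a reported latency figure would be more convincing.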

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate revisions to be incorporated in the next version of the manuscript.

Referee: [Method / BLT construction] The construction applies a nonlinear BLT to both input and error, followed by a linear layer and inversion. Because BLT is nonlinear, the net operator in the original domain is a magnitude-dependent nonlinear correction. No derivation of this effective operator or bound on the residual error (especially for the non-outlier mass of the distribution) is supplied, leaving the claim that the method reliably reduces rather than redistributes quantization error unanalyzed. This analysis is load-bearing for the robustness and generality assertions.

Authors: We acknowledge that the current manuscript does not supply a closed-form derivation of the composed nonlinear operator in the original domain or theoretical bounds on the residual error after the round-trip transformation. The method was developed from the empirical observation that logarithmic compression allows a linear compensator to more effectively attenuate large-magnitude outliers while leaving the bulk distribution largely unaffected. In the revised manuscript we will add a dedicated analysis subsection that (i) derives the effective correction operator obtained by composing BLT, the linear layer, and the inverse BLT, and (ii) reports the empirical distribution of residual quantization error on both outlier and non-outlier activations across representative layers, thereby providing quantitative support for the claim that error is reduced rather than merely redistributed. revision: yes
Referee: [Abstract / Experiments] The abstract states that extensive experiments confirm effectiveness, efficiency, robustness, and generality, yet the provided text contains no quantitative accuracy deltas, error bars, dataset specifications, or ablation results. Without these concrete numbers it is impossible to verify whether the claimed improvements hold across bit-widths and models or whether they are driven by the nonlinear compensation itself.

Authors: We agree that the abstract would be more informative if it contained concrete performance numbers. In the revised version we will shorten the general claims and insert a concise statement of the principal empirical results, for example the average top-1 accuracy gain on ImageNet for ResNet-50 and ViT-B/16 under W4A4 quantization relative to the strongest linear-compensation baseline, together with a brief reference to the evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the NBC/BLT construction

The paper presents NBC and BLT as new algorithmic constructions for post-training quantization compensation. No equations, derivations, or self-citations are exhibited that reduce any claimed prediction or result to a fitted parameter, self-definition, or prior author work by construction. The approach is framed as an empirical method whose effectiveness is demonstrated through experiments across models and bit-widths, leaving the central claims independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method introduces new algorithmic components (NBC and BLT) whose correctness rests on empirical validation rather than stated mathematical assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Bipolar Logarithmic Transformation (BLT) ... $f(x) = \log_2(x) + N + 1$ for $x > 2^{-N}$; $f(x) = 2^{N} x$ for $|x| \le 2^{-N}$; $f(x) = -\log_2(-x) - N - 1$ for $x < -2^{-N}$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

$y_{\mathrm{nbc}} = y_q + f^{-1}(W f(x_q) + b)$
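
The compensation rule quoted above can be sketched in NumPy as follows. The page does not say how W and b are obtained, so the sketch assumes an ordinary least-squares fit in the transformed space, regressing the transformed error on the transformed quantized input, where the error is taken to be the full-precision output minus the quantized output. The names `fit_compensation` and `nbc_output` and the constant `N = 4` are hypothetical.

```python
import numpy as np

N = 4  # placeholder for the BLT parameter N; the paper's value is not given here

def blt(x, n=N):
    """Piecewise BLT as quoted on this page."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.maximum(np.abs(x), 2.0 ** (-n))
    return np.where(np.abs(x) <= 2.0 ** (-n),
                    x * 2.0 ** n,
                    np.sign(x) * (np.log2(mag) + n + 1))

def inverse_blt(y, n=N):
    """Inverse of blt, derived branch by branch."""
    y = np.asarray(y, dtype=np.float64)
    mag = np.maximum(np.abs(y), 1.0)
    return np.where(np.abs(y) <= 1.0,
                    y / 2.0 ** n,
                    np.sign(y) * 2.0 ** (mag - n - 1))

def fit_compensation(x_q, err):
    """Assumed fitting step: least squares so that W f(x_q) + b ~ f(err).

    x_q: (samples, d_in) quantized inputs; err: (samples, d_out) output errors.
    """
    feats = blt(x_q)
    design = np.hstack([feats, np.ones((feats.shape[0], 1))])  # bias column
    sol, *_ = np.linalg.lstsq(design, blt(err), rcond=None)
    return sol[:-1].T, sol[-1]   # W: (d_out, d_in), b: (d_out,)

def nbc_output(x_q, y_q, W, b):
    """y_nbc = y_q + f^{-1}(W f(x_q) + b), applied row by row."""
    return y_q + inverse_blt(blt(x_q) @ W.T + b)
```

Fitting in the transformed space keeps the problem linear, which matches the page's description of a simple linear layer doing the compensation. Whether the paper minimizes error in the transformed space or in the original space is not stated here.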

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 8 internal anchors

[1]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[2]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

work page 2021
[3]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[4]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[5]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[6]

A survey of model compression and acceleration for deep neural networks.ArXiv, abs/1710.09282, 2017

Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks.arXiv preprint arXiv:1710.09282, 2017

work page arXiv 2017
[7]

A survey of quantization methods for efficient neural network inference

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. InLow-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022

work page 2022
[8]

Gplq: A general, practical, and lightning qat method for vision transformers.arXiv preprint arXiv:2506.11784, 2025

Guang Liang, Xinyao Liu, and Jianxin Wu. Gplq: A general, practical, and lightning qat method for vision transformers.arXiv preprint arXiv:2506.11784, 2025

work page arXiv 2025
[9]

Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022

Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022

work page 2022
[10]

Learned step size quantization

Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dhar- mendra S Modha. Learned step size quantization.arXiv preprint arXiv:1902.08153, 2019. 17

work page arXiv 1902
[11]

Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models.arXiv preprint arXiv:2407.11062, 2024

work page arXiv 2024
[12]

Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization

Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. InEuropean conference on computer vision, pages 191–207. Springer, 2022

work page 2022
[13]

Repq-vit: Scale reparameterization for post-training quantization of vision transformers

Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. Repq-vit: Scale reparameterization for post-training quantization of vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17227–17236, 2023

work page 2023
[14]

Instance-aware group quantization for vision transformers

Jaehyeon Moon, Dohyung Kim, Junyong Cheon, and Bumsub Ham. Instance-aware group quantization for vision transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16132–16141, 2024

work page 2024
[15]

Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025

Yunshan Zhong, You Huang, Jiawei Hu, Yuxin Zhang, and Rongrong Ji. Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025

work page 2025
[16]

Up or down? adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. InInternational conference on machine learning, pages 7197–7206. PMLR, 2020

work page 2020
[17]

Brecq: Pushing the limit of post-training quantization by block reconstruc- tion

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021

work page arXiv 2021
[18]

Qdrop: Randomly dropping quantization for extremely low-bit post-training quantization

Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. Qdrop: Ran- domly dropping quantization for extremely low-bit post-training quantization.arXiv preprint arXiv:2203.05740, 2022

work page arXiv 2022
[19]

Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers

Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, and Yunhong Wang. Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9686–9695, 2025

work page 2025
[20]

Quantization without tears

Minghao Fu, Hao Yu, Jie Shao, Junjie Zhou, Ke Zhu, and Jianxin Wu. Quantization without tears. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4462–4472, June 2025

work page 2025
[21]

Qwt-v2: Practical, effective and efficient post-training quantization.arXiv preprint arXiv:2505.20932, 2025

Ningyuan Tang, Minghao Fu, Hao Yu, and Jianxin Wu. Qwt-v2: Practical, effective and efficient post-training quantization.arXiv preprint arXiv:2505.20932, 2025

work page arXiv 2025
[22]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35: 30318–30332, 2022

work page 2022
[23]

Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pages 38087–38099. PMLR, 2023

work page 2023
[25]

Adalog: Post- training quantization for vision transformers with adaptive logarithm quantizer

Zhuguanyu Wu, Jiaxin Chen, Hanwen Zhong, Di Huang, and Yunhong Wang. Adalog: Post- training quantization for vision transformers with adaptive logarithm quantizer. InEuropean Conference on Computer Vision, pages 411–427. Springer, 2024

work page 2024
[26]

Q-dit: Accurate post-training quantization for diffusion transformers

Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-dit: Accurate post-training quantization for diffusion transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025. 18

work page 2025
[27]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms.Advances in Neural Information Processing Systems, 37:87766–87800, 2024

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms.Advances in Neural Information Processing Systems, 37:87766–87800, 2024

work page 2024
[28]

Uq-vit: Harmonizing extreme activations with hardware-friendly uniform quantization in vision transformers

Tao Jiang, Yucheng Jiang, Xiwen Yao, Gong Cheng, and Junwei Han. Uq-vit: Harmonizing extreme activations with hardware-friendly uniform quantization in vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22354–22362, 2026

work page 2026
[29]

Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transform- ers.arXiv preprint arXiv:2408.03291, 2024

Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transform- ers.arXiv preprint arXiv:2408.03291, 2024

work page arXiv 2024
[30]

Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation

Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, and Yunhong Wang. Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14891–14900, 2025

work page 2025
[31]

Notes on the use of data transformations.Practical assessment, research, and evaluation, 8(1), 2002

Jason Osborne. Notes on the use of data transformations.Practical assessment, research, and evaluation, 8(1), 2002

work page 2002
[32]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[33]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

work page 2021
[34]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[36]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[37]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

work page 2024
[40]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[41]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[42]

Piqa: Reasoning about phys- ical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020. 19

work page 2020
[43]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[44]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[45]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[46]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021
[47]

https://developer.nvidia.com/ tensorrt

NVIDIA Corporation.NVIDIA TensorRT, 2024. https://developer.nvidia.com/ tensorrt

work page 2024
[48]

Microsoft.ONNX Runtime, 2024.https://onnxruntime.ai/

work page 2024
[49]

{TVM}: An automated {End-to-End} optimizing compiler for deep learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An automated {End-to-End} optimizing compiler for deep learning. In13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 578–594, 2018

work page 2018
[50]

NVIDIA Corporation. NVIDIA FasterTransformer, 2024. https://github.com/NVIDIA/FasterTransformer
[51]

Elias Frantar, Roberto L Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. Marlin: Mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 239–251, 2025
[52]

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. Proceedings of Machine Learning and Systems, 6:196–209, 2024
[53]

Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2810–2819, 2019
[54]

Guo-Hua Wang, Yifan Ge, and Jianxin Wu. Distilling knowledge by mimicking features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8183–8195, 2021
[55]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014
[56]

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017
[57]

Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018
[58]

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010
[59]

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014
[60]

Hao Yu, Yang Zhou, Bohua Chen, Zelan Yang, Shen Li, Yong Li, and Jianxin Wu. Treasures in discarded weights for LLM quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22218–22226, 2025

