SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference

Anand Raghunathan; Aradhana Mohan Parvathy; Arnab Raha; Deepak A. Mathaikutty; Shamik Kundu; Soumendu Kumar Ghosh; Souvik Kundu

arxiv: 2606.00365 · v1 · pith:M2RFEHUPnew · submitted 2026-05-29 · 💻 cs.AR


Aradhana Mohan Parvathy, Soumendu Kumar Ghosh, Shamik Kundu, Arnab Raha, Souvik Kundu, Deepak A. Mathaikutty, Anand Raghunathan

Pith reviewed 2026-06-28 19:29 UTC · model grok-4.3

classification 💻 cs.AR

keywords LLM inference, quantization, activation sparsity, hardware-software co-design, sub-precision representation, memory traffic reduction, energy efficiency, accelerator design

The pith

SPARQLe represents each 2k-bit quantized activation as a dense k-bit LSB tensor plus a sparse k-bit MSB tensor with a precision bitmap to cut memory traffic and run on k-bit datapaths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that after standard quantization many activations cluster near zero and therefore leave the higher-order bits sparse. SPARQLe turns this statistical property into a concrete format: the lower k bits stay dense, while the upper k bits are stored only where nonzero and are indexed by a small bitmap. A lightweight algorithm increases the sparsity further. The resulting hybrid tensor travels over narrower memory buses and feeds a k-bit accelerator that still produces the numerical result of the original 2k-bit activation. On BitNet 3B, Llama 2 7B and Llama 3 8B, the paper reports prefill latency reductions of 16-24.3% and decode latency reductions of 13.5-23.4%, with 2k-bit activation accuracy preserved.

Core claim

SPARQLe is a hardware-software co-design that represents each 2k-bit activation tensor as a dense k-bit LSB tensor and a sparse k-bit MSB tensor compressed with a precision bitmap, proposes a lightweight algorithm to increase MSB sparsity, and supplies an accelerator that operates directly on the hybrid format with minimal control overheads, thereby reducing activation memory traffic and enabling efficient k-bit datapath computation while preserving 2k-bit activation accuracy.

What carries the argument

The hybrid activation format: a dense k-bit LSB tensor plus a sparse k-bit MSB tensor indexed by a precision bitmap. It carries the argument by converting the statistical near-zero concentration of activations into reduced memory traffic and a narrower datapath.
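To make the format concrete, here is a minimal Python sketch of the split and its inverse. It is an illustration under stated assumptions, not the paper's implementation: activations are taken to be unsigned 2k-bit codes with k = 4, the bitmap holds one bit per activation, and the names `encode` and `decode` are hypothetical. How the paper handles signed values is not stated in the text above.

```python
from typing import List, Tuple

def encode(acts: List[int], k: int = 4) -> Tuple[List[int], List[int], List[int]]:
    """Split unsigned 2k-bit codes into dense LSBs, a bitmap, and nonzero MSBs.

    Assumption: one bitmap bit per activation marks whether its MSB half is nonzero.
    """
    mask = (1 << k) - 1
    lsb, bitmap, msb_nz = [], [], []
    for a in acts:
        if not 0 <= a < (1 << (2 * k)):
            raise ValueError(f"{a} is not an unsigned {2 * k}-bit code")
        hi = a >> k
        lsb.append(a & mask)          # dense k-bit low half, always stored
        bitmap.append(1 if hi else 0) # precision bitmap entry
        if hi:
            msb_nz.append(hi)         # high half stored only when nonzero
    return lsb, bitmap, msb_nz

def decode(lsb: List[int], bitmap: List[int], msb_nz: List[int], k: int = 4) -> List[int]:
    """Rebuild the original 2k-bit codes from the hybrid representation."""
    stored = iter(msb_nz)
    return [lo | (next(stored) << k) if flag else lo
            for lo, flag in zip(lsb, bitmap)]

acts = [3, 0, 17, 200, 15, 16]
lsb, bitmap, msb_nz = encode(acts)
restored = decode(lsb, bitmap, msb_nz)
```

In this example only three of the six activations need a stored MSB half, and `restored` equals `acts`, which is the lossless property the format relies on.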

If this is right

Prefill latency drops 16-24.3% on the three evaluated models.
Decode latency drops 13.5-23.4%.
Prefill energy falls 17-26.7% and decode energy falls 6.5-14.2%.
All savings occur while 2k-bit activation accuracy is preserved.
Computation runs on k-bit datapaths instead of 2k-bit datapaths.
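The last point can be illustrated with a dot product that only ever multiplies k-bit activation operands. The sketch below is an assumption-laden illustration, not the accelerator's dataflow: it treats activations as unsigned 8-bit codes, runs a dense pass over the low halves and a sparse pass over the nonzero high halves, and the name `hybrid_dot` is hypothetical.

```python
from typing import List, Tuple

def hybrid_dot(acts: List[int], weights: List[int], k: int = 4) -> Tuple[int, int]:
    """Dot product computed from k-bit activation halves.

    Uses the identity a = lo + (hi << k), so
    sum(a * w) = sum(lo * w) + (sum(hi * w) << k).
    Returns the result and the number of sparse-phase multiplies performed.
    Assumption: acts are unsigned 2k-bit codes; weights are plain integers.
    """
    mask = (1 << k) - 1
    # Dense phase: every activation contributes its k-bit low half.
    dense = sum((a & mask) * w for a, w in zip(acts, weights))
    # Sparse phase: only activations with a nonzero high half are touched.
    sparse, sparse_macs = 0, 0
    for a, w in zip(acts, weights):
        hi = a >> k
        if hi:
            sparse += hi * w
            sparse_macs += 1
    return dense + (sparse << k), sparse_macs

acts = [3, 0, 17, 200]
weights = [1, -2, 3, -1]
result, sparse_macs = hybrid_dot(acts, weights)
reference = sum(a * w for a, w in zip(acts, weights))
```

Here `result` matches the full-width `reference`, while the sparse phase performs two multiplies instead of four. The higher the MSB sparsity, the closer the total work gets to a single k-bit pass.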

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bitmap-plus-sparse-high-bits pattern could be applied to intermediate tensors inside attention or MLP blocks that are not currently quantized.
If the sparsity pattern proves stable across training checkpoints, the bitmap could be generated once and reused for multiple inferences.
Hardware vendors could add native support for the hybrid format in future matrix units to widen the efficiency gap beyond the current software-managed accelerator.

Load-bearing premise

A significant fraction of activations remain concentrated around zero after quantization, producing usable sparsity in the higher-order bits that can be exploited by the bitmap representation without accuracy loss.
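One way to probe this premise is to measure the fraction of zero high-order halves on quantized samples. The following sketch uses synthetic data only and does not reproduce the paper's measurements. Its assumptions: magnitudes are quantized to unsigned 8-bit codes by max-abs scaling, exponentially distributed magnitudes stand in for a Laplacian-like activation distribution, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

def quantize_magnitudes(samples: List[float], bits: int = 8) -> List[int]:
    """Max-abs scale magnitudes to unsigned codes in [0, 2**bits - 1]."""
    peak = max(abs(x) for x in samples)
    top = (1 << bits) - 1
    return [round(abs(x) / peak * top) for x in samples]

def msb_sparsity(codes: List[int], k: int = 4) -> float:
    """Fraction of codes whose upper half (bits above k) is zero."""
    return sum(1 for c in codes if (c >> k) == 0) / len(codes)

def demo(n: int = 10000, seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    # Zero-peaked magnitudes (stand-in for Laplacian-like activations).
    peaked = [rng.expovariate(1.0) for _ in range(n)]
    # Flat magnitudes: the unfavourable case with no concentration near zero.
    flat = [rng.random() for _ in range(n)]
    return (msb_sparsity(quantize_magnitudes(peaked)),
            msb_sparsity(quantize_magnitudes(flat)))

s_peaked, s_flat = demo()
```

The zero-peaked samples yield a markedly higher MSB sparsity than the flat ones, which is the contrast the premise depends on. Real activation tensors would need to be substituted for the synthetic samples before drawing any conclusion about a specific model.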

What would settle it

Apply SPARQLe to a new model whose post-quantization activations show markedly lower concentration around zero and measure whether accuracy falls below the 2k-bit baseline or the reported latency-energy gains disappear.

Figures

Figures reproduced from arXiv: 2606.00365 by Anand Raghunathan, Aradhana Mohan Parvathy, Arnab Raha, Deepak A. Mathaikutty, Shamik Kundu, Soumendu Kumar Ghosh, Souvik Kundu.

**Figure 2.** Illustration of the SPARQLe data representation. Compression (%) = $\frac{p - \left(\frac{p}{2} + 1 + (1-s)\frac{p}{2}\right)}{p} \cdot 100 = \frac{\frac{sp}{2} - 1}{p} \cdot 100$ (1). Ops Reduction (%) = $\frac{s}{2} \cdot 100$ (2). While most activations follow Gaussian- or Laplacian-like distributions, certain non-linear functions such as SiLU may cause activation distributions to deviate from these cases. Nevertheless, these activations still exhibit even higher sub-precision sp…

**Figure 3.** Example of MSB4 sparsity enhancement. Least-important…

**Figure 4.** (a) Shared load path for dense and sparse phase, (b) Hybrid PE architecture, (c) Drain path in…

**Figure 5.** Timeline of memory read/write, dense and sparse compute.

**Figure 6.** Energy and performance benefits with SPARQLe: (a) Normalized Prefill and Decode Energy, (b) Normalized Runtime Prefill and Decode, (c) Compute acceleration in Prefill and Memory access acceleration in Decode.

**Figure 7.** Accuracy/Sub-precision sparsity (averaged across the entire…

**Figure 8.** Layerwise latency reduction trend in BitNet-3B. Similar…
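Equations (1) and (2) in the Figure 2 caption can be evaluated directly. The sketch below assumes that p is the full activation width in bits (2k) and s is the fraction of activations whose MSB half is zero. This reading matches the terms of the formula (p/2 bits for the LSB half, 1 bitmap bit, and (1 - s)·p/2 bits for surviving MSB halves), but it is an interpretation of the caption rather than a definition quoted from the paper. The function names are hypothetical.

```python
def compression_pct(s: float, p: int = 8) -> float:
    """Equation (1): per-activation storage saving versus a dense p-bit value."""
    return (s * p / 2 - 1) / p * 100

def ops_reduction_pct(s: float) -> float:
    """Equation (2): reduction in operations for MSB sparsity s."""
    return s / 2 * 100

def break_even_sparsity(p: int = 8) -> float:
    """Sparsity at which Equation (1) is zero: s * p / 2 = 1, so s = 2 / p."""
    return 2 / p

half_sparse = compression_pct(0.5)      # 12.5 for p = 8
ops_at_half = ops_reduction_pct(0.5)    # 25.0
threshold = break_even_sparsity()       # 0.25 for p = 8
```

Under this reading, an 8-bit activation tensor needs more than 25% MSB sparsity before the bitmap pays for itself. That makes the missing per-layer sparsity data the quantity to check against the headline numbers.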

Original abstract

The rapid growth in the size of large language models (LLMs) results in high compute and memory costs during inference. Quantization has been a significant pathway to addressing this challenge. In the quest to push the limits of quantization, weights, which are static, can often be quantized aggressively (e.g., 4 bits), while activations often require higher precision (e.g., 8 bits) to preserve accuracy, forcing hardware to operate with higher-precision datapaths. We leverage the statistical property that a significant fraction of activations are concentrated around zero, resulting in sparsity in the higher-order bits. Our proposal, SPARQLe, is a hardware-software co-design framework that exploits this sub-precision redundancy in any given quantized model. SPARQLe represents each 2k-bit activation tensor as a dense k-bit LSB tensor and a sparse k-bit MSB tensor compressed with a precision bitmap, and proposes a lightweight algorithm to increase MSB sparsity. SPARQLe reduces activation memory traffic and enables efficient computation on k-bit datapaths while preserving 2k-bit activation accuracy. SPARQLe includes an accelerator that operates directly on this hybrid format with minimal control overheads. Across the BitNet 3B, Llama2 7B, and Llama3 8B models, SPARQLe reduces prefill latency by 16-24.3% and decode latency by 13.5-23.4%, with 17-26.7% and 6.5-14.2% lower prefill and decode energy, respectively. SPARQLe demonstrates that sub-precision activation sparsity offers an effective and complementary pathway towards efficient LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SPARQLe's hybrid LSB/MSB activation split with bitmap compression is a direct attack on activation traffic, but the abstract gives no sparsity measurements or ablations to support the 16-24% latency claims.


The paper's core move is to represent each 2k-bit activation as a dense k-bit LSB tensor plus a sparse k-bit MSB tensor stored with a precision bitmap, plus a lightweight pass that increases MSB sparsity. This lets the hardware run narrower datapaths and cut memory traffic while claiming to keep full 2k-bit accuracy. The reported numbers across BitNet 3B, Llama2 7B, and Llama3 8B are 16-24.3% prefill latency reduction, 13.5-23.4% decode latency reduction, and corresponding energy drops.

What stands out is the explicit targeting of activation memory as a complementary lever to weight quantization. The statistical observation that many activations stay near zero after quantization is used to justify the split, and the co-design with a custom accelerator is laid out at the level of the abstract.

The soft spot is exactly where the stress-test note points: the latency and energy gains require enough zeros in the MSB to make the bitmap format cheaper than a dense baseline. The abstract states this holds and accuracy is preserved, but supplies no per-layer sparsity histograms, no bitmap overhead measurements, and no ablation on what happens when the near-zero concentration is weaker. Without those, the performance numbers cannot be checked or generalized.

The approach itself does not look circular; it rests on an observed property rather than fitted parameters. The citation pattern and full experimental protocol are not visible from the abstract alone.

This is for hardware-software co-design groups working on quantized LLM accelerators. A reader who needs concrete sparsity data or full ablations will not get much yet. It is worth sending to peer review once the full manuscript with the missing measurements is available, because the problem it attacks is real and the representation is not a routine extension of prior work.

Referee Report

2 major / 2 minor

Summary. The paper proposes SPARQLe, a hardware-software co-design for quantized LLM inference that exploits observed concentration of activations around zero after quantization. Each 2k-bit activation tensor is represented as a dense k-bit LSB tensor plus a sparse k-bit MSB tensor compressed via a precision bitmap; a lightweight algorithm increases MSB sparsity. An accelerator operates directly on the hybrid format. The central empirical claim is that this yields 16-24.3% prefill and 13.5-23.4% decode latency reductions (plus corresponding energy savings) on BitNet 3B, Llama2 7B and Llama3 8B while preserving 2k-bit accuracy.

Significance. If the sparsity property and bitmap overhead measurements hold, the approach supplies a complementary, model-agnostic route to lowering activation memory traffic and enabling narrower datapaths without retraining or accuracy loss, which would be of practical interest for inference accelerators.

major comments (2)

[Abstract / §4] Abstract (and §4/§5 experimental sections): the headline latency and energy numbers (16-24.3% prefill latency, 17-26.7% prefill energy, etc.) are load-bearing for the contribution, yet the manuscript supplies neither per-layer/per-model MSB sparsity histograms nor explicit bitmap-overhead measurements that would confirm the hybrid representation is cheaper than a dense 2k-bit baseline on the evaluated models.
[§3.2] §3.2 (lightweight sparsity-increasing algorithm): the claim that the algorithm preserves 2k-bit accuracy while increasing usable MSB sparsity is central, but no ablation is shown that quantifies accuracy degradation when the algorithm is disabled or when the assumed zero-concentration is weaker (e.g., on other model families or bit-widths).

minor comments (2)

[§3.1] Notation for the hybrid format (dense LSB + bitmap-compressed MSB) should be introduced with an explicit equation or diagram in §3.1 to avoid ambiguity when comparing against the 2k-bit baseline.
[§4] Table captions in the experimental section should explicitly state the number of runs and whether error bars reflect standard deviation across seeds or models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the empirical support for the claims.


Referee: [Abstract / §4] Abstract (and §4/§5 experimental sections): the headline latency and energy numbers (16-24.3% prefill latency, 17-26.7% prefill energy, etc.) are load-bearing for the contribution, yet the manuscript supplies neither per-layer/per-model MSB sparsity histograms nor explicit bitmap-overhead measurements that would confirm the hybrid representation is cheaper than a dense 2k-bit baseline on the evaluated models.

Authors: We agree that the absence of per-layer/per-model MSB sparsity histograms and explicit bitmap-overhead breakdowns weakens the substantiation of the headline numbers. The reported latency and energy figures were obtained from cycle-accurate simulation of the accelerator that already incorporates bitmap storage and access costs. In the revised manuscript we will add the requested histograms (showing MSB sparsity per layer and model) together with a table breaking down bitmap overhead versus the dense 2k-bit baseline, confirming net traffic reduction on BitNet 3B, Llama2 7B and Llama3 8B. revision: yes
Referee: [§3.2] §3.2 (lightweight sparsity-increasing algorithm): the claim that the algorithm preserves 2k-bit accuracy while increasing usable MSB sparsity is central, but no ablation is shown that quantifies accuracy degradation when the algorithm is disabled or when the assumed zero-concentration is weaker (e.g., on other model families or bit-widths).

Authors: The algorithm is a lightweight post-quantization pass whose only purpose is to increase MSB sparsity; the manuscript already states that end-to-end accuracy remains identical to the 2k-bit baseline on the three evaluated models. We will add an ablation table in §3.2 (and corresponding text in §5) that reports perplexity/accuracy with the algorithm disabled versus enabled on BitNet 3B, Llama2 7B and Llama3 8B. Extending the ablation to additional model families or bit-widths would require new experiments outside the current scope; we will note this limitation and discuss the dependence on the observed zero-concentration property. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical results rest on observed sparsity property


The paper proposes SPARQLe as a hardware-software co-design that exploits an observed statistical property of quantized activations (concentration around zero yielding MSB sparsity) to represent tensors as dense LSB plus bitmap-compressed sparse MSB, with a lightweight sparsity-increasing algorithm. Performance numbers (latency/energy reductions on BitNet 3B, Llama2 7B, Llama3 8B) are obtained from direct accelerator measurements and are not derived from any equations, fitted parameters, or self-citations that reduce the claims to their own inputs by construction. The central premise is an external empirical observation rather than a self-referential derivation, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract only; the central claim rests on an unverified statistical property of post-quantization activations and on the correctness of an unreviewed accelerator design.

axioms (1)

domain assumption: A significant fraction of activations are concentrated around zero after quantization, producing sparsity in higher-order bits.
Explicitly stated in the abstract as the property leveraged by the method.

invented entities (1)

Hybrid activation format (dense k-bit LSB tensor + sparse k-bit MSB tensor with precision bitmap); no independent evidence
purpose: To allow k-bit datapath computation while preserving 2k-bit accuracy.
Core novel representation introduced by the paper.


Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

[1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. ... doi:10.52202/079017-3180

[2] Reena Elangovan, Shubham Jain, and Anand Raghunathan. 2020. Ax-BxP: Approximate Blocked Computation for Precision-Reconfigurable Deep Neural Network Acceleration. CoRR abs/2011.13000 (2020). arXiv:2011.13000 https://arxiv.org/abs/2011.13000

[3] Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

[4] Hugo Touvron et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288

[5] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. The L... doi:10.5281/zenodo.12608602

[6] Wonsuk Jang and Thierry Tambe. 2025. BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference. CoRR abs/2501.01144 (2025). arXiv:2501.01144 doi:10.48550/ARXIV.2501.01144

[7] Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527 (2024)

[8] Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. 2024. SqueezeLLM: Dense-and-Sparse Quantization. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Helle...

[9] Seunghyun Lee, Dongho Ha, Sungbin Kim, Sungwoo Kim, Hyunwuk Lee, and Won Woo Ro. 2025. BitL: A Hybrid Bit-Serial and Parallel Deep Learning Accelerator for Critical Path Reduction. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 1565–1578. doi:10.1145/3725843.3756044

[10] Yuhang Li and Priyadarshini Panda. 2024. TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction. arXiv:2410.19103 [cs.LG] https://arxiv.org/abs/2410.19103

[11] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. 2024. DuQuant: distributing outliers via dual transformation makes stronger quantized LLMs. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red...

[12] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In MLSys

[13] Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. arXiv preprint arXiv:2405.04532 (2024)

[14] James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. 2025. Training-Free Activation Sparsity in Large Language Models. In The Thirteenth International Conference on Learning Representations.

[15] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. CoRR abs/2402.17764 (2024). arXiv:2402.17764 doi:10.48550/ARXIV.2402.17764

[16] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]

[17] Arnab Raha, Deepak A. Mathaikutty, Shamik Kundu, and Soumendu K. Ghosh

[18] FlexNPU: a dataflow-aware flexible deep learning accelerator for energy-efficient edge devices. Frontiers in High Performance Computing, Volume 3 (2025). doi:10.3389/fhpcp.2025.1570210

[19] Akshat Ramachandran, Souvik Kundu, and Tushar Krishna. 2025. MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25). Association for Computing Machinery, New York, NY, USA, 1193–1209. doi:10.1145/3695053.3730989

[20] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2024. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openrev...

[21] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning

[22] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving. In Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 196–209. https://proceedings.mlsys.org/paper_f...
