pith. sign in

arxiv: 2606.00365 · v1 · pith:M2RFEHUPnew · submitted 2026-05-29 · 💻 cs.AR

SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference

Pith reviewed 2026-06-28 19:29 UTC · model grok-4.3

classification 💻 cs.AR
keywords LLM inferencequantizationactivation sparsityhardware-software co-designsub-precision representationmemory traffic reductionenergy efficiencyaccelerator design
0
0 comments X

The pith

SPARQLe represents each 2k-bit quantized activation as a dense k-bit LSB tensor plus a sparse k-bit MSB tensor with a precision bitmap to cut memory traffic and run on k-bit datapaths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that after standard quantization many activations cluster near zero and therefore leave the higher bits sparse. SPARQLe turns this statistical property into a concrete format: the lower k bits stay dense while the upper k bits are stored only where nonzero, indexed by a small bitmap. A lightweight algorithm increases the sparsity further. The resulting hybrid tensor travels over narrower memory buses and feeds a k-bit accelerator that still produces the numerical result of the original 2k-bit activation. Measured on BitNet 3B, Llama 2 7B and Llama 3 8B the scheme delivers the reported latency and energy reductions while the final model accuracy stays unchanged.

Core claim

SPARQLe is a hardware-software co-design that represents each 2k-bit activation tensor as a dense k-bit LSB tensor and a sparse k-bit MSB tensor compressed with a precision bitmap, proposes a lightweight algorithm to increase MSB sparsity, and supplies an accelerator that operates directly on the hybrid format with minimal control overheads, thereby reducing activation memory traffic and enabling efficient k-bit datapath computation while preserving 2k-bit activation accuracy.

What carries the argument

The hybrid activation format (dense k-bit LSB tensor plus sparse k-bit MSB tensor indexed by a precision bitmap) that carries the argument by converting statistical near-zero concentration into reduced memory traffic and narrower datapath width.

If this is right

  • Prefill latency drops 16-24.3% on the three evaluated models.
  • Decode latency drops 13.5-23.4%.
  • Prefill energy falls 17-26.7% and decode energy falls 6.5-14.2%.
  • All savings occur while 2k-bit activation accuracy is preserved.
  • Computation runs on k-bit datapaths instead of 2k-bit datapaths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bitmap-plus-sparse-high-bits pattern could be applied to intermediate tensors inside attention or MLP blocks that are not currently quantized.
  • If the sparsity pattern proves stable across training checkpoints, the bitmap could be generated once and reused for multiple inferences.
  • Hardware vendors could add native support for the hybrid format in future matrix units to widen the efficiency gap beyond the current software-managed accelerator.

Load-bearing premise

A significant fraction of activations remain concentrated around zero after quantization, producing usable sparsity in the higher-order bits that can be exploited by the bitmap representation without accuracy loss.

What would settle it

Apply SPARQLe to a new model whose post-quantization activations show markedly lower concentration around zero and measure whether accuracy falls below the 2k-bit baseline or the reported latency-energy gains disappear.

Figures

Figures reproduced from arXiv: 2606.00365 by Anand Raghunathan, Aradhana Mohan Parvathy, Arnab Raha, Deepak A. Mathaikutty, Shamik Kundu, Soumendu Kumar Ghosh, Souvik Kundu.

Figure 1
Figure 1. Figure 1: (a) Perplexity for varying weight precisions (x-axis) and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the SPARQLe data representation. Compression (%) = 𝑝− ( 𝑝 2 +1+ (1−𝑠)𝑝 2 ) 𝑝 · 100 = 𝑠𝑝 2 −1 𝑝 · 100 (1) Ops Reduction (%) = 𝑠 2 ∗ 100 (2) While most activations follow Gaussian- or Laplacian-like dis￾tributions, certain non-linear functions such as SiLU may cause activation distributions to deviate from these cases. Nevertheless, these activations still exhibit even higher sub-precision sp… view at source ↗
Figure 3
Figure 3. Figure 3: Example of MSB4 sparsity enhancement. Least-important [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Shared load path for dense and sparse phase, (b) Hybrid PE architecture, (c) Drain path in [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Timeline of memory read/write, dense and sparse compute [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Energy and performance benefits with SPARQLe: (a) Normalized Prefill and Decode Energy, (b) Normalized Runtime Prefill and Decode, (c) Compute acceleration in Prefill and Memory access acceleration in Decode [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Accuracy/Sub-precision sparsity (averaged across the entire [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Layerwise latency reduction trend in BitNet-3B. Similar [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
read the original abstract

The rapid growth in sizes of Large language models (LLMs) results in high compute and memory costs during inference. Quantization has been a significant pathway to addressing this challenge. In the quest to push the limits of quantization, weights, which are static, can often be quantized aggressively (e.g. 4 bits), while activations often require higher precision (e.g., 8 bits) to preserve accuracy, forcing hardware to operate with higher-precision datapaths. We leverage the statistical property that a significant fraction of activations are concentrated around zero, resulting in sparsity in the higher-order bits. Our proposal, SPARQLe, is a hardware-software co-design framework that exploits this sub-precision redundancy in any given quantized model. SPARQLe represents each 2k-bit activation tensor as a dense k-bit LSB tensor and a sparse k-bit MSB tensor compressed with a precision bitmap, and proposes a lightweight algorithm to increase MSB sparsity. SPARQLe reduces activation memory traffic and enables efficient computation on k-bit datapaths while preserving 2k-bit activation accuracy. SPARQLe includes an accelerator that operates directly on this hybrid format with minimal control overheads. Across the BitNet 3B, Llama2 7B, and Llama3 8B models, SPARQLe reduces prefill latency by 16-24.3% and decode latency by 13.5-23.4%, with 17-26.7% and 6.5-14.2% lower prefill and decode energy, respectively. SPARQLe demonstrates that sub-precision activation sparsity offers an effective and complementary pathway towards efficient LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SPARQLe, a hardware-software co-design for quantized LLM inference that exploits observed concentration of activations around zero after quantization. Each 2k-bit activation tensor is represented as a dense k-bit LSB tensor plus a sparse k-bit MSB tensor compressed via a precision bitmap; a lightweight algorithm increases MSB sparsity. An accelerator operates directly on the hybrid format. The central empirical claim is that this yields 16-24.3% prefill and 13.5-23.4% decode latency reductions (plus corresponding energy savings) on BitNet 3B, Llama2 7B and Llama3 8B while preserving 2k-bit accuracy.

Significance. If the sparsity property and bitmap overhead measurements hold, the approach supplies a complementary, model-agnostic route to lowering activation memory traffic and enabling narrower datapaths without retraining or accuracy loss, which would be of practical interest for inference accelerators.

major comments (2)
  1. [Abstract / §4] Abstract (and §4/§5 experimental sections): the headline latency and energy numbers (16-24.3% prefill latency, 17-26.7% prefill energy, etc.) are load-bearing for the contribution, yet the manuscript supplies neither per-layer/per-model MSB sparsity histograms nor explicit bitmap-overhead measurements that would confirm the hybrid representation is cheaper than a dense 2k-bit baseline on the evaluated models.
  2. [§3.2] §3.2 (lightweight sparsity-increasing algorithm): the claim that the algorithm preserves 2k-bit accuracy while increasing usable MSB sparsity is central, but no ablation is shown that quantifies accuracy degradation when the algorithm is disabled or when the assumed zero-concentration is weaker (e.g., on other model families or bit-widths).
minor comments (2)
  1. [§3.1] Notation for the hybrid format (dense LSB + bitmap-compressed MSB) should be introduced with an explicit equation or diagram in §3.1 to avoid ambiguity when comparing against the 2k-bit baseline.
  2. [§4] Table captions in the experimental section should explicitly state the number of runs and whether error bars reflect standard deviation across seeds or models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the empirical support for the claims.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract (and §4/§5 experimental sections): the headline latency and energy numbers (16-24.3% prefill latency, 17-26.7% prefill energy, etc.) are load-bearing for the contribution, yet the manuscript supplies neither per-layer/per-model MSB sparsity histograms nor explicit bitmap-overhead measurements that would confirm the hybrid representation is cheaper than a dense 2k-bit baseline on the evaluated models.

    Authors: We agree that the absence of per-layer/per-model MSB sparsity histograms and explicit bitmap-overhead breakdowns weakens the substantiation of the headline numbers. The reported latency and energy figures were obtained from cycle-accurate simulation of the accelerator that already incorporates bitmap storage and access costs. In the revised manuscript we will add the requested histograms (showing MSB sparsity per layer and model) together with a table breaking down bitmap overhead versus the dense 2k-bit baseline, confirming net traffic reduction on BitNet 3B, Llama2 7B and Llama3 8B. revision: yes

  2. Referee: [§3.2] §3.2 (lightweight sparsity-increasing algorithm): the claim that the algorithm preserves 2k-bit accuracy while increasing usable MSB sparsity is central, but no ablation is shown that quantifies accuracy degradation when the algorithm is disabled or when the assumed zero-concentration is weaker (e.g., on other model families or bit-widths).

    Authors: The algorithm is a lightweight post-quantization pass whose only purpose is to increase MSB sparsity; the manuscript already states that end-to-end accuracy remains identical to the 2k-bit baseline on the three evaluated models. We will add an ablation table in §3.2 (and corresponding text in §5) that reports perplexity/accuracy with the algorithm disabled versus enabled on BitNet 3B, Llama2 7B and Llama3 8B. Extending the ablation to additional model families or bit-widths would require new experiments outside the current scope; we will note this limitation and discuss the dependence on the observed zero-concentration property. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical results rest on observed sparsity property

full rationale

The paper proposes SPARQLe as a hardware-software co-design that exploits an observed statistical property of quantized activations (concentration around zero yielding MSB sparsity) to represent tensors as dense LSB plus bitmap-compressed sparse MSB, with a lightweight sparsity-increasing algorithm. Performance numbers (latency/energy reductions on BitNet 3B, Llama2 7B, Llama3 8B) are obtained from direct accelerator measurements and are not derived from any equations, fitted parameters, or self-citations that reduce the claims to their own inputs by construction. The central premise is an external empirical observation rather than a self-referential derivation, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract only; the central claim rests on an unverified statistical property of post-quantization activations and on the correctness of an unreviewed accelerator design.

axioms (1)
  • domain assumption A significant fraction of activations are concentrated around zero after quantization, producing sparsity in higher-order bits.
    Explicitly stated in the abstract as the property leveraged by the method.
invented entities (1)
  • Hybrid activation format (dense k-bit LSB tensor + sparse k-bit MSB tensor with precision bitmap) no independent evidence
    purpose: To allow k-bit datapath computation while preserving 2k-bit accuracy.
    Core novel representation introduced by the paper.

pith-pipeline@v0.9.1-grok · 5870 in / 1354 out tokens · 23292 ms · 2026-06-28T19:29:43.114412+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. ...

  2. [2]

    Reena Elangovan, Shubham Jain, and Anand Raghunathan. 2020. Ax-BxP: Approximate Blocked Computation for Precision-Reconfigurable Deep Neu- ral Network Acceleration.CoRRabs/2011.13000 (2020). arXiv:2011.13000 https://arxiv.org/abs/2011.13000

  3. [3]

    Aaron Grattafiori et. al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  4. [4]

    Hugo Touvron et. al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288

  5. [5]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. The L...

  6. [6]

    Wonsuk Jang and Thierry Tambe. 2025. BlockDialect: Block-wise Fine- grained Mixed Format Quantization for Energy-Efficient LLM Inference.CoRR abs/2501.01144 (2025). arXiv:2501.01144 doi:10.48550/ARXIV.2501.01144

  7. [7]

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm.arXiv preprint arXiv:2403.05527(2024)

  8. [8]

    Mahoney, and Kurt Keutzer

    Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xi- uyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. 2024. SqueezeLLM: Dense-and-Sparse Quantization. InProceedings of the 41st International Confer- ence on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Helle...

  9. [9]

    Seunghyun Lee, Dongho Ha, Sungbin Kim, Sungwoo Kim, Hyunwuk Lee, and Won Woo Ro. 2025. BitL: A Hybrid Bit-Serial and Parallel Deep Learning Acceler- ator for Critical Path Reduction. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Ma- chinery, New York, NY, USA, 1565–1578. doi:10.1145/...

  10. [10]

    Yuhang Li and Priyadarshini Panda. 2024. TesseraQ: Ultra Low-Bit LLM Post- Training Quantization with Block Reconstruction. arXiv:2410.19103 [cs.LG] https://arxiv.org/abs/2410.19103

  11. [11]

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. 2024. DuQuant: distributing outliers via dual transformation makes stronger quantized LLMs. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red...

  12. [12]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. InMLSys

  13. [13]

    Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving.arXiv preprint arXiv:2405.04532(2024)

  14. [14]

    James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. 2025. Training-Free Activation Sparsity in Large Language Models. InThe Thirteenth International Conference on Learning Representations. Aradhana Mohan Parvathy1, Soumendu Kumar Ghosh2, Shamik Kundu2, Arnab Raha2, Souvik Kundu2, Deepak A. Mathaikutty2, Anand Raghunathan1 1...

  15. [15]

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.CoRRabs/2402.17764 (2024). arXiv:2402.17764 doi:10.48550/ARXIV.2402.17764

  16. [16]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]

  17. [17]

    Mathaikutty, Shamik Kundu, and Soumendu K

    Arnab Raha, Deepak A. Mathaikutty, Shamik Kundu, and Soumendu K. Ghosh

  18. [18]

    doi:10.3389/fhpcp.2025.1570210

    FlexNPU: a dataflow-aware flexible deep learning accelerator for energy- efficient edge devices.Frontiers in High Performance ComputingVolume 3 - 2025 (2025). doi:10.3389/fhpcp.2025.1570210

  19. [19]

    Akshat Ramachandran, Souvik Kundu, and Tushar Krishna. 2025. MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quanti- zation. InProceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25). Association for Computing Machinery, New York, NY, USA, 1193–1209. doi:10.1145/3695053.3730989

  20. [20]

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2024. OmniQuant: Omnidirec- tionally Calibrated Quantization for Large Language Models. InThe Twelfth Inter- national Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openrev...

  21. [21]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. InProceedings of the 40th International Conference on Machine Learning

  22. [22]

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving. InProceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 196–209. https://proceedings.mlsys.org/paper_f...