pith. sign in

arxiv: 2606.26587 · v1 · pith:HO6FZ33Gnew · submitted 2026-06-25 · 💻 cs.LG · cs.AI

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

Pith reviewed 2026-06-26 05:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords activation sparsityFP4 quantizationLLM inferencesparse-dense decompositiontraining-free methodN:M sparsityoutlier handling
0
0 comments X

The pith

SharQ recovers 43-63 percent of the FP4 accuracy gap on LLMs by splitting activations into a quantized sparse backbone and a compensating dense residual

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free method called SharQ that pairs N:M sparsity with FP4 quantization for LLM activations. Activations contain input-dependent outliers that make direct FP4 scaling inaccurate, while applying sparsity masks alone loses moderate values and adds error. SharQ creates an adaptive mask to isolate an outlier-heavy sparse backbone, quantizes that backbone to FP4, then builds a dense residual measured against the already-quantized backbone values. Two FP4 matrix multiplies—one sparse for the backbone and one dense for correction—share the same weights but use different scales, with all preparation done in one fused kernel. If the approach holds, models could run at lower precision and higher speed on supported hardware without any retraining or calibration data.

Core claim

For each activation tensor, SharQ generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone, quantizes the backbone to FP4, and defines a dense residual relative to the quantized sparse backbone rather than the original values. A sparse FP4 GEMM handles the backbone while a dense FP4 GEMM compensates for both mask loss and quantization error; the paths share one FP4 weight payload with path-specific scales, and a fused preparation kernel absorbs mask generation, residual construction, and normalization.

What carries the argument

online sparse-dense decomposition that defines the dense residual relative to the quantized sparse backbone instead of the unquantized sparse values

If this is right

  • The method recovers 43-63 percent of the accuracy gap between NVFP4 and FP16 across language and vision-language tasks on the tested models
  • It delivers 2.2-2.4 times lower latency than FP16 and 1.2-1.4 times higher throughput than FP8 on RTX 5090 hardware
  • The same decomposition works without changes across NVFP4, HiF4, and MXFP4 formats
  • When combined with other attention optimizations it yields up to 1.58 times speedup on video generation workloads

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The residual-against-quantized-values choice may generalize to other bit widths or sparsity ratios where direct masking hurts scale factors
  • The fused kernel pattern could reduce overhead in pipelines that already combine sparsity with low-precision compute
  • Testing the same split on training-time activations might reveal whether the decomposition reduces outlier sensitivity earlier in the pipeline

Load-bearing premise

The online generation of the input-adaptive N:M mask, construction of the dense residual relative to the quantized sparse backbone, and fused preparation kernel can be performed with negligible overhead and without model-specific tuning on the evaluated hardware and model families.

What would settle it

Measuring accuracy recovery and latency on the same models after disabling the fused kernel or after computing the residual against unquantized sparse values instead would show whether the claimed recovery range and speed gains depend on those exact steps.

Figures

Figures reproduced from arXiv: 2606.26587 by Haoqian Meng, Huaqing Zheng, Peng Zhang, Wenyuan Liu, Xindian Ma, Yafei Zhao, Yilun Luo.

Figure 1
Figure 1. Figure 1: Overview of SharQ. SharQ extracts an outlier-dominated N:M sparse backbone, constructs [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Kernel-level NVFP4 dataflow of SharQ. The fused activation preparation kernel performs [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: End-to-end serving efficiency of SharQ on RTX 5090 with vLLM (Llama-3.1-8B, batch [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Normalized prefill latency breakdown on Llama-3.1-8B. NVFP4 serves as the 100% [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that dominate block scales in FP4 quantization, and directly applying N:M sparsity masks discards moderate values, coupling sparsification loss with quantization error. We introduce SharQ, a training-free inference method that bridges activation sparsity and FP4 quantization through an online sparse--dense decomposition. For each activation tensor, SharQ generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone, quantizes it to FP4, and defines a dense residual relative to the quantized sparse backbone rather than the unquantized sparse values. A sparse FP4 GEMM processes the backbone while a dense FP4 GEMM compensates for both mask-induced activation loss and sparse-path quantization error. The two paths share a single FP4 weight payload with path-specific scale views, and a fused preparation kernel absorbs mask generation, residual construction, and layer normalization into one operator. SharQ requires no calibration data, retraining, or model-specific tuning. Evaluated on Llama-3.1-8B, Qwen2.5-7B, Qwen3-30B-A3B, and Qwen3-VL-8B, SharQ recovers 43--63% of the NVFP4-to-FP16 accuracy gap across language and vision-language tasks, and generalizes across NVFP4, HiF4, and MXFP4 formats. On an RTX 5090, SharQ delivers 2.2--2.4$\times$ latency reduction over FP16 and 1.2--1.4$\times$ throughput improvement over FP8 in language model serving, and up to 1.58$\times$ speedup on Wan2.2-T2V-A14B video generation when combined with SageAttention. Our code is available at https://github.com/actypedef/SharQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SharQ, a training-free method for LLM inference that combines N:M activation sparsity with FP4 quantization via an online input-adaptive sparse-dense decomposition: an N:M mask extracts an outlier-dominated sparse backbone quantized to FP4, while a dense residual (defined relative to the quantized sparse values) compensates mask and quantization errors using a second FP4 GEMM path. The two paths share a single FP4 weight payload with path-specific scales; a fused preparation kernel handles mask generation, residual construction, and layer norm. Evaluated on Llama-3.1-8B, Qwen2.5-7B, Qwen3-30B-A3B, and Qwen3-VL-8B, it recovers 43-63% of the NVFP4-to-FP16 accuracy gap across tasks and reports 2.2-2.4× latency reduction vs FP16 (plus 1.2-1.4× throughput vs FP8) on RTX 5090, with generalization to other FP4 formats and up to 1.58× speedup on video generation when combined with SageAttention.

Significance. If the overhead of the input-adaptive mask generation and fused kernel is truly negligible without model-specific tuning, and the accuracy recovery holds under the residual construction, SharQ would provide a practical way to exploit emerging hardware support for semi-structured sparsity and FP4 without retraining or calibration, improving the efficiency-accuracy frontier for both language and multimodal models.

major comments (3)
  1. [§4.3 and §5.3] §4.3 and §5.3 (latency evaluation): The 2.2-2.4× latency reduction claim over FP16 rests on the fused preparation kernel absorbing mask generation and residual construction with negligible cost, yet no per-component timing breakdown (e.g., mask selection time vs. the two GEMM paths) or scaling analysis with activation size is provided; without this, the net speedup cannot be verified against the stress-test concern that input-dependent N:M selection may not remain negligible.
  2. [§4.2] §4.2 (residual definition): The dense residual is constructed relative to the quantized sparse backbone rather than the unquantized sparse values, which is presented as compensating both mask loss and sparse-path quantization error, but no error analysis or bound is given showing why this choice yields the reported 43-63% gap recovery rather than simply adding a second quantization error term.
  3. [Table 2 and §5.1] Table 2 and §5.1 (ablation on mask adaptivity): The accuracy numbers are reported only for the full SharQ method; an ablation isolating the contribution of the input-adaptive N:M mask versus a static mask (or versus direct N:M on unquantized activations) is absent, making it impossible to confirm that adaptivity is load-bearing for the claimed recovery percentages.
minor comments (2)
  1. The abstract and §1 cite NVFP4, HiF4, and MXFP4 but the experimental tables do not explicitly label which format was used for each row; adding a column or footnote would improve clarity.
  2. Figure 3 (kernel diagram) uses abbreviations (e.g., 'SP-GEMM', 'DP-GEMM') without an accompanying legend in the caption; expanding the caption would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major point below, agreeing where additional evidence or clarification is needed and outlining the planned revisions.

read point-by-point responses
  1. Referee: [§4.3 and §5.3] §4.3 and §5.3 (latency evaluation): The 2.2-2.4× latency reduction claim over FP16 rests on the fused preparation kernel absorbing mask generation and residual construction with negligible cost, yet no per-component timing breakdown (e.g., mask selection time vs. the two GEMM paths) or scaling analysis with activation size is provided; without this, the net speedup cannot be verified against the stress-test concern that input-dependent N:M selection may not remain negligible.

    Authors: We agree that a per-component timing breakdown and scaling analysis would strengthen verification of the net speedup. In the revised version we will add detailed measurements separating mask generation, residual construction, and the two GEMM paths, together with scaling behavior across activation sizes on the evaluated hardware. revision: yes

  2. Referee: [§4.2] §4.2 (residual definition): The dense residual is constructed relative to the quantized sparse backbone rather than the unquantized sparse values, which is presented as compensating both mask loss and sparse-path quantization error, but no error analysis or bound is given showing why this choice yields the reported 43-63% gap recovery rather than simply adding a second quantization error term.

    Authors: The residual is defined after quantization so that the dense path directly offsets the combined mask and quantization error present in the sparse FP4 output. While the manuscript relies on empirical recovery rather than a formal bound, we will expand §4.2 with a short discussion of the motivation and why the chosen construction avoids simply accumulating an independent second quantization term. revision: partial

  3. Referee: [Table 2 and §5.1] Table 2 and §5.1 (ablation on mask adaptivity): The accuracy numbers are reported only for the full SharQ method; an ablation isolating the contribution of the input-adaptive N:M mask versus a static mask (or versus direct N:M on unquantized activations) is absent, making it impossible to confirm that adaptivity is load-bearing for the claimed recovery percentages.

    Authors: We acknowledge the value of isolating the adaptivity contribution. We will add the requested ablation (adaptive N:M versus static mask and versus direct N:M on unquantized activations) to Table 2 or a new table in §5.1 of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical algorithmic method with no derivations or fitted predictions

full rationale

The paper describes a training-free inference algorithm (online N:M mask generation, sparse-dense decomposition, fused kernel) evaluated empirically on specific models and hardware. No mathematical derivation chain, first-principles predictions, parameter fitting, or self-citation load-bearing steps are present in the provided text. All performance claims (accuracy gap recovery, latency reductions) are direct experimental measurements rather than reductions to inputs by construction. This matches the default expectation of no significant circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an algorithmic inference procedure rather than a mathematical derivation; no free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5922 in / 1236 out tokens · 18051 ms · 2026-06-26T05:39:32.217364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

176 extracted references · 2 canonical work pages

  1. [2]

    Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated LLM s. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=dfqsW38v1X

  2. [3]

    Piqa: Reasoning about physical commonsense in natural language, 2019

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641

  3. [4]

    Oscillation-reduced mxfp4 training for vision transformers, 2025

    Yuxiang Chen, Haocheng Xi, Jun Zhu, and Jianfei Chen. Oscillation-reduced mxfp4 training for vision transformers, 2025. URL https://arxiv.org/abs/2502.20853

  4. [7]

    OCP Microscaling (MX) Specification

    Bita Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Mathew Zhao, Ritchie amd Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt, Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander Heinecke, Andres Rodriguez, Martin Langhammer, Summer Deng, Maxim Naumov, Paulius M...

  5. [8]

    Microscaling data formats for deep learning, 2023 b

    Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez,...

  6. [9]

    Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URL https://arxiv.org/abs/2208.07339

  7. [11]

    S parse GPT : Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. S parse GPT : Massive language models can be accurately pruned in one-shot. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 1...

  8. [12]

    Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/2210.17323

  9. [13]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  10. [14]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017. URL https://arxiv.org/abs/1612.00837

  11. [15]

    Akhil Mathur

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, and et al. Akhil Mathur. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  12. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  13. [17]

    Fgmp: Fine-grained mixed-precision weight and activation quantization for hardware-accelerated llm inference, 2025

    Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, and Brucek Khailany. Fgmp: Fine-grained mixed-precision weight and activation quantization for hardware-accelerated llm inference, 2025. URL https://arxiv.org/abs/2504.14152

  14. [18]

    OSTQ uant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

    Xing Hu, Yuan Cheng, Dawei Yang, Zhixuan Chen, Zukang Xu, JiangyongYu, XUCHEN, Zhihang Yuan, Zhe jiang, and Sifan Zhou. OSTQ uant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/f...

  15. [19]

    Hudson and Christopher D

    Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506

  16. [20]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  17. [21]

    Amxfp4: Taming activation outliers with asymmetric microscaling floating-point for 4-bit llm inference, 2025

    Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, and Jungwook Choi. Amxfp4: Taming activation outliers with asymmetric microscaling floating-point for 4-bit llm inference, 2025. URL https://arxiv.org/abs/2411.09909

  18. [23]

    Evaluating object hallucination in large vision-language models, 2023

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023. URL https://arxiv.org/abs/2305.10355

  19. [24]

    Duquant++: Fine-grained rotation enhances microscaling fp4 quantization

    Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, and Zhenan Sun. Duquant++: Fine-grained rotation enhances microscaling fp4 quantization. 2026. URL https://api.semanticscholar.org/CorpusID:287634207

  20. [25]

    Awq: Activation-aware weight quantization for llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. In MLSys, 2024

  21. [27]

    Spinquant: Llm quantization with learned rotations, 2024

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations, 2024. URL https://arxiv.org/abs/2405.16406

  22. [28]

    Deja vu: Contextual sparsity for efficient llms at inference time

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137--22176. PMLR, 2023

  23. [29]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  24. [33]

    Nvidia blackwell architecture technical brief, 2024

    Nvidia. Nvidia blackwell architecture technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture

  25. [34]

    cuDNN Frontend API v1.14.0: Block-Scaling Operation

    NVIDIA Corporation . cuDNN Frontend API v1.14.0: Block-Scaling Operation . https://docs.nvidia.com/deeplearning/cudnn/frontend/v1.14.0/operations/BlockScaling.html, 2024 a . Accessed: 2025-09-16

  26. [35]

    PTX: Parallel Thread Execution, ISA Version 8.4

    NVIDIA Corporation . PTX: Parallel Thread Execution, ISA Version 8.4 . https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-mma-instructions, 2024 b . Accessed: 2025-09-16

  27. [36]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germ\' a n Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

  28. [37]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  29. [38]

    Winogrande: An adversarial winograd schema challenge at scale, 2019

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.10641

  30. [39]

    Resq: Mixed-precision quantization of large language models with low-rank residuals, 2025

    Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals, 2025. URL https://arxiv.org/abs/2412.14363

  31. [40]

    Omniquant: Omnidirectionally calibrated quantization for large language models, 2024

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models, 2024. URL https://arxiv.org/abs/2308.13137

  32. [42]

    Towards vqa models that can read, 2019

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019. URL https://arxiv.org/abs/1904.08920

  33. [43]

    Powerinfer: Fast large language model serving with a consumer-grade gpu

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590--606, 2024

  34. [45]

    Zico Kolter, and Zhuang Liu

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models, 2024 b . URL https://arxiv.org/abs/2402.17762

  35. [46]

    A simple and effective pruning approach for large language models

    Mingjie Sun, Zhuang Liu, Anna Bair, and Zico Kolter. A simple and effective pruning approach for large language models. In International Conference on Learning Representations, volume 2024, pages 4942--4964, 2024 c

  36. [47]

    Flatquant: Flatness matters for LLM quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. Flatquant: Flatness matters for LLM quantization. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Internatio...

  37. [49]

    Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191

  38. [50]

    Smoothquant: Accurate and efficient post-training quantization for large language models, 2024

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models, 2024. URL https://arxiv.org/abs/2211.10438

  39. [51]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Dayiheng Liu, Fan Zhou, Fei Huang, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Z...

  40. [53]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization, 2025

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization, 2025. URL https://arxiv.org/abs/2411.10958

  41. [54]

    Atom: Low-bit quantization for efficient and accurate llm serving

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. In P. Gibbons, G. Pekhimenko, and C. De Sa, editors, Proceedings of Machine Learning and Systems, volume 6, pages 196--209, 2024. URL https://proceedings.m...

  42. [55]

    2016 , eprint=

    Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures , author=. 2016 , eprint=

  43. [56]

    2024 , url=

    Q-VLM: Post-training Quantization for Large Vision-Language Models , author=. 2024 , url=

  44. [57]

    MLSys , year=

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration , author=. MLSys , year=

  45. [58]

    2024 , eprint=

    CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression , author=. 2024 , eprint=

  46. [59]

    2023 , eprint=

    Matrix Compression via Randomized Low Rank and Low Precision Factorization , author=. 2023 , eprint=

  47. [60]

    The Twelfth International Conference on Learning Representations , year=

    The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction , author=. The Twelfth International Conference on Learning Representations , year=

  48. [61]

    2024 , eprint=

    ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models , author=. 2024 , eprint=

  49. [62]

    2023 , editor =

    Frantar, Elias and Alistarh, Dan , booktitle =. 2023 , editor =

  50. [63]

    2023 , eprint=

    Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning , author=. 2023 , eprint=

  51. [64]

    2021 , eprint=

    Post-Training Quantization for Vision Transformer , author=. 2021 , eprint=

  52. [65]

    2023 , eprint=

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers , author=. 2023 , eprint=

  53. [66]

    arXiv preprint arXiv:2305.14314 , year=

    QLoRA: Efficient Finetuning of Quantized LLMs , author=. arXiv preprint arXiv:2305.14314 , year=

  54. [67]

    2024 , eprint=

    SqueezeLLM: Dense-and-Sparse Quantization , author=. 2024 , eprint=

  55. [68]

    2024 , eprint=

    OneBit: Towards Extremely Low-bit Large Language Models , author=. 2024 , eprint=

  56. [69]

    Rajarshi Saha, Varun Srivastava and Mert Pilanci , booktitle=

  57. [70]

    2023 , eprint=

    The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction , author=. 2023 , eprint=

  58. [71]

    arXiv preprint arXiv:2406.01721 , year=

    DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs , author=. arXiv preprint arXiv:2406.01721 , year=

  59. [72]

    2024 , eprint=

    Massive Activations in Large Language Models , author=. 2024 , eprint=

  60. [73]

    2022 , eprint=

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale , author=. 2022 , eprint=

  61. [74]

    2024 , eprint=

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models , author=. 2024 , eprint=

  62. [75]

    Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving , url =

    Zhao, Yilong and Lin, Chien-Yu and Zhu, Kan and Ye, Zihao and Chen, Lequn and Zheng, Size and Ceze, Luis and Krishnamurthy, Arvind and Chen, Tianqi and Kasikci, Baris , booktitle =. Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving , url =

  63. [76]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  64. [77]

    2025 , eprint=

    Qwen2.5 Technical Report , author=. 2025 , eprint=

  65. [78]

    2024 , eprint=

    Phi-4 Technical Report , author=. 2024 , eprint=

  66. [79]

    arXiv preprint arXiv:2309.15531 , year=

    Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models , author=. arXiv preprint arXiv:2309.15531 , year=

  67. [80]

    2024 , howpublished =

    Ministral-8B-Instruct-2410 , author =. 2024 , howpublished =

  68. [81]

    2024 , howpublished =

  69. [82]

    arXiv preprint arXiv:1905.07830 , year=

    Hellaswag: Can a machine really finish your sentence? , author=. arXiv preprint arXiv:1905.07830 , year=

  70. [83]

    arXiv preprint arXiv:2403.12544 , year=

    Affinequant: Affine transformation quantization for large language models , author=. arXiv preprint arXiv:2403.12544 , year=

  71. [84]

    FlatQuant: Flatness Matters for

    Yuxuan Sun and Ruikang Liu and Haoli Bai and Han Bao and Kang Zhao and Yuening Li and Jiaxin Hu and Xianzhi Yu and Lu Hou and Chun Yuan and Xin Jiang and Wulong Liu and Jun Yao , editor =. FlatQuant: Flatness Matters for. Forty-second International Conference on Machine Learning,. 2025 , url =

  72. [85]

    CoRR , volume =

    Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher , title =. CoRR , volume =. 2016 , url =. 1609.07843 , timestamp =

  73. [86]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. CoRR , volume =. 2019 , url =. 1910.10683 , timestamp =

  74. [87]

    Proceedings of the International Conference on Learning Representations (ICLR) , year=

    Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

  75. [88]

    Paperno, Denis and Kruszewski, Germ\'. The. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month =. 2016 , address =

  76. [89]

    2019 , eprint=

    PIQA: Reasoning about Physical Commonsense in Natural Language , author=. 2019 , eprint=

  77. [90]

    2019 , eprint=

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. 2019 , eprint=

  78. [91]

    arXiv:1803.05457v1 , year =

    Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. arXiv:1803.05457v1 , year =

  79. [92]

    2019 , eprint=

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

  80. [93]

    arXiv preprint arXiv:2210.09261 , year=

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. arXiv preprint arXiv:2210.09261 , year=

Showing first 80 references.