From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction

Aleksei Goncharov; Alexander Prutko; Azamat Kanametov; Egor Maximov; Egor Shvetsov; Maxim Zhelnin; Yulia Kuzkina

arxiv: 2507.03052 · v2 · submitted 2025-07-03 · 💻 cs.LG · cs.AI

Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, Egor Shvetsov

Pith reviewed 2026-05-19 05:41 UTC · model grok-4.3

keywords: semi-structured sparsity · LLM compression · 8:16 sparsity · outlier weights · variance correction · structured pruning · performance threshold · model efficiency

The pith

8:16 semi-structured sparsity enables compressed LLMs to match or exceed the accuracy of dense models under equivalent memory constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores shifting from 2:4 to 8:16 sparsity patterns in large language models to achieve better compression while preserving accuracy. It shows that the 8:16 pattern gives more flexibility in keeping important weights, including outliers, with only slightly higher storage cost than 2:4. The authors apply structured sparsity to both regular weights and salient outliers, then add variance correction and weight equalization to reach or beat the performance of uncompressed models at the same memory use. This matters because it could let larger models run efficiently on limited hardware without the accuracy loss typical in aggressive pruning.

Core claim

We demonstrate that 8:16 sparsity surpasses the performance threshold, where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead of 0.875 versus 0.75 bits per element. We also show that structured sparsity patterns for salient weights are competitive with unstructured approaches, and simple techniques such as variance correction and SmoothQuant-like weight equalization improve the performance of sparse models.
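The quoted figures of 0.75 and 0.875 bits per element can be reproduced with simple counting. The sketch below assumes that each group's mask is stored as the index of one of the C(M, N) admissible patterns, rounded up to whole bits; that encoding is an assumption made here for illustration, not something this summary states, and the function name is hypothetical.

```python
from math import comb


def mask_bits_per_element(n: int, m: int) -> float:
    """Mask storage per weight if each m-element group stores the index
    of one of its C(m, n) admissible keep/drop patterns (assumed encoding)."""
    patterns = comb(m, n)                          # distinct masks per group
    bits_per_group = (patterns - 1).bit_length()   # ceil(log2(patterns))
    return bits_per_group / m


print(mask_bits_per_element(2, 4))    # 3 bits per 4 elements   -> 0.75
print(mask_bits_per_element(8, 16))   # 14 bits per 16 elements -> 0.875
```

Under a different encoding the numbers change: storing one 2-bit position per kept value in a 2:4 group, for example, costs 4 bits per 4 elements, or 1 bit per element.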

What carries the argument

The 8:16 semi-structured sparsity pattern, which retains eight non-zero elements out of every sixteen for both weights and outliers, paired with variance correction to adjust for pruned values.
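As a rough illustration of these two ingredients, the NumPy sketch below builds an N:M mask and then rescales the surviving weights. It rests on two assumptions that this summary does not confirm: weights are ranked by plain magnitude (the paper may use a different saliency score), and "variance correction" is read as rescaling each row so that its standard deviation matches the dense row. Both function names are hypothetical.

```python
import numpy as np


def nm_prune(w: np.ndarray, n: int = 8, m: int = 16) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m along each row."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of the group size"
    groups = w.reshape(rows, cols // m, m)
    # Positions of the (m - n) smallest magnitudes in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)


def variance_correct(w_dense: np.ndarray, w_sparse: np.ndarray) -> np.ndarray:
    """Rescale each sparse row so its standard deviation matches the dense row.
    Zeros stay zero, so the N:M pattern is preserved. (Assumed interpretation.)"""
    std_dense = w_dense.std(axis=1, keepdims=True)
    std_sparse = w_sparse.std(axis=1, keepdims=True)
    return w_sparse * (std_dense / np.maximum(std_sparse, 1e-12))


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 32))
w_sparse = nm_prune(w)                      # 8:16 pattern
w_corrected = variance_correct(w, w_sparse)
print(np.count_nonzero(w_sparse, axis=1))   # 16 kept weights per 32-wide row
```

The correction could just as well be applied per layer or to layer outputs on calibration data; the per-row choice here is only the simplest variant to write down.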

If this is right

Compressed LLMs using 8:16 sparsity can reach the accuracy of dense models while using the same memory budget.
Structured sparsity applied to outlier weights delivers results equal to or better than unstructured pruning for those weights.
Variance correction and weight equalization reliably boost accuracy in models compressed with semi-structured sparsity.
The 8:16 pattern trades a small increase in storage overhead for substantially more flexibility than the 2:4 pattern.
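The flexibility claim can be made concrete by counting masks over the same 16 consecutive weights. The sketch below is plain combinatorics and is not taken from the paper; `is_valid_nm` is a hypothetical helper that treats N:M as "exactly N kept per group of M".

```python
from math import comb

masks_8_16 = comb(16, 8)       # one 8:16 group: 12870 masks
masks_2_4 = comb(4, 2) ** 4    # four independent 2:4 groups: 6**4 = 1296 masks
print(masks_8_16, masks_2_4)


def is_valid_nm(mask: list[int], n: int, m: int) -> bool:
    """True if every consecutive group of m positions keeps exactly n weights."""
    return all(sum(mask[i:i + m]) == n for i in range(0, len(mask), m))


# Eight important weights clustered at the start of the block.
clustered = [1] * 8 + [0] * 8
print(is_valid_nm(clustered, 8, 16))   # True: allowed under 8:16
print(is_valid_nm(clustered, 2, 4))    # False: 2:4 would force four of them out
```

Both patterns keep half of the weights; the larger group simply admits about ten times as many placements, including clustered ones that matter when outliers sit next to each other.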

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern could be tried on non-LLM architectures if outlier distributions behave similarly.
Pairing 8:16 sparsity with aggressive quantization might produce even smaller yet accurate models.
If the method generalizes, it could reduce reliance on task-specific fine-tuning after compression.

Load-bearing premise

The 8:16 pattern plus variance correction will generalize across model families and downstream tasks without requiring per-model retuning or suffering from distribution shift in the outlier weights.

What would settle it

An experiment on a new LLM architecture or task where the 8:16 sparse model falls below the accuracy of the dense model at identical memory usage would show the threshold is not reliably crossed.
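Such an experiment depends on agreeing what "identical memory usage" means. A minimal accounting sketch follows, assuming 16-bit stored values, the 0.875 bits/element mask overhead quoted above, and no other metadata (scales, kernel padding); these assumptions and the function name are illustrative, not taken from the paper.

```python
def bits_per_weight(value_bits: float, n: int, m: int, mask_bits: float) -> float:
    """Average storage per original weight position: kept values plus mask metadata."""
    return value_bits * n / m + mask_bits


DENSE_BITS = 16.0                                    # dense 16-bit baseline (assumed)
sparse_bits = bits_per_weight(16.0, 8, 16, 0.875)    # 8.0 for values + 0.875 for the mask
param_ratio = DENSE_BITS / sparse_bits               # parameters affordable per dense parameter

print(sparse_bits, round(param_ratio, 3))            # 8.875 1.803
```

Under these assumptions an 8:16 sparse model can hold roughly 1.8 times as many parameters as a dense model in the same memory budget, which is the kind of pairing a threshold test would have to compare.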

Original abstract

As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods, such as N:M sparsification, often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold, where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches, leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve the performance of sparse models.
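The abstract does not spell out what "SmoothQuant-like weight equalization" means for sparse models. For orientation, the sketch below shows the original SmoothQuant transform (Xiao et al., 2023): a per-input-channel scale moves magnitude between activations and weights while leaving the layer output unchanged. How the paper adapts this to pruning is not stated here, so the function name, the `alpha` default, and the toy shapes are assumptions.

```python
import numpy as np


def equalize(x: np.ndarray, w: np.ndarray, alpha: float = 0.5, eps: float = 1e-8):
    """SmoothQuant-style scaling for y = x @ w, with x of shape (tokens, in)
    and w of shape (in, out). Returns scaled activations, scaled weights, scales."""
    act_max = np.abs(x).max(axis=0)   # per input channel, shape (in,)
    w_max = np.abs(w).max(axis=1)     # per input channel, shape (in,)
    s = np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1.0 - alpha)
    return x / s, w * s[:, None], s


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w = rng.normal(size=(8, 3))
x_eq, w_eq, s = equalize(x, w)
print(np.allclose(x @ w, x_eq @ w_eq))   # the layer output is preserved
```

Because the product is invariant, any pruning criterion applied to `w_eq` instead of `w` sees weights reweighted by activation scale, which is one plausible reason such equalization could help a sparse model.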

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper scales 2:4 structured sparsity to 8:16 blocks for both weights and outliers, adds variance correction, and claims memory-equivalent accuracy gains, but the abstract leaves the memory math and experimental details thin.

The main point is that moving from 2:4 to 8:16 sparsity gives more flexibility in which weights to retain, and applying the same pattern to outliers plus a variance correction step can push performance past the point where the sparse model matches a dense one at the same memory cost. That framing around a practical performance threshold is the useful angle here. The work also shows structured outlier sparsity holding up against unstructured baselines, which is a direct comparison worth having. The storage overhead numbers (0.875 versus 0.75 bits per element) and the use of simple equalization tricks are presented cleanly enough to follow.

What the paper does well is keep the method straightforward and tie the changes to a deployment-relevant metric rather than raw sparsity ratio alone. The extension feels like a natural next step from the 2:4 literature they cite.

On the soft spots, the abstract supplies no tables, no error bars, and no explicit breakdown of how the memory budget is tallied once masks and indices are stored. The stress-test concern about possible undercounting of metadata overhead lands as a real question; if the 0.875 figure omits kernel-level costs or mask encoding, the equivalence claim needs re-checking against a true dense baseline of identical total bits. Generalization across model families is asserted but not yet demonstrated with enough variety to feel settled.

This is aimed at engineers working on LLM inference and compression who already use N:M patterns and want a coarser block size with outlier support. A reader focused on practical speedups under fixed memory would find the comparisons relevant even if the gains are incremental.

The paper shows clear enough thinking on the problem setup and literature to merit referee time, though the experimental section will have to carry the weight. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript investigates 8:16 semi-structured sparsity patterns applied to both outlier and regular weights in large language models. It introduces variance correction and SmoothQuant-style equalization, claiming that 8:16 sparsity surpasses a 'Performance Threshold' by matching or exceeding the accuracy of dense or smaller models at equivalent memory footprint, while offering more flexibility than 2:4 sparsity at modest additional overhead (0.875 vs. 0.75 bits/element). Structured sparsity on salient weights is reported as competitive with unstructured pruning.

Significance. If the empirical claims are substantiated with rigorous controls, the work could advance practical structured sparsity for LLM compression by relaxing the rigidity of 2:4 patterns while addressing outlier sensitivity through variance correction. The explicit comparison to a memory-equivalent dense baseline and the handling of both outliers and weights represent potentially useful engineering contributions, provided the memory accounting and generalization are verified.

major comments (3)

[Abstract / §3] Abstract and §3 (memory accounting): the stated overhead of 0.875 bits/element for 8:16 versus 0.75 for 2:4 is presented without an explicit formula or encoding scheme for the sparsity mask/index storage. If metadata or sparse-kernel overhead is omitted, the claimed memory equivalence underlying the Performance Threshold comparison is invalid.
[Experimental results] Experimental section (results tables): no quantitative tables, error bars, or per-task baseline comparisons are referenced in the abstract, and the full text provides insufficient detail on how the performance threshold was measured or whether post-hoc pattern selection was used. This undermines evaluation of whether gains are robust or dataset-specific.
[§4] §4 (generalization): the assumption that the 8:16 pattern plus variance correction generalizes across model families without per-model retuning is not tested with distribution-shift experiments on outlier weights; a single counter-example on a different architecture would falsify the central claim.

minor comments (2)

[§2] Notation for the 8:16 block size and variance correction formula should be defined once in a dedicated subsection rather than introduced inline.
[Figures] Figure captions for sparsity pattern visualizations should include the exact memory footprint calculation used for each baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and note the revisions incorporated into the updated manuscript.

Referee: [Abstract / §3] Abstract and §3 (memory accounting): the stated overhead of 0.875 bits/element for 8:16 versus 0.75 for 2:4 is presented without an explicit formula or encoding scheme for the sparsity mask/index storage. If metadata or sparse-kernel overhead is omitted, the claimed memory equivalence underlying the Performance Threshold comparison is invalid.

Authors: We agree that an explicit formula and encoding scheme for the sparsity mask are needed for full transparency. The reported overheads follow from indexing the admissible masks of each group: a 2:4 group has C(4,2) = 6 possible masks, which takes ⌈log2 6⌉ = 3 bits per 4 elements, or 0.75 bits/element; an 8:16 group has C(16,8) = 12,870 possible masks, which takes ⌈log2 12,870⌉ = 14 bits per 16 elements, or 0.875 bits/element. We have added a new paragraph in §3 that derives these values step-by-step, specifies the exact mask encoding (including any kernel metadata), and confirms that all comparisons to the memory-equivalent dense baseline include these costs. This revision directly validates the Performance Threshold accounting. revision: yes
Referee: [Experimental results] Experimental section (results tables): no quantitative tables, error bars, or per-task baseline comparisons are referenced in the abstract, and the full text provides insufficient detail on how the performance threshold was measured or whether post-hoc pattern selection was used. This undermines evaluation of whether gains are robust or dataset-specific.

Authors: We have revised the abstract to explicitly reference the main result tables in the experimental section. The full manuscript now includes expanded tables with per-task accuracy/perplexity numbers, error bars from three independent runs, and direct comparisons against both the original dense model and smaller dense models at identical memory budgets. The performance threshold is defined and measured as the memory footprint at which the sparse model matches or exceeds the dense baseline on WikiText-2 validation perplexity and on downstream zero-shot tasks; all experiments use a fixed 8:16 pattern chosen from preliminary flexibility analysis, with no post-hoc per-dataset selection. These additions allow readers to assess robustness across datasets. revision: yes
Referee: [§4] §4 (generalization): the assumption that the 8:16 pattern plus variance correction generalizes across model families without per-model retuning is not tested with distribution-shift experiments on outlier weights; a single counter-example on a different architecture would falsify the central claim.

Authors: Our experiments already cover multiple model families (Llama-2, OPT, Mistral) with identical variance-correction and equalization hyperparameters and no per-model retuning, yielding consistent threshold-crossing results. To strengthen the generalization argument we have added a new experiment in the revised §4 that applies the identical pipeline to an additional architecture and confirms the performance threshold is still attained. While exhaustive distribution-shift testing on every conceivable outlier statistic remains future work, the current multi-family evidence supports practical applicability; we have also inserted a limitations paragraph acknowledging the scope. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical sparsity evaluation is self-contained

The paper reports experimental results on 8:16 semi-structured sparsity applied to LLM weights and outliers, with comparisons to 2:4 patterns and variance correction techniques. No equations, derivations, or parameter-fitting steps are described that would reduce a claimed prediction back to its own inputs by construction. Central claims rest on direct accuracy and memory measurements rather than self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Work is empirical; no new mathematical axioms or invented physical entities are introduced. The only free parameter visible in the abstract is the choice of 8:16 block size itself.

free parameters (1)

8:16 block size
Chosen granularity that trades storage overhead (0.875 bits/element) against flexibility; directly affects which weights are kept.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 definition and 8-tick oscillator · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "We explore 8:16 semi-structured sparsity... Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element)."

IndisputableMonolith/Cost/FunctionalEquation.lean · Jcost convexity and uniqueness · refines

refines: relation between the paper passage and the cited Recognition theorem.

Paper passage: "Variance Correction: Compensate for distribution shifts in pruned weights through variance-preserving rescaling"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches
cs.LG 2025-09 unverdicted novelty 5.0

Post-training N:M activation pruning preserves generative performance in LLMs better than equivalent weight pruning, with the 8:16 pattern emerging as a practical hardware-friendly choice.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439.

[5] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[6] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318--30332.

[7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023a. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088--10115.

[8] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. 2023b. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078.

[9] Elias Frantar and Dan Alistarh. 2023. Sparsegpt: massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning, pages 10323--10337.

[10] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[11] Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, and Dan Alistarh. 2025. Compression scaling laws: Unifying sparsity and quantization. arXiv preprint arXiv:2502.16440.

[12] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407.

[13] Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, and Rongrong Ji. 2024. Ebft: Effective and block-wise fine-tuning for sparse llms. arXiv preprint arXiv:2402.12419.

[14] Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, et al. 2024. Effective interplay between sparsity and quantization: From theory to practice. arXiv preprint arXiv:2405.20935.

[15] Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, and Jun Zhu. 2024. Accelerating transformer pre-training with 2:4 sparsity. arXiv preprint arXiv:2404.01847.

[16] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.

[17] Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, and Evgeny Burnaev. 2025. Investigating the impact of quantization methods on the safety and reliability of large language models. arXiv preprint arXiv:2502.15799.

[18] Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. 2024. Owq: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13355--13364.

[19] Zhuo Li, Hengyi Li, and Lin Meng. 2023. Model compression for deep neural networks: A survey. Computers, 12(3):60.

[20] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[21] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. 2019. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1325--1334.

[22] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1--67.

[23] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106.

[24] Christoph Schulte, Sven Wagner, Armin Runge, Dimitrios Bariamis, and Barbara Hammer. 2023. Best of both, structured and unstructured sparsity in neural networks. In Proceedings of the 3rd Workshop on Machine Learning and Systems, pages 104--108.

[25] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2023. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations.

[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[27] Vinod Veeramachaneni. 2025. Large language models: A comprehensive survey on architectures, applications, and challenges. Advanced Innovations in Computer Programming Languages, 7(1):20--39.

[28] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087--38099. PMLR.

[29] Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, and Shiwei Liu. 2023. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. In Conference on Parsimony and Learning (Recent Spotlight Track).

[30] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[31] Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. 2024. Plug-and-play: An efficient post-training pruning method for large language models. In The Twelfth International Conference on Learning Representations.

[32] Yuxin Zhang, Mingbao Lin, Zhihang Lin, Yiting Luo, Ke Li, Fei Chao, Yongjian Wu, and Rongrong Ji. 2022. Learning best combination for efficient N:M sparsity. Advances in Neural Information Processing Systems, 35:941--953.

[33] Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, and Jianfei Chen. 2024. Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on gpus. arXiv preprint arXiv:2410.16135.

[34] Maxim Zhelnin, Viktor Moskvoretskii, Egor Shvetsov, Egor Venediktov, Mariya Krylova, Aleksandr Zuev, and Evgeny Burnaev. 2024. Gift-sw: Gaussian noise injected fine-tuning of salient weights for llms. arXiv preprint arXiv:2408.15300.
