A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

Earl Killian

arxiv: 2605.14929 · v1 · pith:B6LWCBTYnew · submitted 2026-05-14 · 💻 cs.LG · cs.AR

Pith reviewed 2026-06-30 21:00 UTC · model grok-4.3

keywords post-training quantization · large language models · FP6 · weight reconstruction error · codebook search · LUT decode · hardware-aware quantization · per-layer scaling

The pith

Scaled Outer Product quantization finds per-layer codebook and scale combinations that let a 6.5-bit FP6 format beat standard 8-bit FP8 on weight reconstruction error for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Scaled Outer Product (SOP) as a post-training method that searches each layer of an LLM for pairs of fixed and dynamic codebooks, chosen per block by a selection bit, together with signed per-block scales and activation-weighted cosine selection. It adds outlier handling, sparse-residual correction, and a hardware-efficient LUT output format to support decode on per-layer LUT hardware. Across six model families, the recommended 6.5 bits-per-weight operating point produces lower weight reconstruction error than a conventional per-layer power-of-two FP8 baseline while using 1.5 bits per weight less storage. A sympathetic reader would care because the result indicates that small, carefully scaled codebooks can replace the higher-precision format that is currently deployed, freeing memory without increasing reconstruction error. Full evaluation across the 4.5 to 6 bits-per-weight range, including layer promotion and sparse residual correction, is deferred to a companion paper.

Core claim

Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights that combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks are hosted in LUT SRAM. A new hardware-efficient LUT output format is proposed. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-l

What carries the argument

Scaled Outer Product (SOP), a per-layer search over fixed/dynamic codebook pairs plus scales and corrections that minimizes reconstruction error for hardware LUT decode.
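
As a rough illustration of the search described above, the sketch below encodes one block of weights by trying two codebooks and both scale signs, then keeps the lowest-error combination. The codebooks `ASYM` and `DENSE`, the peak-based scale, and the squared-error criterion are assumptions made for this example. They are not the NF4, BOF4, Split87, SH4, or DD4 tables, and the paper's stated selection criterion is activation-weighted cosine rather than plain squared error.

```python
import random

# Hypothetical asymmetric 4-bit codebooks (16 entries in [-1, 1]).
# Placeholders only; NOT the NF4/BOF4/Split87/SH4/DD4 tables from the paper.
ASYM = [-1.0] + [k / 8 for k in range(-6, 9)]   # 7 negatives, 0, 8 positives
DENSE = [x ** 3 for x in ASYM]                  # same layout, denser near zero
CODEBOOKS = (ASYM, DENSE)


def quantize_block(block, codebook, scale):
    """Map each weight to the nearest scaled codebook entry; return indices and squared error."""
    idx = [min(range(len(codebook)), key=lambda k: (w - scale * codebook[k]) ** 2)
           for w in block]
    err = sum((w - scale * codebook[i]) ** 2 for w, i in zip(block, idx))
    return idx, err


def encode_block(block):
    """Try every (codebook, scale sign) pair and keep the lowest-error encoding."""
    peak = max(abs(w) for w in block)
    best = None
    for select_bit, codebook in enumerate(CODEBOOKS):
        for scale in (peak, -peak):             # signed per-block scale
            idx, err = quantize_block(block, codebook, scale)
            if best is None or err < best["err"]:
                best = {"select": select_bit, "scale": scale, "idx": idx, "err": err}
    return best


def decode_block(enc):
    """LUT-style decode: the selection bit picks the table, the scale multiplies the entry."""
    codebook = CODEBOOKS[enc["select"]]
    return [enc["scale"] * codebook[i] for i in enc["idx"]]


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(16)]
enc = encode_block(block)
print("selection bit:", enc["select"], "scale:", enc["scale"], "error:", enc["err"])
```

Because the codebooks are asymmetric, negating the scale mirrors the grid, so the sign of the scale acts as one extra degree of freedom per block at the cost of a single bit.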

If this is right

The 6.5 bpw FP6 point delivers lower reconstruction error than 8.0 bpw FP8 at reduced storage cost.
Block-scaled small atoms with chosen scale precision can replace conventionally deployed FP8.
The method is designed to deliver near-lossless fidelity across the 4.5-6 bpw range when layer promotion and sparse residual correction are included.
The proposed hardware-efficient LUT output format (HIF) is intended to improve performance, energy, and cost on supported hardware.
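
The storage figures in the list above follow from simple arithmetic, sketched here. The block size of 16 and the 8-bit scale are assumptions chosen because they land on 6.5 bpw; this page does not state the paper's actual block layout. The 7-billion-weight model is likewise hypothetical.

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Average storage per weight for a block-scaled format."""
    return element_bits + scale_bits / block_size


def weight_bytes(num_weights, bpw):
    """Total weight storage in bytes at a given bits-per-weight."""
    return num_weights * bpw / 8


# Assumed layout: 6-bit elements plus one 8-bit scale per 16 weights.
fp6 = bits_per_weight(element_bits=6, scale_bits=8, block_size=16)
fp8 = 8.0                       # baseline: 8-bit elements, per-layer scale amortized to ~0
saving = 1 - fp6 / fp8

n = 7_000_000_000               # hypothetical model size
print(f"FP6 point: {fp6} bpw, saving vs FP8: {saving:.2%}")
print(f"weights: {weight_bytes(n, fp6) / 1e9:.2f} GB vs {weight_bytes(n, fp8) / 1e9:.2f} GB")
```

Under these assumptions the 1.5 bpw difference is an 18.75% reduction in weight storage, independent of model size.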

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Running the same per-layer search on a new model architecture would be needed to confirm whether the FP6 advantage holds.
The storage savings could allow a larger model to fit in the same on-device memory budget.
Hardware support for the proposed LUT output format would be required to realize the claimed energy and cost gains.
The approach could be tested on activation tensors if the same per-layer LUT hardware is available.

Load-bearing premise

The per-layer search procedure and chosen codebook/scale combinations will generalize to unseen models and tasks without the search introducing selection bias that inflates the reported improvement.

What would settle it

Measuring weight reconstruction error for the E2M3sUE4M4 FP6 point versus the E4M3 FP8 baseline on a model family outside the six families already tested.
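
A minimal sketch of such a measurement follows, under loud assumptions: the grids use the standard E2M3 (bias 1) and E4M3 (bias 7, top code reserved for NaN) encodings, the per-block scale is an unquantized real number rather than the paper's UE4M4 encoding, the block size is 16, and the weights are synthetic Gaussian samples rather than real model weights. It shows the shape of the comparison, not the paper's procedure or results.

```python
import math
import random


def minifloat_grid(exp_bits, man_bits, bias, drop_top=False):
    """Non-negative values of a small float format with subnormals and no infinities."""
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            vals.append(frac * 2.0 ** (1 - bias) if e == 0
                        else (1 + frac) * 2.0 ** (e - bias))
    if drop_top:                # E4M3 reserves its largest code for NaN
        vals.pop()
    return vals                 # ascending, so vals[-1] is the maximum


E2M3 = minifloat_grid(2, 3, bias=1)                  # maximum 7.5
E4M3 = minifloat_grid(4, 3, bias=7, drop_top=True)   # maximum 448


def nearest(x, grid):
    """Round |x| to the nearest grid value and restore the sign."""
    return math.copysign(min(grid, key=lambda g: abs(abs(x) - g)), x)


def quant_block_scaled(w, grid, block=16):
    """Per-block real-valued scale mapping the block peak onto the grid maximum."""
    out = []
    for i in range(0, len(w), block):
        blk = w[i:i + block]
        s = max(abs(v) for v in blk) / grid[-1] or 1.0
        out += [s * nearest(v / s, grid) for v in blk]
    return out


def quant_per_tensor_pot(w, grid):
    """A single power-of-two scale shared by the whole tensor."""
    peak = max(abs(v) for v in w)
    s = 2.0 ** math.ceil(math.log2(peak / grid[-1])) if peak else 1.0
    return [s * nearest(v / s, grid) for v in w]


def rel_error(w, q):
    """Squared reconstruction error relative to the squared weight norm."""
    return sum((a - b) ** 2 for a, b in zip(w, q)) / sum(a * a for a in w)


random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(1024)]   # synthetic stand-in for a layer
print("block-scaled E2M3:", rel_error(w, quant_block_scaled(w, E2M3)))
print("per-tensor POT E4M3:", rel_error(w, quant_per_tensor_pot(w, E4M3)))
```

Running the same two functions over the weight tensors of a held-out model family would be one way to probe the claim; which format wins on real weights is exactly what the experiment is meant to determine.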

Original abstract

Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5-6 bits per weight on hardware with per-layer LUT decode. The methodology combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) is proposed to improve performance, energy, and cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw) at 1.5 bpw lower storage cost, demonstrating that block-scaled small atoms with carefully chosen scale precision can replace conventionally deployed FP8. Full evaluation across the 4.5-6 bpw range, including layer promotion and sparse residual correction, is reported in a companion paper.
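
The "multiple-choice knapsack promotion" named in the abstract can be pictured as follows: every layer must take exactly one operating point, each with a storage cost and an error, and the total cost is capped. The dynamic program below is a generic textbook formulation offered as a sketch. The function name `promote_layers`, the integer cost units, and all numbers are hypothetical, and the paper's own solver and objective are not described on this page.

```python
def promote_layers(options, budget):
    """Multiple-choice knapsack by dynamic programming.

    options[l] lists (cost, error) choices for layer l, with integer costs.
    Exactly one choice is taken per layer. Returns (total_error, picks) with
    minimum total error subject to total cost <= budget, or None if infeasible.
    """
    best = {0: (0.0, [])}       # total cost so far -> (lowest error, chosen indices)
    for choices in options:
        nxt = {}
        for cost_so_far, (err, picks) in best.items():
            for k, (cost, e) in enumerate(choices):
                c = cost_so_far + cost
                if c > budget:
                    continue
                if c not in nxt or err + e < nxt[c][0]:
                    nxt[c] = (err + e, picks + [k])
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda t: t[0])


# Hypothetical example: three equal-size layers, costs in units of 0.5 bpw
# (9 -> 4.5 bpw, 11 -> 5.5 bpw, 13 -> 6.5 bpw), made-up error values.
layers = [
    [(9, 0.10), (11, 0.04), (13, 0.01)],
    [(9, 0.50), (11, 0.20), (13, 0.05)],   # the sensitive layer
    [(9, 0.06), (11, 0.03), (13, 0.02)],
]
total_error, picks = promote_layers(layers, budget=33)   # average of 5.5 bpw
print(picks, round(total_error, 4))
```

In this toy instance the solver promotes the sensitive middle layer to the highest precision and pays for it by demoting the least sensitive layer, which is the behavior layer promotion is meant to capture.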

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper lays out a per-layer search over codebook pairs and scales for LLM weight quantization, claiming a 6.5 bpw FP6 point beats standard FP8 on reconstruction error, but the supporting numbers sit in a companion paper.

This manuscript describes SOP, a post-training method that searches per layer for fixed-plus-dynamic codebook pairs, signed per-block scales, activation-weighted cosine selection, and knapsack-based layer promotion, plus a new HIF LUT format. The headline empirical claim is that the E2M3sUE4M4 FP6 configuration at 6.5 bpw shows lower weight reconstruction error than a conventional per-layer-POT FP8 baseline at 8.0 bpw across six model families.

What is new is the specific assembly of those search components and the hardware-oriented HIF output format aimed at LUT decode. The approach clearly targets memory-bound inference on edge or custom hardware that can afford per-layer LUT SRAM.

The soft spot is that the abstract states the full evaluation, including layer promotion and sparse residual correction, appears only in a companion paper. No tables, error bars, or derivation details are visible here, so the central comparison cannot be checked from this text. The stress-test point about unequal search effort on the FP8 baseline also lands: if the baseline used only standard scaling while the proposed method got exhaustive per-layer optimization, the reported gap could partly reflect search effort rather than the 6-bit atom design itself.

The work is incremental on established primitives like NF4 and block scaling, with no closed-form derivation shown. It is aimed at hardware-aware quantization practitioners who already think about LUT decode and per-layer formats. A serious referee could evaluate it if the companion results and baseline details are supplied together, because the method itself is described coherently even if the numbers are not.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Scaled Outer Product (SOP), a per-layer post-training quantization methodology for LLM weights that performs exhaustive search over fixed/dynamic codebook pairs (including NF4, BOF4, DD4), signed per-block scales, activation-weighted cosine selection, and knapsack-based layer promotion with outlier/sparse-residual correction. It proposes a new HIF LUT output format and asserts that the E2M3sUE4M4 FP6 operating point (6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT E4M3 FP8 baseline (8.0 bpw) across six model families at 1.5 bpw lower storage, with full numeric results including layer promotion deferred to a companion paper.

Significance. If the reconstruction-error comparison holds under a matched optimization procedure and generalizes, the result would indicate that block-scaled small-atom formats with per-layer search can outperform standard FP8 at reduced bit-width, offering practical memory and energy savings on LUT-equipped hardware. The hardware-aware elements, including the HIF format and explicit support for per-layer LUT decode, constitute a concrete engineering contribution even if the quantitative claims require additional substantiation.

major comments (2)

[Abstract] Abstract: The primary claim that E2M3sUE4M4 at 6.5 bpw yields lower reconstruction error than E4M3 at 8.0 bpw is stated without any tables, figures, error bars, or derivation steps in the manuscript; the text explicitly notes that the full evaluation (including layer promotion and sparse residual correction) appears only in a companion paper. This renders the central empirical result unverifiable from the present document.
[Abstract] Abstract / methodology description: The comparison is drawn against a 'conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw)' while the proposed method applies per-layer exhaustive search over codebook pairs, signed scales, activation-weighted selection, and knapsack promotion. No statement confirms that the FP8 baseline received equivalent search machinery; unequal optimization effort could therefore account for the reported error reduction rather than an intrinsic property of the 6-bit atom+scale design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that improve clarity and verifiability.

Referee: [Abstract] Abstract: The primary claim that E2M3sUE4M4 at 6.5 bpw yields lower reconstruction error than E4M3 at 8.0 bpw is stated without any tables, figures, error bars, or derivation steps in the manuscript; the text explicitly notes that the full evaluation (including layer promotion and sparse residual correction) appears only in a companion paper. This renders the central empirical result unverifiable from the present document.

Authors: We agree the central claim is not fully verifiable from this manuscript alone because the complete numeric results, tables, and figures reside in the companion paper. We will revise the abstract to explicitly qualify the statement, noting that the reported error comparison and supporting evaluation details appear in the companion paper. This change will prevent readers from expecting standalone verification here.

revision: yes

Referee: [Abstract] Abstract / methodology description: The comparison is drawn against a 'conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw)' while the proposed method applies per-layer exhaustive search over codebook pairs, signed scales, activation-weighted selection, and knapsack promotion. No statement confirms that the FP8 baseline received equivalent search machinery; unequal optimization effort could therefore account for the reported error reduction rather than an intrinsic property of the 6-bit atom+scale design.

Authors: The phrase 'conventional per-layer-POT FP8 baseline' is meant to denote the standard per-layer post-training quantization procedure using E4M3, which does not incorporate the exhaustive codebook-pair search, activation-weighted cosine selection, or knapsack promotion that SOP applies. We will add an explicit clarifying sentence in the abstract (and/or methodology) stating that the FP8 baseline follows the standard per-layer POT approach without the additional SOP-specific search machinery, thereby removing any ambiguity about optimization effort.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical search methodology with direct comparisons

full rationale

The manuscript describes a per-layer exhaustive search procedure over codebook/scale combinations for post-training quantization and reports measured weight reconstruction errors for the resulting FP6 operating point versus a stated FP8 baseline. No mathematical derivation chain, closed-form predictions, or equations are present that reduce to inputs by construction. The central claim rests on empirical evaluation across model families rather than any self-definitional, fitted-prediction, or self-citation load-bearing step. Full numeric details are deferred to a companion paper, but this does not create a circular reduction within the present text. The work is self-contained as an empirical methodology contribution against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The methodology rests on hardware assumptions and empirical search choices whose independent validation is not supplied in the abstract.

free parameters (3)

per-layer codebook pair selection: chosen via search per layer using activation-weighted cosine
signed per-block scales: introduced as part of the representation
E2M3sUE4M4 format parameters: specific exponent and mantissa widths selected for the FP6 point
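
To make the first free parameter concrete, the sketch below scores candidate reconstructions of a weight row by a cosine similarity in which each input channel is weighted by its activation energy. This definition is an assumption made for illustration, since the page does not give the paper's exact formula; the function names and all numbers are hypothetical.

```python
import math


def weighted_cosine(w, q, act):
    """Cosine similarity of w and q with per-channel weights act (e.g. mean squared activation)."""
    num = sum(a * x * y for a, x, y in zip(act, w, q))
    den = math.sqrt(sum(a * x * x for a, x in zip(act, w)) *
                    sum(a * y * y for a, y in zip(act, q)))
    return num / den if den else 0.0


def select_candidate(w, candidates, act):
    """Return (index of the best-scoring candidate, list of all scores)."""
    scores = [weighted_cosine(w, q, act) for q in candidates]
    return max(range(len(candidates)), key=scores.__getitem__), scores


w = [1.0, 0.1, -0.5, 0.2]               # original weights
cand_a = [1.0, 0.0, -0.5, 0.0]          # exact on large weights, zeros the small ones
cand_b = [0.5, 0.1, -0.9, 0.2]          # exact on channels 1 and 3 only
flat = [1.0, 1.0, 1.0, 1.0]             # no activation information
act = [0.01, 4.0, 0.01, 4.0]            # channels 1 and 3 carry most activation energy

print(select_candidate(w, [cand_a, cand_b], flat)[0])
print(select_candidate(w, [cand_a, cand_b], act)[0])
```

With uniform weights the first candidate scores higher; once the activation weights emphasize channels 1 and 3, the second candidate is preferred, which is the effect an activation-aware criterion is meant to produce.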

axioms (2)

domain assumption: target hardware provides efficient per-layer LUT decode (required for the claimed performance and energy benefits)
domain assumption: weight reconstruction error is a sufficient proxy for end-task quality (used to support the near-lossless claim)

invented entities (2)

Scaled Outer Product (SOP) · no independent evidence · purpose: name for the overall quantization procedure (new label for the combined technique)
HIF output format · no independent evidence · purpose: hardware-efficient LUT encoding (proposed new data layout)

Reference graph

Works this paper leans on

19 extracted references · 8 canonical work pages · 2 internal anchors

[1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[2] Patrick Blumenberg, Thomas Graave, and Tim Fingscheidt. Improving block-wise LLM quantization by 4-bit block-wise optimal float (BOF4): Analysis and variations. arXiv preprint arXiv:2505.06653, 2025.

[3] Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025.

[4] Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan, and Song Han. Adaptive block-scaled data types (IF4). arXiv preprint arXiv:2603.28765, 2026.

[5] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Dusan Stosic, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez... Microscaling data formats for deep learning. arXiv, 2023.

[6] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[8] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In International Conference on Learning Representations (ICLR), 2024.

[9] Sinan Doluca and Thomas J. Riordan. Ultra-low supply-voltage static random-access memory (SRAM) with 8-transistor cell with P and N pass gates to same bit lines. U.S. Patent No. 11,170,844 B1, assigned to Aril Computer Corp., 2021. Filed Jul. 7, 2020; granted Nov. 9, 2021.

[10] Vage Egiazarian, Erik Schultheis, Andrei Panferov, Earl Killian, Torsten Hoefler, and Dan Alistarh. Grid games: The power of multiple grids for quantizing large language models. arXiv preprint arXiv:2605.12327, 2026.

[11] Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, and Naigang Wang. Is finer better? The limits of microscaling formats in large language models. arXiv preprint arXiv:2601.19026, 2026.

[12] Allen Gersho and Robert M. Gray. Vector Quantization and Signal Compression, volume 159 of Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, Boston, MA, 1992. ISBN 978-0-7923-9181-4.

[13] Earl Killian. DCD: Dual codebook decode for hardware-aware LLM quantization. In preparation; arXiv preprint forthcoming, 2026a.

[14] Earl Killian. Scaled outer product (SOP): Architecture specification. In preparation; provisional patent application filed May 2026, 2026b.

[15] Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. In International Conference on Machine Learning (ICML), 2024.

[16] Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, and Aditi Raghunathan. Scaling laws for precision. arXiv preprint arXiv:2411.04330, 2024.

[17] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations (ICLR), 2021.

[18] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. In Proceedings of Machine Learning and Systems (MLSys), 2024.

[19] Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V3: Dyadic neural network quantization. arXiv preprint arXiv:2011.10680, 2021.
