XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Thomas Witt

arxiv: 2605.14844 · v1 · pith:3RCX7EXDnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-30 21:25 UTC · model grok-4.3


keywords: LLM quantization · codebook quantization · MoE inference · outlier separation · adaptive bit allocation · cosine similarity · sparse residual


The pith

XFP inverts LLM quantization so the operator sets per-channel cosine similarity floors and the method automatically selects codebook size, outlier budget, and packing without calibration data or Hessian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reconstruction quality can be controlled directly by cosine similarity thresholds rather than bit widths or calibration sets. Each weight matrix is split into a sparse fp16 outlier part and a dense index tensor into a learned per-group codebook, with sizes chosen on the fly to meet the floors. For models that exceed memory limits, the H-Process iterates the two thresholds until the model fits while generation remains coherent. A sympathetic reader would care because this removes manual tuning steps and lets very large MoE models run on fixed hardware budgets. The approach is demonstrated on Qwen3.5 variants up to 397B parameters, showing higher throughput and accuracy than INT4 baselines that rely on pruning.

Core claim

XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Codebook size, outlier budget, and packing are chosen automatically from operator-specified per-channel cosine similarity floors (strict for attention and shared experts, lazy for routed experts). Two storage modes, V2 (per-channel Lloyd) and V2a (a shared library of 32 codebooks per layer), share the same frontend and fused kernel. The H-Process iterates the floors between an OOM boundary and a garbage-generation boundary to fit models into target memory.
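A minimal sketch of the decomposition and the auto-select step, under stated assumptions. It is not the paper's Algorithm 1. The magnitude-based outlier rule, the fixed outlier fraction (the paper chooses the outlier budget automatically), the candidate sizes, the quantile initialization, and every function name here (`split_outliers`, `lloyd_codebook`, `auto_select`) are illustrative choices. One channel is treated as a flat list of weights.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 1.0


def split_outliers(w, frac):
    """Return ({position: value} for the largest-|w| entries, bulk positions)."""
    k = int(len(w) * frac)
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    outliers = {i: w[i] for i in order[:k]}
    bulk = [i for i in range(len(w)) if i not in outliers]
    return outliers, bulk


def lloyd_codebook(values, n, iters=20):
    """1-D Lloyd (k-means) codebook with n entries, quantile-initialized."""
    s = sorted(values)
    book = [s[(2 * j + 1) * len(s) // (2 * n)] for j in range(n)]
    for _ in range(iters):
        sums, counts = [0.0] * n, [0] * n
        for v in values:
            j = min(range(n), key=lambda c: abs(v - book[c]))
            sums[j] += v
            counts[j] += 1
        # Empty cells keep their previous centroid.
        book = [sums[j] / counts[j] if counts[j] else book[j] for j in range(n)]
    return book


def encode(values, book):
    """Dense index tensor: nearest codebook entry for each bulk value."""
    return [min(range(len(book)), key=lambda c: abs(v - book[c])) for v in values]


def reconstruct(length, outliers, bulk, indices, book):
    """Codebook lookup for the bulk, exact stored values for the outliers."""
    out = [0.0] * length
    for pos, idx in zip(bulk, indices):
        out[pos] = book[idx]
    for pos, v in outliers.items():
        out[pos] = v
    return out


def auto_select(w, tau, sizes=(4, 8, 16), outlier_frac=0.01):
    """Smallest candidate codebook size whose reconstruction meets floor tau."""
    outliers, bulk = split_outliers(w, outlier_frac)
    bulk_vals = [w[i] for i in bulk]
    sim = 0.0
    for n in sizes:
        book = lloyd_codebook(bulk_vals, n)
        w_hat = reconstruct(len(w), outliers, bulk, encode(bulk_vals, book), book)
        sim = cosine(w, w_hat)
        if sim >= tau:
            return n, sim
    return None, sim  # no candidate met the floor


# Synthetic weights for illustration only (not data from the paper).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(512)]
weights[7], weights[300] = 25.0, -30.0  # two injected outliers
for name, tau in (("lazy", 0.95), ("strict", 0.999)):
    print(name, auto_select(weights, tau))
```

Because the search walks the candidate sizes in increasing order, a lower floor can never select a larger codebook than a higher floor on the same weights, which is the sense in which the floor alone drives the bit budget in this sketch.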

What carries the argument

Per-channel cosine similarity floor (strict or lazy) that drives automatic selection of codebook size and outlier budget, together with the H-Process iteration over those floors.

If this is right

On Qwen3.5-122B-A10B under V2 the method reaches 138 tok/s single-stream decode at TP=2 with 94.49% GSM8K strict-match, and at TP=1 it is 49% faster than Marlin INT4.
On Qwen3.5-397B-A17B the H-Process fits the full expert population into 2x96 GB at approximately 3.4 effective bits while delivering 100.9 tok/s at 66.72% GSM8K strict-match (single seed at submission).
On the 397B model, the same thresholds and iteration outperform INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
V2 and V2a modes share one auto-select frontend and one fused decode kernel.
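A rough footprint check for the 397B figure, as a sketch only. It assumes that "397B" means about 397e9 stored weights, that weight storage dominates, and that GB means 1e9 bytes; KV cache, activations, and runtime overhead are ignored. `weight_gb` is a hypothetical helper.

```python
def weight_gb(params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 397e9      # assumed total parameter count for Qwen3.5-397B-A17B
BUDGET_GB = 2 * 96  # the 2x96 GB envelope quoted above

for bits in (16.0, 4.0, 3.4):
    gb = weight_gb(PARAMS, bits)
    print(f"{bits:>4} bits -> {gb:6.1f} GB, fits: {gb <= BUDGET_GB}")
```

Under these assumptions, plain 4.0-bit storage comes to about 198.5 GB, just over the 192 GB envelope, while roughly 3.4 effective bits comes to about 168.7 GB. That is consistent with the comparison against INT4 baselines that need routed-expert pruning to fit.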

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could allow operators to adjust quality-memory trade-offs on the fly for a given hardware envelope without re-deriving Hessians.
If cosine similarity continues to track output quality across new model families, the same floors might serve as a portable control signal for other compression schemes.
The absence of calibration data requirements could simplify deployment pipelines for models that change frequently or must run on air-gapped systems.

Load-bearing premise

That operator-specified per-channel cosine similarity floors plus the H-Process iteration are sufficient to guarantee sensible generation output without any calibration data or post-selection verification.

What would settle it

Apply the same cosine floors to a new model or benchmark set and check whether accuracy falls below the reported GSM8K levels or generation produces incoherent text before the stated thresholds are reached.

Figures

Figures reproduced from arXiv: 2605.14844 by Thomas Witt.

**Figure 1.** XFP pipeline overview. The operator specifies a quality floor τ; XFP determines everything else. Outlier extraction separates high-magnitude weights into a sparse fp16 residual. Lloyd iteration learns a per-layer codebook on the cleaned bulk distribution. Auto-select (Algorithm 1) tests candidate codebook sizes and picks the minimum N meeting τ. The fused decode kernel reconstructs weights at inference v…

**Figure 2.** Single-stream decode throughput on Qwen3.5-122B-A10B, RTX PRO 6000 Blackwell (SM120), 1,500-token output. At identical TP=1 single-stream (the regime this work targets), XFP is +49% faster than Marlin INT4 (AutoRound); TP=2 extends this to +87%. Both XFP and Marlin are memory-bandwidth-bound at M = 1; XFP reads ~3.97 effective bits per weight versus Marlin's 4.0. Concurrent / batched serving is out of scop…

**Figure 3.** XFP vs. Marlin INT4 on Qwen3.5-122B-A10B, RTX PRO 6000 Blackwell. Bars (left): single-stream tok/s. Markers (right): GSM8K strict-match (3 seeds, mean ± std). At TP=1, XFP is 49% faster at −0.65 pp accuracy (within seed variance).

6.3 Front B: The H-Process - Constrained Compression on a 397B Model. Qwen3.5-397B-A17B is a hybrid linear-/self-attention MoE with 512 routed experts per layer, 60 layers, and …

Original abstract

We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
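The H-Process can be pictured as a search over the two floors bounded by the two checks the abstract names. The sketch below is an assumption-heavy illustration, not the paper's procedure: the abstract does not give the iteration schedule, so this version simply walks operator-supplied grids from highest to lowest quality. `h_process`, `fits`, and `coherent` are hypothetical names, with `fits` standing in for the quantize-on-load OOM check and `coherent` for a generation benchmark.

```python
def h_process(strict_grid, lazy_grid, fits, coherent):
    """Return the first (strict, lazy) pair that fits and stays coherent.

    strict_grid, lazy_grid: candidate floors, ordered highest quality first.
    fits(strict, lazy)     -> bool, stand-in for the OOM boundary.
    coherent(strict, lazy) -> bool, stand-in for the garbage boundary.
    Returns None when no candidate pair passes both checks.
    """
    for strict in strict_grid:
        for lazy in lazy_grid:
            if not fits(strict, lazy):
                continue  # would run out of memory: relax the floors further
            if coherent(strict, lazy):
                return strict, lazy  # cosine steered here; the bench verified
    return None


# Toy stand-ins, purely illustrative (not measurements from the paper).
toy_fits = lambda s, l: s + l <= 1.90
toy_coherent = lambda s, l: l >= 0.90
print(h_process([0.999, 0.99], [0.98, 0.95, 0.92, 0.90], toy_fits, toy_coherent))
```

Trying the strict floor in the outer loop reflects one possible priority, keeping attention and shared experts at high quality while the routed-expert floor absorbs most of the compression. Other orderings are equally plausible.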

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

XFP's cosine-floor auto-quantization for MoE models reports solid speed and accuracy numbers on large Qwen variants but leaves the quality-proxy link unexamined.


The key point is that this paper inverts the usual quantization flow: the user sets two per-channel cosine similarity floors (strict for attention/shared experts, lazy for routed MoE), and the method then picks codebook size, outlier budget, and packing automatically with no calibration data or Hessian. It also adds a sparse fp16 outlier residual plus two storage modes (per-channel Lloyd and shared-library codebooks) under one frontend and kernel. On the reported Qwen3.5-122B-A10B it hits 138 tok/s at 94.49% GSM8K, and the H-Process squeezes the 397B-A17B into 2x96 GB at ~3.4 bits while delivering 100.9 tok/s at 66.72% GSM8K, beating pruned INT4 on all three axes.

The engineering is practical and the numbers are concrete, which is the main strength. The H-Process iteration over the two thresholds to meet memory and output constraints is a clear workflow contribution, and the shared V2a mode looks like a useful storage option.

The soft spot is the missing link between the cosine floors and downstream generation quality. The abstract states that cosine similarity steers while benches verify, but supplies no correlation study, ablation on alternative metrics, or error-propagation check. Without that, the claim that operator-set floors plus iteration suffice for sensible output without calibration data rests on an unshown premise. The circularity burden noted in the stress test is real on the evidence given.

This is for inference engineers who need to run very large MoE models on limited hardware and are willing to tune the two cosine thresholds. A reader focused on practical deployment numbers will find the reported throughputs and accuracies useful even if the method internals need more scrutiny.

It deserves peer review because the claims are specific enough to test and the target hardware setting matters. The full paper would need to show the cosine-to-quality validation to carry the central argument.

Referee Report

2 major / 1 minor

Summary. The paper introduces XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow by having the operator specify per-channel cosine similarity reconstruction quality floors (one strict for attention/shared experts, one lazy for routed experts in MoE); the method then automatically determines codebook size, outlier budget, and packing per layer with no Hessian, no calibration data, and no manual bit-width selection. Each weight matrix is decomposed into a sparse FP16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook, supporting two storage modes (V2 per-channel Lloyd and V2a shared library of L=32 codebooks) that share an auto-select frontend and fused decode kernel. For models exceeding memory limits, the H-Process performs a quality-driven iteration over the two cosine thresholds subject to OOM and garbage boundaries (with 'cosine similarity steers; benches verify') to find a fitting operating point. The abstract reports concrete results on Qwen3.5-122B-A10B (138 tok/s at 94.49% GSM8K) and Qwen3.5-397B-A17B (~3.4 effective bits, 100.9 tok/s at 66.72% GSM8K), claiming simultaneous gains over INT4 with routed-expert pruning on memory, throughput, and accuracy.

Significance. If the central claims hold, the work would be significant for simplifying deployment of very large MoE models by providing an automatic, calibration-free quantization path that targets memory envelopes while preserving generation quality. The H-Process and dual storage modes with fused kernels address practical constraints for models like the 397B variant on 2x96 GB hardware, and the reported throughput/accuracy numbers suggest potential advantages over existing INT4 baselines. Strengths include the parameter-free frontend once floors are set and the explicit handling of routed experts. However, the significance depends on resolving whether cosine floors serve as a reliable proxy without hidden data dependence.

major comments (2)

[Abstract] Abstract (H-Process paragraph): The central claim that XFP requires 'no calibration data' and operates automatically is load-bearing, yet the H-Process is described as locating the operating point via iteration where 'cosine similarity steers; benches verify' the garbage boundary. This indicates that threshold selection and validation rely on running generation benchmarks, which is a form of data-dependent post-hoc verification and directly contradicts the no-calibration assertion.
[Abstract] Abstract: No correlation study, ablation, or error-propagation analysis is referenced showing that operator-specified per-channel cosine similarity floors track downstream token-level quality metrics (e.g., perplexity or GSM8K accuracy) across dense and MoE layers. Without this, the premise that the floors plus H-Process iteration suffice to guarantee 'sensible output' independently of benchmarks is unsupported, placing the automatic and calibration-free properties at risk.

minor comments (1)

[Abstract] Abstract: The reported numbers (e.g., 100.9 tok/s, 66.72% GSM8K on 1319-problem set) would benefit from explicit statement of the exact hardware (beyond 'workstation hardware'), number of seeds, and precise INT4 baseline configuration for immediate assessment of the simultaneous gains claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the calibration-free claims. The points identify areas where the presentation could be tightened to better separate the core algorithm from the optional H-Process. We respond point-by-point below and will make the indicated revisions.


Referee: [Abstract] Abstract (H-Process paragraph): The central claim that XFP requires 'no calibration data' and operates automatically is load-bearing, yet the H-Process is described as locating the operating point via iteration where 'cosine similarity steers; benches verify' the garbage boundary. This indicates that threshold selection and validation rely on running generation benchmarks, which is a form of data-dependent post-hoc verification and directly contradicts the no-calibration assertion.

Authors: We agree the abstract wording risks conflating two distinct stages. The core XFP quantization procedure is strictly calibration-free: given only the operator-specified per-channel cosine floors, it automatically selects codebook size, outlier budget, and packing without any data, Hessian, or benchmark runs. The H-Process is an optional outer loop invoked solely when the model exceeds the target memory envelope; its benchmark verification step is used only to locate the garbage boundary for that specific deployment scenario. We will revise the abstract to explicitly separate the no-calibration quantization algorithm from the optional H-Process verification, and we will add a clarifying sentence in the H-Process section stating that benchmark checks are a practical safeguard rather than part of the quantization itself.

Revision: yes

Referee: [Abstract] Abstract: No correlation study, ablation, or error-propagation analysis is referenced showing that operator-specified per-channel cosine similarity floors track downstream token-level quality metrics (e.g., perplexity or GSM8K accuracy) across dense and MoE layers. Without this, the premise that the floors plus H-Process iteration suffice to guarantee 'sensible output' independently of benchmarks is unsupported, placing the automatic and calibration-free properties at risk.

Authors: We acknowledge that the manuscript does not contain a dedicated correlation or ablation study linking the cosine floors to token-level metrics. The design choice rests on cosine similarity being a standard, layer-wise reconstruction metric in vector quantization, and the reported end-to-end results on Qwen3.5 models show that the selected floors preserve competitive accuracy. To strengthen the justification, we will add a new subsection (or appendix) that reports the observed correlation between per-channel cosine similarity and downstream metrics (perplexity and task accuracy) on representative attention, shared-expert, and routed-expert layers from both dense and MoE models. This addition will provide the requested empirical grounding without altering the core claims.

Revision: yes
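One way the promised correlation check could be set up is a rank correlation between the cosine floor used for each run and the benchmark score that run obtained. The sketch below is an assumption about how such a study might look, not something the paper or the rebuttal specifies; `ranks` and `spearman` are hypothetical helpers, and no measurements are included.

```python
import math


def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

A rank statistic is used here because the relationship between weight-space cosine similarity and task accuracy need not be linear. A value near 1 across model families would support the proxy assumption, and a weak or unstable value would undercut it.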

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents XFP as a procedural method that takes operator-specified per-channel cosine similarity floors as explicit inputs and derives codebook size, outlier budget, and packing from them via decomposition into sparse fp16 residuals and dense sub-byte index tensors. The H-Process is described as an iteration over those same operator-set thresholds to satisfy memory/OOM constraints while meeting a garbage boundary verified externally by benchmarks. No equation or step equates the derived parameters to the input floors by construction, no parameter is fitted on data and then relabeled as a prediction, and no self-citation chain or uniqueness theorem is invoked to justify the core choices. The derivation remains self-contained as an input-driven search procedure whose outputs are not definitionally identical to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; ledger reflects only claims visible in the provided text. Learned codebooks imply internal fitting parameters whose count and fitting procedure are not stated.

free parameters (1)

per-channel cosine similarity floors
Operator-specified strict and lazy thresholds that drive all downstream auto-selection of codebook size and outlier budget.

axioms (1)

Domain assumption: cosine similarity between original and reconstructed weights is a sufficient proxy for downstream generation quality.
Used to define both the input floors and the 'sensible output' boundary in the H-Process.
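Stated formally, as a sketch: only τ comes from the paper (Figure 1 caption). The symbols w_c (original weights of channel c), s_c (sparse fp16 outlier residual), q_c (codebook reconstruction of the bulk), and g(c) (the floor class assigned to channel c) are notation assumed here for illustration.

```latex
\hat{w}_c = s_c + q_c,
\qquad
\cos(w_c, \hat{w}_c)
  = \frac{\langle w_c, \hat{w}_c \rangle}
         {\lVert w_c \rVert_2 \, \lVert \hat{w}_c \rVert_2}
  \;\ge\; \tau_{g(c)},
\qquad
g(c) \in \{\text{strict}, \text{lazy}\}
```

The axiom is then the claim that meeting this weight-space inequality on every channel is enough to keep generation sensible. The inequality constrains reconstructed weights only and says nothing directly about activations or error accumulation across layers.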

pith-pipeline@v0.9.1-grok · 5907 in / 1536 out tokens · 30878 ms · 2026-06-30T21:25:57.555429+00:00 · methodology


Reference graph

Works this paper leans on

21 extracted references · 6 canonical work pages

[1] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. A framework for few-shot language model evaluation. https://github.com/EleutherAI/lm-evaluat..., 2024.

[2] J. Chee, Y. Cai, V. Kuleshov, and C. De Sa. QuIP: 2-Bit Quantization of Large Language Models with Guarantees. arXiv:2307.13304, 2023.

[3] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In NeurIPS, 2022.

[4] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv:2306.03078, 2023.

[5] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In NeurIPS, 2023.

[6] V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh. AQLM: Extreme Compression of Large Language Models via Additive Quantization. In ICML, 2024.

[7] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In ICLR, 2023.

[8] W. Cheng, W. Lu, et al. Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. arXiv:2309.05516, 2024.

[9] E. Frantar and D. Alistarh. Marlin: Near-Ideal 4-Bit LLM Inference on NVIDIA GPUs. GitHub, IST Austria, 2024.

[10] B. Marie. NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs. The Kaitchup - AI on a Budget, 2025.

[11] NVIDIA. TensorRT Model Optimizer: Post-Training Quantization for LLMs. https://github.com/NVIDIA/TensorRT-Model-Optimizer, 2024.

[12] G. Gerganov et al. llama.cpp. https://github.com/ggerganov/llama.cpp, 2023.

[13] M. Lasby et al. REAP: Reaping Experts for Activation-Aware Pruning of MoE Models. arXiv preprint, 2024.

[14] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. Mahoney, and K. Keutzer. SqueezeLLM: Dense-and-Sparse Quantization. arXiv:2306.07629, 2023.

[15] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.

[16] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In MLSys, 2024.

[17] S.P. Lloyd. Least Squares Quantization in PCM. IEEE Trans. Information Theory, 28(2):129-137, 1982.

[18] NVIDIA. NVFP4 Tensor Core Quantization: Blackwell Architecture Whitepaper. 2024.

[19] Open Compute Project. OCP Microscaling (MX) Specification, v1.0. 2023.

[20] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. arXiv:2402.04396, 2024.

[21] A. Tseng, Q. Sun, D. Yin, C. De Sa, and V. Kuleshov. QTIP: Quantization with Trellises and Incoherence Processing. arXiv:2406.11235, 2024.
