arxiv: 2605.14844 · v1 · pith:3RCX7EXDnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Pith reviewed 2026-06-30 21:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM quantization · codebook quantization · MoE inference · outlier separation · adaptive bit allocation · cosine similarity · sparse residual

The pith

XFP inverts LLM quantization: the operator sets per-channel cosine similarity floors, and the method automatically selects codebook size, outlier budget, and packing without calibration data or a Hessian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reconstruction quality can be controlled directly by cosine similarity thresholds rather than by bit widths or calibration sets. Each weight matrix is split into a sparse fp16 outlier part and a dense index tensor into a learned per-group codebook, with sizes chosen on the fly to meet the floors. For models that exceed memory limits, the H-Process iterates the two thresholds until the model fits while generation remains coherent. A sympathetic reader would care because this removes manual tuning steps and lets very large MoE models run on fixed hardware budgets. The approach is demonstrated on Qwen3.5 variants up to 397B parameters: on the 122B model it is reported as faster than Marlin INT4 at comparable accuracy, and on the 397B model it is reported to beat INT4 with routed-expert pruning on memory, throughput, and accuracy.

Core claim

XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook; codebook size, outlier budget, and packing are chosen automatically from operator-specified per-channel cosine similarity floors (strict for attention and shared experts, lazy for routed experts); two storage modes (V2 per-channel Lloyd, V2a shared library of 32 codebooks) share the same frontend and fused kernel; the H-Process iterates the floors inside an OOM boundary and a garbage-generation boundary to fit models into target memory.
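The decomposition in the core claim can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the quantile initialisation, the top-magnitude outlier rule, the fixed 16-entry codebook, and the 1% outlier fraction are all assumptions made for illustration, and indices are kept in `uint8` rather than packed into sub-byte storage.

```python
import numpy as np

def lloyd_1d(values, n_codes, iters=20):
    """Plain 1-D Lloyd (k-means) iteration. Quantile init is an assumption."""
    codebook = np.quantile(values, (np.arange(n_codes) + 0.5) / n_codes).astype(np.float64)
    for _ in range(iters):
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_codes):
            members = values[idx == k]
            if members.size:                      # leave empty cells where they are
                codebook[k] = members.mean()
    return codebook

def decompose_channel(w, n_codes=16, outlier_frac=0.01):
    """Split one channel into codebook indices plus a sparse fp16 residual."""
    n_out = max(1, int(round(outlier_frac * w.size)))
    out_pos = np.sort(np.argsort(np.abs(w))[-n_out:])   # largest magnitudes (assumed rule)
    bulk_mask = np.ones(w.size, dtype=bool)
    bulk_mask[out_pos] = False
    # Learn the codebook on the bulk only, so outliers do not stretch it.
    codebook = lloyd_1d(w[bulk_mask], n_codes).astype(np.float16)
    cb32 = codebook.astype(np.float32)
    indices = np.abs(w[:, None] - cb32[None, :]).argmin(axis=1).astype(np.uint8)
    # Residual: what the codebook lookup misses at the outlier positions.
    residual = (w[out_pos] - cb32[indices[out_pos]]).astype(np.float16)
    return codebook, indices, out_pos, residual

def reconstruct(codebook, indices, out_pos, residual):
    """Dense codebook lookup followed by a sparse residual add."""
    w_hat = codebook.astype(np.float32)[indices]
    w_hat[out_pos] += residual.astype(np.float32)
    return w_hat

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy channel: Gaussian bulk with a few inflated weights.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 4096).astype(np.float32)
w[rng.choice(4096, 8, replace=False)] *= 20
codebook, indices, out_pos, residual = decompose_channel(w)
cos = cosine(w, reconstruct(codebook, indices, out_pos, residual))
```

Here `cos` is the per-channel quantity that a floor would be compared against. The sketch fixes the codebook size and outlier budget by hand, whereas XFP is described as choosing both from the floor.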

What carries the argument

Per-channel cosine similarity floor (strict or lazy) that drives automatic selection of codebook size and outlier budget, together with the H-Process iteration over those floors.
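A rough sketch of how a cosine floor could drive the selection follows. The candidate codebook sizes, the candidate outlier fractions, the search order (smallest codebook first, then smallest outlier budget), and every function name are assumptions. The only element taken from the source is the idea of picking the minimum codebook size that meets the floor.

```python
import numpy as np

def lloyd_quantize(x, n_codes, iters=15):
    """1-D Lloyd iteration; returns x snapped to its learned codebook."""
    cb = np.quantile(x, (np.arange(n_codes) + 0.5) / n_codes)  # init is an assumption
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
        counts = np.bincount(idx, minlength=n_codes)
        sums = np.bincount(idx, weights=x, minlength=n_codes)
        cb = np.where(counts > 0, sums / np.maximum(counts, 1), cb)
    return cb[np.abs(x[:, None] - cb[None, :]).argmin(axis=1)]

def reconstruct(w, n_codes, outlier_frac):
    """Quantize the bulk with a codebook; keep the largest weights in fp16."""
    n_out = int(round(outlier_frac * w.size))
    order = np.argsort(np.abs(w))
    bulk, out = order[: w.size - n_out], order[w.size - n_out:]
    w_hat = np.empty_like(w)
    w_hat[bulk] = lloyd_quantize(w[bulk], n_codes)
    w_hat[out] = w[out].astype(np.float16)
    return w_hat

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def auto_select(w, floor,
                sizes=(2, 4, 8, 16, 32, 64, 128, 256),      # hypothetical candidates
                outlier_fracs=(0.0, 0.005, 0.01, 0.02)):    # hypothetical budgets
    """Return the first (codebook size, outlier fraction) that meets the floor."""
    for n_codes in sizes:
        for frac in outlier_fracs:
            cos = cosine(w, reconstruct(w, n_codes, frac))
            if cos >= floor:
                return n_codes, frac, cos
    raise ValueError("no candidate meets the floor")

rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=2048) * 0.02   # heavy-tailed toy channel
strict = auto_select(w, floor=0.999)          # e.g. attention / shared experts
lazy = auto_select(w, floor=0.99)             # e.g. routed experts
```

Because every configuration that satisfies the strict floor also satisfies the lazy one, the lazy floor can never select a larger codebook than the strict floor under this search order.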

If this is right

  • On Qwen3.5-122B-A10B under V2 the method reaches 138 tok/s single-stream decode (TP=2) at 94.49% GSM8K strict-match, and is 49% faster than Marlin INT4 at TP=1.
  • On Qwen3.5-397B-A17B the H-Process fits the full expert population into 2x96 GB at approximately 3.4 effective bits while delivering 100.9 tok/s at 66.72% GSM8K.
  • On the 397B model, the operating point found by the H-Process exceeds INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
  • V2 and V2a modes share one auto-select frontend and one fused decode kernel.
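The memory claim for the 397B model can be sanity-checked with back-of-the-envelope arithmetic. The sketch below makes strong simplifying assumptions: all 397B parameters are stored at the quoted average rate, 1 GB means 10^9 bytes, and KV cache, activations, and runtime overhead are ignored. The helper name is hypothetical.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes at a given average bit rate."""
    return n_params * bits_per_weight / 8 / 1e9

budget_gb = 2 * 96.0                 # two 96 GB devices, as in the source
xfp_gb = weight_gb(397e9, 3.4)       # ~3.4 effective bits per weight
int4_gb = weight_gb(397e9, 4.0)      # a flat 4.0 bits per weight, no pruning

headroom_gb = budget_gb - xfp_gb     # what is left for everything else
```

Under these assumptions the weights come to roughly 168.7 GB at 3.4 bits and 198.5 GB at 4.0 bits, which is consistent with the paper comparing against an INT4 baseline that prunes routed experts to fit.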

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could allow operators to adjust quality-memory trade-offs on the fly for a given hardware envelope without re-deriving Hessians.
  • If cosine similarity continues to track output quality across new model families, the same floors might serve as a portable control signal for other compression schemes.
  • The absence of calibration data requirements could simplify deployment pipelines for models that change frequently or must run on air-gapped systems.

Load-bearing premise

That operator-specified per-channel cosine similarity floors plus the H-Process iteration are sufficient to guarantee sensible generation output without any calibration data or post-selection verification.

What would settle it

Apply the same cosine floors to a new model or benchmark set and check whether accuracy falls below the reported GSM8K levels or generation produces incoherent text before the stated thresholds are reached.

Figures

Figures reproduced from arXiv: 2605.14844 by Thomas Witt.

Figure 1. XFP pipeline overview. The operator specifies a quality floor τ; XFP determines everything else. Outlier extraction separates high-magnitude weights into a sparse fp16 residual. Lloyd iteration learns a per-layer codebook on the cleaned bulk distribution. Auto-select (Algorithm 1) tests candidate codebook sizes and picks the minimum N meeting τ. The fused decode kernel reconstructs weights at inference v…
Figure 2. Single-stream decode throughput on Qwen3.5-122B-A10B, RTX PRO 6000 Blackwell (SM120), 1,500-token output. At identical TP=1 single-stream (the regime this work targets), XFP is 49% faster than Marlin INT4 (AutoRound); TP=2 extends this to 87%. Both XFP and Marlin are memory-bandwidth-bound at M = 1; XFP reads ∼3.97 effective bits per weight versus Marlin's 4.0. Concurrent / batched serving is out of scop…
Figure 3. XFP vs. Marlin INT4 on Qwen3.5-122B-A10B, RTX PRO 6000 Blackwell. Bars (left): single-stream tok/s. Markers (right): GSM8K strict-match (3 seeds, mean ± std). At TP=1, XFP is 49% faster at −0.65 pp accuracy (within seed variance).
Section 6.3, Front B: The H-Process, Constrained Compression on a 397B Model. Qwen3.5-397B-A17B is a hybrid linear-/self-attention MoE with 512 routed experts per layer, 60 layers, and …
Original abstract

We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
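One way to picture the H-Process is as a search over the two floors bounded by the three constraints the abstract names. The sketch below is an assumption-heavy illustration rather than the paper's procedure: the step size, the lower bounds, the order in which floors are relaxed (lazy first, then strict), and the two callbacks are all hypothetical. `fits` stands in for attempting quantize-on-load and observing whether it runs out of memory, and `sensible` stands in for a benchmark check against the garbage boundary.

```python
from typing import Callable, Optional, Tuple

Floors = Tuple[float, float]  # (strict floor, lazy floor)

def h_process(strict0: float, lazy0: float,
              fits: Callable[[float, float], bool],
              sensible: Callable[[float, float], bool],
              step: float = 0.01,
              strict_min: float = 0.97,
              lazy_min: float = 0.90) -> Optional[Floors]:
    """Relax the floors from the operator's starting point until the model
    fits, then accept the point only if generation is still sensible."""
    n_strict = int(round((strict0 - strict_min) / step)) + 1
    n_lazy = int(round((lazy0 - lazy_min) / step)) + 1
    for i in range(n_strict):                 # outer loop: relax the strict floor last
        s = round(strict0 - i * step, 6)
        for j in range(n_lazy):               # inner loop: relax the lazy floor first
            l = round(lazy0 - j * step, 6)
            if not fits(s, l):
                continue                      # still on the OOM side of the boundary
            if sensible(s, l):
                return s, l                   # highest lazy floor that fits for this s
            break                             # fits but garbage: lower l cannot help
    return None

# Toy stand-ins (entirely made up) for the memory model and the benchmark check.
toy_mem_gb = lambda s, l: 150 + 200 * (s - 0.95) + 1000 * (l - 0.90)
toy_fits = lambda s, l: toy_mem_gb(s, l) <= 192.0
toy_sensible = lambda s, l: s >= 0.98 and l >= 0.92

point = h_process(0.999, 0.99, toy_fits, toy_sensible)
no_point = h_process(0.999, 0.99, toy_fits, lambda s, l: False)
```

In the toy setup the search keeps the strict floor at its starting value and lowers only the lazy floor until the memory model fits. If no point both fits and passes the check, the function returns `None` instead of silently accepting a degraded model.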

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow by having the operator specify per-channel cosine similarity reconstruction quality floors (one strict for attention/shared experts, one lazy for routed experts in MoE); the method then automatically determines codebook size, outlier budget, and packing per layer with no Hessian, no calibration data, and no manual bit-width selection. Each weight matrix is decomposed into a sparse FP16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook, supporting two storage modes (V2 per-channel Lloyd and V2a shared library of L=32 codebooks) that share an auto-select frontend and fused decode kernel. For models exceeding memory limits, the H-Process performs a quality-driven iteration over the two cosine thresholds subject to OOM and garbage boundaries (with 'cosine similarity steers; benches verify') to find a fitting operating point. The abstract reports concrete results on Qwen3.5-122B-A10B (138 tok/s at 94.49% GSM8K) and Qwen3.5-397B-A17B (~3.4 effective bits, 100.9 tok/s at 66.72% GSM8K), claiming simultaneous gains over INT4 with routed-expert pruning on memory, throughput, and accuracy.

Significance. If the central claims hold, the work would be significant for simplifying deployment of very large MoE models by providing an automatic, calibration-free quantization path that targets memory envelopes while preserving generation quality. The H-Process and dual storage modes with fused kernels address practical constraints for models like the 397B variant on 2x96 GB hardware, and the reported throughput/accuracy numbers suggest potential advantages over existing INT4 baselines. Strengths include the parameter-free frontend once floors are set and the explicit handling of routed experts. However, the significance depends on resolving whether cosine floors serve as a reliable proxy without hidden data dependence.

major comments (2)
  1. [Abstract] Abstract (H-Process paragraph): The central claim that XFP requires 'no calibration data' and operates automatically is load-bearing, yet the H-Process is described as locating the operating point via iteration where 'cosine similarity steers; benches verify' the garbage boundary. This indicates that threshold selection and validation rely on running generation benchmarks, which is a form of data-dependent post-hoc verification and directly contradicts the no-calibration assertion.
  2. [Abstract] Abstract: No correlation study, ablation, or error-propagation analysis is referenced showing that operator-specified per-channel cosine similarity floors track downstream token-level quality metrics (e.g., perplexity or GSM8K accuracy) across dense and MoE layers. Without this, the premise that the floors plus H-Process iteration suffice to guarantee 'sensible output' independently of benchmarks is unsupported, placing the automatic and calibration-free properties at risk.
minor comments (1)
  1. [Abstract] Abstract: The reported numbers (e.g., 100.9 tok/s, 66.72% GSM8K on 1319-problem set) would benefit from explicit statement of the exact hardware (beyond 'workstation hardware'), number of seeds, and precise INT4 baseline configuration for immediate assessment of the simultaneous gains claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the calibration-free claims. The points identify areas where the presentation could be tightened to better separate the core algorithm from the optional H-Process. We respond point-by-point below and will make the indicated revisions.

  1. Referee: [Abstract] Abstract (H-Process paragraph): The central claim that XFP requires 'no calibration data' and operates automatically is load-bearing, yet the H-Process is described as locating the operating point via iteration where 'cosine similarity steers; benches verify' the garbage boundary. This indicates that threshold selection and validation rely on running generation benchmarks, which is a form of data-dependent post-hoc verification and directly contradicts the no-calibration assertion.

    Authors: We agree the abstract wording risks conflating two distinct stages. The core XFP quantization procedure is strictly calibration-free: given only the operator-specified per-channel cosine floors, it automatically selects codebook size, outlier budget, and packing without any data, Hessian, or benchmark runs. The H-Process is an optional outer loop invoked solely when the model exceeds the target memory envelope; its benchmark verification step is used only to locate the garbage boundary for that specific deployment scenario. We will revise the abstract to explicitly separate the no-calibration quantization algorithm from the optional H-Process verification, and we will add a clarifying sentence in the H-Process section stating that benchmark checks are a practical safeguard rather than part of the quantization itself. revision: yes

  2. Referee: [Abstract] Abstract: No correlation study, ablation, or error-propagation analysis is referenced showing that operator-specified per-channel cosine similarity floors track downstream token-level quality metrics (e.g., perplexity or GSM8K accuracy) across dense and MoE layers. Without this, the premise that the floors plus H-Process iteration suffice to guarantee 'sensible output' independently of benchmarks is unsupported, placing the automatic and calibration-free properties at risk.

    Authors: We acknowledge that the manuscript does not contain a dedicated correlation or ablation study linking the cosine floors to token-level metrics. The design choice rests on cosine similarity being a standard, layer-wise reconstruction metric in vector quantization, and the reported end-to-end results on Qwen3.5 models show that the selected floors preserve competitive accuracy. To strengthen the justification, we will add a new subsection (or appendix) that reports the observed correlation between per-channel cosine similarity and downstream metrics (perplexity and task accuracy) on representative attention, shared-expert, and routed-expert layers from both dense and MoE models. This addition will provide the requested empirical grounding without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents XFP as a procedural method that takes operator-specified per-channel cosine similarity floors as explicit inputs and derives codebook size, outlier budget, and packing from them via decomposition into sparse fp16 residuals and dense sub-byte index tensors. The H-Process is described as an iteration over those same operator-set thresholds to satisfy memory/OOM constraints while meeting a garbage boundary verified externally by benchmarks. No equation or step equates the derived parameters to the input floors by construction, no parameter is fitted on data and then relabeled as a prediction, and no self-citation chain or uniqueness theorem is invoked to justify the core choices. The derivation remains self-contained as an input-driven search procedure whose outputs are not definitionally identical to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; ledger reflects only claims visible in the provided text. Learned codebooks imply internal fitting parameters whose count and fitting procedure are not stated.

free parameters (1)
  • per-channel cosine similarity floors
    Operator-specified strict and lazy thresholds that drive all downstream auto-selection of codebook size and outlier budget.
axioms (1)
  • domain assumption: Cosine similarity between original and reconstructed weights is a sufficient proxy for downstream generation quality
    Used to define both the input floors and the 'sensible output' boundary in the H-Process.
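For reference, the quantity this ledger refers to can be written out explicitly. The notation is an assumption, since only τ appears in the source (Figure 1): $w_c$ is the original weight vector of channel $c$, $\hat{w}_c$ its reconstruction, $C$ the codebook, $I$ the index tensor, and $S$ the sparse outlier residual.

```latex
\hat{W} = C[I] + S,
\qquad
\cos(w_c, \hat{w}_c)
  = \frac{\langle w_c, \hat{w}_c \rangle}
         {\lVert w_c \rVert_2 \, \lVert \hat{w}_c \rVert_2}
  \;\ge\; \tau,
\qquad
\tau \in \{\tau_{\text{strict}}, \tau_{\text{lazy}}\}
```

The free parameter is the pair $(\tau_{\text{strict}}, \tau_{\text{lazy}})$. The axiom is that satisfying this inequality on every channel is enough to keep generated text sensible.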

pith-pipeline@v0.9.1-grok · 5907 in / 1536 out tokens · 30878 ms · 2026-06-30T21:25:57.555429+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 6 canonical work pages

  1. L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. A framework for few-shot language model evaluation. https://github.com/EleutherAI/lm-evaluat...
  2. J. Chee, Y. Cai, V. Kuleshov, and C. De Sa. QuIP: 2-Bit Quantization of Large Language Models with Guarantees. arXiv:2307.13304, 2023.
  3. T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In NeurIPS, 2022.
  4. T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv:2306.03078, 2023.
  5. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In NeurIPS, 2023.
  6. V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh. AQLM: Extreme Compression of Large Language Models via Additive Quantization. In ICML, 2024.
  7. E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In ICLR, 2023.
  8. W. Cheng, W. Lu, et al. Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. arXiv:2309.05516, 2024.
  9. E. Frantar and D. Alistarh. Marlin: Near-Ideal 4-Bit LLM Inference on NVIDIA GPUs. GitHub, IST Austria, 2024.
  10. B. Marie. NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs. The Kaitchup, AI on a Budget, 2025.
  11. NVIDIA. TensorRT Model Optimizer: Post-Training Quantization for LLMs. https://github.com/NVIDIA/TensorRT-Model-Optimizer, 2024.
  12. G. Gerganov et al. llama.cpp. https://github.com/ggerganov/llama.cpp, 2023.
  13. M. Lasby et al. REAP: Reaping Experts for Activation-Aware Pruning of MoE Models. arXiv preprint, 2024.
  14. S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. Mahoney, and K. Keutzer. SqueezeLLM: Dense-and-Sparse Quantization. arXiv:2306.07629, 2023.
  15. W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
  16. J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In MLSys, 2024.
  17. S. P. Lloyd. Least Squares Quantization in PCM. IEEE Trans. Information Theory, 28(2):129-137, 1982.
  18. NVIDIA. NVFP4 Tensor Core Quantization: Blackwell Architecture Whitepaper. 2024.
  19. Open Compute Project. OCP Microscaling (MX) Specification, v1.0. 2023.
  20. A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. arXiv:2402.04396, 2024.
  21. A. Tseng, Q. Sun, D. Yin, C. De Sa, and V. Kuleshov. QTIP: Quantization with Trellises and Incoherence Processing. arXiv:2406.11235, 2024.