QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. arXiv preprint arXiv:2402.04396

2024 · arXiv 2402.04396

25 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

cs.LG · 2025-08-06 · unverdicted · novelty 8.0

Derives non-asymptotic 2-norm and infinity-norm error bounds for deterministic and stochastic variants of OPTQ and Qronos PTQ algorithms.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

LiftQuant enables continuous bit-width LLM quantization via dimensional lifting and projection from a 1-bit lattice, allowing 2.4-bit compression of 70B models that outperforms fixed 2-bit baselines on identical hardware.

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

XFP introduces quality-targeted adaptive codebook quantization with sparse outlier separation that auto-selects parameters from cosine similarity floors, achieving high throughput and accuracy on Qwen3.5 models at low effective bits without calibration data.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Theory-optimal Quantization Based on Flatness

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.

BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

cs.CL · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.

Rethinking Residual Errors in Compensation-based LLM Quantization

cs.LG · 2026-04-09 · conditional · novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

cs.LG · 2025-05-24 · conditional · novelty 6.0

BTC-LLM uses a binary codebook for pattern clustering and a learnable transformation to achieve 0.7-1.11 bit LLM quantization while limiting accuracy loss to a few percent on LLaMA and Qwen models.

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

cs.AR · 2026-05-22 · unverdicted · novelty 5.0

EVA is a vector-quantization hardware architecture that transforms LLM decoding from GEMV to GEMM via direct codebook dot products and conflict-free output buffering, claiming up to 11.17x speedup over prior lookup designs.

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.

High-Rate Quantized Matrix Multiplication II

cs.LG · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

With known covariance, waterfilling improves GPTQ and WaterSIC reaches within 0.25 bit/entry of the rate-distortion limit while being basis-independent.

High-Rate Quantized Matrix Multiplication I

cs.IT · 2026-01-23 · unverdicted · novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

cs.LG · 2026-05-24 · unverdicted · novelty 4.0

A WHT rotation plus per-coordinate activation-energy rescaling before auto-round quantization lowers WikiText-2 perplexity 15-58% versus vanilla auto-round at W2A16 on models from 135M to 1.5B parameters.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

citing papers explorer

Showing 2 of 2 citing papers after filters.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · none · ref 58

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · none · ref 216

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
