Elias Frantar and Dan Alistarh
14 Pith papers cite this work.
citing papers explorer
- Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos
  Derives non-asymptotic 2-norm and infinity-norm error bounds for deterministic and stochastic variants of OPTQ and Qronos PTQ algorithms.
- SpinQuant: LLM quantization with learned rotations
  SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
- LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
  LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
- SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
  Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
- GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
  GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
  Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
- S2O: Early Stopping for Sparse Attention via Online Permutation
  S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.
- CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
  CoreQ delivers adaptive mismatch correction via a closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
- GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
  GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
  BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
- SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
  SignRoundV2 uses gradient-based adaptive mixed-precision allocation and stabilization techniques to achieve near-lossless performance in mixed MXFP quantization at average 4.5 bits and improved accuracy at 2-bit weights for LLMs.
- Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches
  Post-training N:M activation pruning preserves generative performance in LLMs better than equivalent weight pruning, with the 8:16 pattern emerging as a practical hardware-friendly choice.
- LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
  LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.
- ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
  ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.