24 Pith papers cite this work.
citing papers explorer
- The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
  GPTQ is equivalent to Babai's nearest plane algorithm for CVP on the Hessian lattice of layer inputs, yielding geometric interpretation, inherited error bounds, and improved clipping-free quantization with GPU kernels.
- The Complexity of Verifying Feedforward Neural Networks in Quantised Settings
  Verification of fixed-precision quantized FNNs is NP-complete under both LP and BV specifications, matching the rational case, while dynamic quantization with BV specs has established upper bounds complementing known PSPACE-hardness.
- When Bits Break Recourse: Counterfactual-Faithful Quantization
  CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
- Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
  QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
- DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
  DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score
  Introduces TQS metric and TQS-PTQ framework that uses dynamical-systems stability to enable a priori, calibration-free mixed-precision post-training quantization for time-series models.
- Minimum Distortion Quantization with Specified Output Distribution
  Derives optimal quantizer form X=σ(F^{-1}(F_W(W))) with permutation σ minimizing MMSE under specified output distribution P_X, using majorization.
- Machine learning enables experimental access to photon-by-photon arrival times in scintillation detectors
  Deep learning extracts photon-by-photon arrival times from scintillation detector waveforms using unsupervised training with a physically informed model, enabling improved timing resolution and photon classification in experiments.
- Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
  P2F generates low-rank parameter increments for LLM fingerprinting directly from textual descriptions in a single forward pass.
- Rethink the Role of Neural Decoders in Quantum Error Correction
  Neural decoders for surface-code QEC achieve practical microsecond FPGA latency when trained on large datasets with appropriate inductive biases and INT4 quantization, rather than relying on architectural complexity.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
  LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
- Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models
  Litespark-Inference delivers custom SIMD kernels for ternary LLMs achieving up to 95.81x throughput versus PyTorch on CPUs by using integer addition/subtraction instead of floating-point math.
- Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
  Output-aware EM initialization for codebooks in additive quantization avoids poor optimization basins and yields better 2-bit compressed LLMs across Llama and Qwen models.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
- PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference
  PALUTE is a new PIM accelerator using in-DRAM LUTs on M3D DRAM that reports 1264 TPS at 0.16 W with 12.8x energy efficiency gains over CHIME for quantized edge LLM inference.
- The Thermodynamic Costs of Simple Linear Regression
  Thermodynamic lower bounds are approximated for exact and SGD linear regression, producing energy-aware scaling laws for optimal training dataset size given a target generalization error.
- Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence
  Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
- $\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space
  Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.
- Memristor Technologies for Dynamic Vision Sensors: A Critical Assessment and Research Roadmap
  A structured review concludes that end-to-end DVS-memristor integration for analog in-memory event-driven computing remains an open challenge at TRL 2-5, with half of surveyed applications resting on projections rather than demonstrations.
- Harnessing Photonics for Machine Intelligence
  Photonic computing can reshape AI acceleration through optical bandwidth and parallelism, but requires cross-layer co-design and electronic-photonic design automation to move from prototypes to scalable systems.
- K-Quantization and its Impact on Output Performance
  Empirical evaluation of quantization effects on eight LLMs across bit widths, showing performance generally declines at lower precision but with model-size-dependent resilience and acceptable accuracy at 2 bits for many cases.
- You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations