Learned step size quantization

Esser, S · 2019 · arXiv 1902.08153

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

When Bits Break Recourse: Counterfactual-Faithful Quantization

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.

Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

cs.CV · 2024-08-01 · unverdicted · novelty 7.0

CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

cs.CL · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.

AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization

cs.CV · 2025-03-05 · unverdicted · novelty 6.0

AHCQ-SAM introduces ACNR, HLUQ, CAG, and LNQ quantization techniques that deliver 15.2% mAP gain on 4-bit SAM-B and 14.01% J&F gain on 4-bit SAM2-Tiny versus prior PTQ methods.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

cs.CL · 2023-06-01 · conditional · novelty 6.0

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

CAR-SAM introduces MatMul-Aware Compensation and Joint Cross-Attention Reconstruction to enable stable 4-bit post-training quantization of SAM, outperforming prior PTQ methods by 14.6% mAP on SAM-B and 6.6% on SAM-L.

Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

cs.CV · 2026-05-02 · unverdicted · novelty 5.0

Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

cs.LG · 2026-04-05 · unverdicted · novelty 5.0

BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

cs.LG · 2026-04-05 · unverdicted · novelty 4.0

The prune-quantize-distill ordering produces a better accuracy-size-latency frontier on CIFAR-10/100 than any single technique or other orderings, with INT8 QAT providing the main runtime gain.

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

cs.DC · 2026-04-24 · unverdicted · novelty 3.0

A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

citing papers explorer

Showing 16 of 16 citing papers.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU cs.DC · 2026-05-20 · conditional · none · ref 13
When Bits Break Recourse: Counterfactual-Faithful Quantization cs.LG · 2026-05-16 · unverdicted · none · ref 9
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling cs.LG · 2025-09-03 · unverdicted · none · ref 17
Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization cs.CV · 2024-08-01 · unverdicted · none · ref 11
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 131
Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization cs.CV · 2026-05-14 · unverdicted · none · ref 10
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale cs.LG · 2026-05-11 · unverdicted · none · ref 26 · 2 links
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization cs.LG · 2026-05-06 · unverdicted · none · ref 5 · 2 links
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling cs.CL · 2026-04-20 · unverdicted · none · ref 10 · 2 links
AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization cs.CV · 2025-03-05 · unverdicted · none · ref 5
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration cs.CL · 2023-06-01 · conditional · none · ref 11
CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model cs.CV · 2026-05-16 · unverdicted · none · ref 3
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 12
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design cs.LG · 2026-04-05 · unverdicted · none · ref 56
Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression cs.LG · 2026-04-05 · unverdicted · none · ref 15
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities cs.DC · 2026-04-24 · unverdicted · none · ref 39
