GPTQ is equivalent to Babai's nearest plane algorithm for the closest vector problem (CVP) on the Hessian lattice of layer inputs, yielding a geometric interpretation, inherited error bounds, and improved clipping-free quantization with GPU kernels.
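For readers unfamiliar with the algorithm named in the summary, here is a minimal sketch of the textbook Babai nearest plane procedure in NumPy. It is a generic illustration and not the paper's GPTQ mapping or its GPU kernels. The function name `babai_nearest_plane` and the toy basis are assumptions made for this example, and the Hessian-derived lattice is not constructed here.

```python
import numpy as np


def babai_nearest_plane(basis, target):
    """Approximate the lattice point closest to `target`.

    `basis` holds the lattice basis vectors as columns and is assumed to
    have full column rank. Returns (integer coefficients, lattice point).
    Hypothetical helper for illustration only.
    """
    basis = np.asarray(basis, dtype=float)
    target = np.asarray(target, dtype=float)

    # QR gives the Gram-Schmidt directions: basis = q @ r, r upper triangular.
    q, r = np.linalg.qr(basis)
    residual = q.T @ target  # target expressed in Gram-Schmidt coordinates

    n = basis.shape[1]
    coeffs = np.zeros(n, dtype=int)
    # Work from the last basis vector to the first, rounding one
    # coordinate at a time and subtracting the chosen multiple.
    for i in range(n - 1, -1, -1):
        coeffs[i] = int(np.round(residual[i] / r[i, i]))
        residual = residual - coeffs[i] * r[:, i]

    return coeffs, basis @ coeffs


# Toy example: basis vectors (2, 0) and (1, 2) as columns.
toy_basis = np.array([[2.0, 1.0], [0.0, 2.0]])
toy_coeffs, toy_point = babai_nearest_plane(toy_basis, np.array([2.6, 1.9]))
print(toy_coeffs, toy_point)
```

Processing coordinates one at a time while carrying the rounding error forward is the structural feature the summary compares to GPTQ's column-by-column error propagation.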
W., and Keutzer, K
23 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Verification of fixed-precision quantized FNNs is NP-complete under both LP and BV specifications, matching the rational case, while dynamic quantization with BV specs has established upper bounds complementing known PSPACE-hardness.
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Derives optimal quantizer form X=σ(F^{-1}(F_W(W))) with permutation σ minimizing MMSE under specified output distribution P_X, using majorization.
Deep learning extracts photon-by-photon arrival times from scintillation detector waveforms using unsupervised training with a physically informed model, enabling improved timing resolution and photon classification in experiments.
P2F generates low-rank parameter increments for LLM fingerprinting directly from textual descriptions in a single forward pass.
Neural decoders for surface-code QEC achieve practical microsecond FPGA latency when trained on large datasets with appropriate inductive biases and INT4 quantization, rather than relying on architectural complexity.
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
Litespark-Inference delivers custom SIMD kernels for ternary LLMs achieving up to 95.81x throughput versus PyTorch on CPUs by using integer addition/subtraction instead of floating-point math.
Output-aware EM initialization for codebooks in additive quantization avoids poor optimization basins and yields better 2-bit compressed LLMs across Llama and Qwen models.
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
PALUTE is a new PIM accelerator using in-DRAM LUTs on M3D DRAM that reports 1264 TPS at 0.16 W with 12.8x energy efficiency gains over CHIME for quantized edge LLM inference.
Thermodynamic lower bounds are approximated for exact and SGD linear regression, producing energy-aware scaling laws for optimal training dataset size given a target generalization error.
Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.
A structured review concludes that end-to-end DVS-memristor integration for analog in-memory event-driven computing remains an open challenge at TRL 2-5, with half of surveyed applications resting on projections rather than demonstrations.
Photonic computing can reshape AI acceleration through optical bandwidth and parallelism, but requires cross-layer co-design and electronic-photonic design automation to move from prototypes to scalable systems.
Empirical evaluation of quantization effects on eight LLMs across bit widths, showing performance generally declines at lower precision but with model-size-dependent resilience and acceptable accuracy at 2 bits for many cases.