"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
read the original abstract
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3\%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale -- ensuring the best balance between speed, efficiency, and accuracy.
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
-
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
GPTQ is equivalent to Babai's nearest plane algorithm for CVP on the Hessian lattice of layer inputs, yielding geometric interpretation, inherited error bounds, and improved clipping-free quantization with GPU kernels.
-
Reliable Evaluation Protocol for Low-Precision Retrieval
Proposes High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM) to reduce tie-induced instability in low-precision retrieval evaluation.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.