{"paper":{"title":"GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Gumbel-Softmax relaxation of discrete grid choices lets scalar quantization recover most accuracy of vector methods at 2-3 bits while staying kernel-compatible.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Alireza Dadgarnia, Dan Alistarh, Eldar Kurtic, Mahdi Nikdan, Maximilian Kleinegger, Michael Helcig, Soroush Tabesh","submitted_at":"2026-04-20T17:45:47Z","abstract_excerpt":"Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and \"second-generation\" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. 
In this paper, we ask whether this gap is fundamental, or whether a "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization and thus remains compatible with existing scalar inference kernels.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The Gumbel-Softmax relaxation of the discrete grid assignment problem converges to high-quality discrete solutions without introducing optimization bias or instability that would degrade final quantized model accuracy on held-out tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Gumbel-Softmax relaxation of discrete grid choices lets scalar quantization recover most accuracy of vector methods at 2-3 bits while staying kernel-compatible.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"99f4e86a9ea200dcbd5ac700a55965aa702bd68ed3d88bad2a4a5fcebba953de"},"source":{"id":"2604.18556","kind":"arxiv","version":2},"verdict":{"id":"dee65b4c-c261-4ea3-b5d1-6c215a638f9f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T17:56:27.981350Z","strongest_claim":"GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization and thus 
remains compatible with existing scalar inference kernels.","one_line_summary":"GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The Gumbel-Softmax relaxation of the discrete grid assignment problem converges to high-quality discrete solutions without introducing optimization bias or instability that would degrade final quantized model accuracy on held-out tasks.","pith_extraction_headline":"Gumbel-Softmax relaxation of discrete grid choices lets scalar quantization recover most accuracy of vector methods at 2-3 bits while staying kernel-compatible."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.18556/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":38,"sample":[{"doi":"","year":null,"title":"arXiv preprint arXiv:2402.11960 , year=","work_id":"373a05d0-61af-4295-9edf-d7cbf23cf54a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Symbolic discovery of optimization algorithms","work_id":"2151a7b4-dbf0-490e-8582-83731a6bc17c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","ref_index":3,"cited_arxiv_id":"1803.05457","is_internal_anchor":true},{"doi":"","year":null,"title":"Differentiable model compression via pseudo quantization noise","work_id":"05bd4657-d460-4501-8d1e-40066a9838fa","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"8-bit optimizers via block-wise quantization","work_id":"dc9d8ece-f716-493d-89a8-2a30dedcceb4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":38,"snapshot_sha256":"37420efda47702fa1afcb6dd3b82bffa56f7deea6f991939071198ddfac418b1","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"5f9d6fa25427653f7da3c3be12dc22cb4191722a953ff0dcbc0c35b691512eeb"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}