The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Dan Alistarh; Elvir Crnčević; Jiale Chen; Torsten Hoefler; Yalda Shabanzadeh

arXiv: 2507.18553 · v4 · pith:AKBT3LNKnew · submitted 2025-07-24 · 💻 cs.LG · cs.DS · cs.IT · math.IT

Pith reviewed 2026-05-19 02:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.DS · cs.IT · math.IT

keywords: LLM quantization, GPTQ, Babai nearest plane, closest vector problem, Hessian lattice, error bounds, post-training quantization, lattice algorithms

The pith

GPTQ quantization of linear layers equals Babai's nearest plane algorithm on Hessian lattices

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that running GPTQ from the last dimension to the first for a linear layer produces exactly the same quantized weights as Babai's nearest plane algorithm for the closest vector problem. The lattice here is built from the Hessian matrix of the layer's input activations. A sympathetic reader would care because this turns opaque algebraic steps into a clear geometric process and supplies a worst-case error bound that was missing before. The authors then use the bound to build clipping-free quantization variants that outperform the original GPTQ on accuracy while also releasing efficient GPU inference kernels.

Core claim

When executed back-to-front for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence gives GPTQ's error propagation step a geometric interpretation as successive projections and lets GPTQ inherit the error upper bound of Babai's algorithm, provided no weights are clipped. The authors leverage the bound to design new post-training quantization methods that avoid clipping and exceed original GPTQ performance, together with efficient GPU kernels for the resulting representation.
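
For readers who want the lattice written out, one standard way to pose the layer-wise problem is sketched below. The symbols are assumptions made for illustration and are not taken from the paper: w is one output row of the weight matrix (n entries), X is an n × m matrix that stacks m calibration inputs as columns, s > 0 is a uniform grid step, and constant factors in the Hessian are dropped.

```latex
\min_{z \in \mathbb{Z}^n} \bigl\lVert X^\top (w - s z) \bigr\rVert_2^2
  \;=\; \min_{z \in \mathbb{Z}^n} (w - s z)^\top H \, (w - s z),
  \qquad H = X X^\top .

% With a Cholesky factorization H = U^\top U, where U is upper triangular:
(w - s z)^\top H \, (w - s z) \;=\; \bigl\lVert U w - (s U) z \bigr\rVert_2^2 .
```

Under this reading, quantizing the row is a closest vector problem with target Uw and a lattice basis given by the columns of sU. Clipping restricts z to a finite box, which is why the correspondence with the unconstrained lattice problem is stated only for the no-clipping case.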

What carries the argument

Mathematical equivalence between reverse-order GPTQ weight updates and Babai's nearest plane steps on the lattice generated by the layer input Hessian

If this is right

GPTQ error propagation receives a geometric interpretation as successive nearest-plane projections.
GPTQ inherits an explicit error upper bound from lattice algorithms when clipping is avoided.
New clipping-free post-training quantization methods can be built that outperform the original GPTQ.
Efficient GPU inference kernels become available for the improved quantized representation.
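
The inherited guarantee can be stated compactly in its textbook form. The notation below is assumed for illustration (B is the lattice basis, b_i^* are its Gram-Schmidt vectors, t is a target in the span of B, and z holds the integer coefficients returned by the nearest plane procedure without basis reduction); the paper's own statement may be phrased differently.

```latex
\lVert t - B z \rVert_2^2 \;\le\; \frac{1}{4} \sum_{i=1}^{n} \lVert b_i^{*} \rVert_2^2
```

Each rounding step leaves at most half a Gram-Schmidt length of error along its own direction, and these directions are mutually orthogonal, so the squared errors add. The argument needs every rounded coefficient to be admissible, which is why the bound applies only when no weight is clipped.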

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other lattice algorithms such as LLL reduction could be adapted to create stronger quantization procedures for billion-parameter models.
The geometric framing might extend to quantization of attention or embedding layers beyond simple linear layers.
Direct empirical checks on full-scale transformers could test whether the theoretical bounds predict actual accuracy gains.

Load-bearing premise

The equivalence and inherited error bound hold only when no weights are clipped during quantization.

What would settle it

Apply both reverse GPTQ and Babai's nearest plane algorithm to the same small linear layer with a known Hessian, then check whether the quantized weights and per-step error vectors are identical.
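
The check described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: the function names, the unit grid step, and the random test layer are all assumptions, and the GPTQ side is written in the slow OBQ-style form (one explicit inverse-Hessian column per step) rather than the Cholesky-based updates a production implementation would use.

```python
import math
import random

def cholesky_upper(H):
    """Return upper-triangular U with H = U^T U (H symmetric positive definite)."""
    n = len(H)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = H[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            U[i][j] = math.sqrt(s) if i == j else s / U[i][i]
    return U

def babai_nearest_plane(H, w):
    """Babai's nearest plane for target U w on the lattice spanned by the columns of U."""
    n = len(w)
    U = cholesky_upper(H)
    t = [sum(U[i][j] * w[j] for j in range(i, n)) for i in range(n)]  # target U w
    z = [0] * n
    for i in reversed(range(n)):        # last basis vector first
        z[i] = round(t[i] / U[i][i])    # index of the nearest hyperplane
        for k in range(i + 1):          # subtract z_i times column i of U
            t[k] -= z[i] * U[k][i]
    return z

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gptq_back_to_front(H, w):
    """Round the last coordinate first and push each rounding error onto the rest."""
    w = list(w)
    n = len(w)
    q = [0] * n
    for i in reversed(range(n)):
        q[i] = round(w[i])
        sub = [row[: i + 1] for row in H[: i + 1]]  # Hessian of unquantized coordinates
        col = solve(sub, [0.0] * i + [1.0])         # column i of the inverse of sub
        err = (w[i] - q[i]) / col[i]
        for k in range(i):                          # error-compensation update
            w[k] -= err * col[k]
    return q

# Hypothetical small layer: 5 weights, 12 random calibration inputs.
random.seed(0)
n, m = 5, 12
X = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
H = [[sum(X[i][k] * X[j][k] for k in range(m)) for j in range(n)] for i in range(n)]
w = [random.uniform(-4.0, 4.0) for _ in range(n)]
print("Babai:", babai_nearest_plane(H, w))
print("GPTQ :", gptq_back_to_front(H, w))
```

The two routines share no intermediate quantities (one works from a Cholesky factor, the other from inverse-Hessian columns), so agreement on the integer outputs is a meaningful, if small-scale, check. Comparing the per-step residuals as well would require returning the intermediate vectors, which is omitted here for brevity.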

Figures

Figures reproduced from arXiv: 2507.18553 by Dan Alistarh, Elvir Crnčević, Jiale Chen, Torsten Hoefler, Yalda Shabanzadeh.

**Figure 1.** Upper row: (a) CVP in a two-dimensional lattice; (b) Basis reduction can find a shorter, more orthogonal basis that can potentially improve the results; (c-d) The projection steps in Babai’s nearest plane algorithm. Lower row: rounding boundaries of (e) optimal rounding or Voronoi cells; (f) round-to-nearest (RTN); (g) Babai’s nearest plane algorithm without basis reduction; (h) Babai’s algorithm without b…

**Figure 2.** Equivalence of OBQ’s error propagation and Babai’s projection. (a) 3D plot showing the target being projected onto the nearest plane. (b) 3D plot showing how the projection error is propagated. (c) 2D plot showing the vectors on the nearest hyperplane in (a-b). (d) 2D plot showing the vectors on the orthogonal projection plane in (b). Theorem 2 (Error Propagation and Babai’s projection): Babai’s nearest pla…

**Figure 3.** Geometric interpretation of OBQ’s quantization order. This 2D plot shows the target being projected onto the nearest plane. Corollary 3 (OBQ Dimension Selection): At each dimension selection step (Eq. 1), OBQ selects the not-yet-quantized dimension j such that the nearest hyperplane of dimension j is the closest to the target residual vector. Proof: We use the same notations defined in Theorem 2 …

**Figure 4.** (a) Comparison of quantization methods (RTN, GPTQ, HRTN, HPTQ, and SSQR with 1 to 5% outliers) on Qwen3-8B evaluated on WikiText-2. Perplexity is plotted against the average effective bitwidth per weight, with the BF16 baseline shown as a horizontal line. HPTQ has the best (lowest) perplexity. See Section D.3 for zero-shot evaluation results. (b) Scaling behavior of HPTQ across multiple model sizes (0.6B, 1.7…

**Figure 5.** Layer-wise inference speedup of the SSQR kernel over the PyTorch BF16 baseline on Qwen3-8B across inlier bitwidths, outlier rates, and batch sizes on an A6000 GPU.

Original abstract

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GPTQ run backwards matches Babai's nearest plane on the Hessian lattice, which lets them drop clipping and beat the original method empirically, though the carried-over error bound is useless at LLM widths.

The main takeaway is that GPTQ, when executed from the last dimension to the first, is identical to Babai's nearest plane algorithm for the closest vector problem on the lattice defined by the layer's input Hessian. The authors use the link to build clipping-free quantizers that report better accuracy than standard GPTQ at the same bit width, and they ship GPU kernels for the new format. The geometric view also gives a cleaner picture of how quantization errors move through the updates. That connection is the actual new content; earlier descriptions of GPTQ stayed at the level of algebraic steps without this lattice framing. The empirical gains from avoiding clipping are concrete and worth noting.

The soft spot is the error bound. Babai's guarantee is an approximation factor of at most 2^{n/2}, which for n near 4096 is far too large to constrain anything in practice. The paper flags the no-clipping assumption, but even then the bound does not deliver a useful worst-case statement for these model sizes. The improvements therefore rest on the empirical side rather than on any tight analytical control. The abstract itself describes the equivalence as resting on a sophisticated argument, so a referee would need to check the steps carefully.

This paper is for people working on post-training quantization and efficient inference. A reader who wants a different angle on why GPTQ works, or who is open to lattice ideas in ML, will find something usable. It deserves peer review because the claimed identity is fresh and the practical results are shown, even if the theoretical payoff on bounds is limited.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that, when executed back-to-front for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the closest vector problem on the lattice defined by the Hessian of the layer inputs. This equivalence supplies a geometric reading of GPTQ's error-propagation step and lets GPTQ inherit Babai's worst-case error bound under the assumption that no weights are clipped. The authors exploit the bound to construct clipping-free quantization procedures that outperform the original GPTQ and release efficient GPU inference kernels.

Significance. If the claimed equivalence is correct, the work supplies a lattice-theoretic foundation for a widely used quantization algorithm and opens a route for importing results from lattice algorithms into LLM quantization. The release of source code and GPU kernels is a concrete strength that supports reproducibility. The practical significance is limited, however, because the inherited Babai bound is exponential in dimension and therefore vacuous for typical layer widths.

major comments (1)

[Abstract and error-bound derivation] Abstract and the section deriving the error bound: the claim that GPTQ inherits a useful error upper bound from Babai's nearest-plane procedure is load-bearing for the analytical consequences and for the motivation of the no-clipping methods. Babai's bound is at most 2^{n/2} (or worse without basis reduction); for n ≈ 4096 this exceeds 10^{600} and supplies no practical guarantee. The manuscript must either supply a tighter analysis that remains valid after the equivalence or explicitly state that the bound is only of theoretical interest.
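
The size of the bound quoted in this comment is easy to confirm with a few lines of arithmetic. The snippet below is only a sanity check of the stated figure, taking n = 4096 from the comment itself; the variable names are illustrative.

```python
import math

n = 4096                               # layer width used in the referee's example
exponent = (n / 2) * math.log10(2)     # log10 of 2^(n/2)
digits = len(str(2 ** (n // 2)))       # exact decimal length of 2^2048

print(f"2^(n/2) ~ 10^{exponent:.1f} ({digits} decimal digits)")
# prints: 2^(n/2) ~ 10^616.5 (617 decimal digits)
```

The factor is roughly 10^616, so the statement that it exceeds 10^{600} holds with room to spare.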

minor comments (2)

[Geometric interpretation section] The geometric interpretation of the error-propagation step would be clearer if the lattice planes and the successive projections were illustrated with a low-dimensional diagram.
[Notation and preliminaries] Notation for the lattice basis and the Hessian matrix should be introduced once and used consistently; occasional redefinition of symbols slows reading.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below.

Referee: [Abstract and error-bound derivation] Abstract and the section deriving the error bound: the claim that GPTQ inherits a useful error upper bound from Babai's nearest-plane procedure is load-bearing for the analytical consequences and for the motivation of the no-clipping methods. Babai's bound is at most 2^{n/2} (or worse without basis reduction); for n ≈ 4096 this exceeds 10^{600} and supplies no practical guarantee. The manuscript must either supply a tighter analysis that remains valid after the equivalence or explicitly state that the bound is only of theoretical interest.

Authors: We agree that Babai's worst-case bound is exponential in dimension and therefore supplies no practical guarantee for typical LLM layer widths. The manuscript's primary contribution is the exact equivalence between back-to-front GPTQ and Babai's nearest-plane algorithm on the Hessian lattice; this equivalence furnishes a geometric interpretation of the sequential error-compensation step. The inherited bound is invoked only under the explicit assumption of no clipping and is not presented as a practical error certificate. We will revise the abstract and the error-bound section to state clearly that the bound is of theoretical interest only. The clipping-free procedures are motivated by the geometric perspective and are justified by their empirical superiority over GPTQ; they do not rely on the numerical tightness of the Babai bound. A tighter, dimension-independent analysis valid after the equivalence would require substantial new technical work and lies outside the scope of the present manuscript.
Revision: yes

Circularity Check

0 steps flagged

No circularity: core claim is explicit equivalence to external Babai algorithm with independent mathematical argument.

full rationale

The paper's central derivation establishes a mathematical identity between the back-to-front execution of GPTQ and Babai's nearest-plane algorithm on a Hessian-defined lattice. This is presented as a direct equivalence supported by a 'sophisticated mathematical argument' rather than any self-referential fit, ansatz smuggling, or load-bearing self-citation. The error-bound inheritance is explicitly conditioned on the no-clipping assumption and draws from the well-known external Babai result, not from prior work by the same authors. No step reduces a prediction or uniqueness claim to a fitted input or self-citation chain by construction. The derivation is therefore self-contained against external lattice-algorithm benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the no-clipping assumption to inherit the error bound; the abstract mentions no free parameters or invented entities.

axioms (1)

domain assumption: No weights are clipped during the quantization process.
Required for GPTQ to inherit the error upper bound of Babai's algorithm, as stated in the abstract.

pith-pipeline@v0.9.0 · 5816 in / 1185 out tokens · 38606 ms · 2026-05-19T02:43:43.927591+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "GPTQ ... is mathematically identical to Babai’s nearest plane algorithm ... on a lattice defined by the Hessian matrix"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "Babai’s error bound ... 2^{n/2} approximation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Rate Quantized Matrix Multiplication II
cs.LG 2026-05 unverdicted novelty 6.0

Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.

CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
cs.LG 2026-02 unverdicted novelty 6.0

CoreQ delivers adaptive mismatch correction via a closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.

High-Rate Quantized Matrix Multiplication I
cs.IT 2026-01 unverdicted novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
