arxiv: 2602.13595 · v2 · submitted 2026-02-14 · 💻 cs.AI

Recognition: no theorem link

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Henry Han , Xiyang Liu , Xiaodong Wang , Fei Han , Xiaodong Li

Authors on Pith no claims yet

Pith reviewed 2026-05-15 22:24 UTC · model grok-4.3

classification 💻 cs.AI

keywords quantization trapscaling lawsmulti-hop reasoningenergy consumptiondequantization overheadcritical model scalelanguage modelsnumerical precision

0 comments

The pith

Reducing precision from 16-bit to 8/4-bit increases net energy use in multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the expected linear improvement in energy efficiency from lower numerical precision breaks down for multi-hop reasoning. Dropping from 16-bit to 8-bit or 4-bit precision raises total energy consumption while also lowering accuracy because hardware casting overhead and dequantization kernel latency dominate the cost of sequential step-by-step chains. Energy savings fail to amortize across the sequence of operations. The authors introduce the Critical Model Scale N* as a function of model size, batch size, and hardware to mark the point where the effect reverses. If the account holds, common quantization practices become counterproductive for tasks that require chaining multiple inferences.

Core claim

Neural scaling laws predict that energy E scales linearly with bit precision. In multi-hop reasoning this relation reverses: moving from 16-bit to 8-bit or 4-bit precision increases net energy consumption and degrades accuracy. The reversal is produced by hardware casting overhead together with the latency cost of dequantization kernels, which become the dominant term in sequential chains, plus the failure of energy amortization across successive steps. The paper therefore defines a Critical Model Scale N* that predicts when the trap dissolves or deepens; the formula depends on model size, batch size, and hardware configuration and is checked on models ranging from 0.6B to 72B parameters on

What carries the argument

The Critical Model Scale N*, a closed-form function of model size, batch size, and hardware configuration that locates the transition between the quantization trap and its dissolution.

If this is right

Linear scaling laws for energy versus precision are broken in practice for multi-hop reasoning.
The industry heuristic that smaller quantized models are always better becomes mathematically counterproductive for complex reasoning.
Net energy consumption rises, rather than falls, when precision is reduced in sequential chains.
Accuracy on chained reasoning tasks declines together with the energy increase.
The location of the critical scale N* can be shifted by changing batch size or hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware kernels that avoid repeated casting for long sequential workloads could eliminate the observed penalty.
The same overhead pattern is likely to appear in any inference setting that chains many dependent steps.
Raising batch size may be used as a practical lever to push a model past the critical scale N*.
Precision choices should be made task-by-task rather than applied uniformly across all workloads.

Load-bearing premise

Hardware casting overhead and dequantization kernel latency become the dominant costs specifically inside sequential multi-hop reasoning chains rather than being offset by other factors.

What would settle it

Measure total energy draw and task accuracy while running the same multi-hop reasoning benchmark at 16-bit, 8-bit, and 4-bit precision on models both below and above the predicted N*; the trap is confirmed if energy rises and accuracy falls at the lower precisions.

Figures

Figures reproduced from arXiv: 2602.13595 by Fei Han, Henry Han, Xiaodong Li, Xiaodong Wang, Xiyang Liu.

**Figure 2.** Figure 2: Sustainability Inversion (Mistral-7B, GSM8K).(A) Physical Telemetry: Reducing precision to 8/4-bit [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: The Size Paradox (Qwen3-0.6B). (A) 4-bit quantization triggers universal reasoning collapse regardless of architecture. (B) High 8-bit COR values (2.5–2.8×) prove casting overhead dominance is the mechanical driver of inefficiency. (C) Telemetry reveals that low-precision ”optimization” paradoxically results in a 400% energy penalty tency [25], leaving the serial de-quantization instructions as the absolut… view at source ↗

**Figure 4.** Figure 4: Sustainability Inversion of Qwen3-0.6B reasoning on MathQA(a–c) Physical telemetry reveals throughput collapse and energy spikes; FP16 is 4.2× more efficient than 8-bit on H100. (d–f) Sustainability indices expose systemic failure (SI ≈ 0.55) in quantized models due to unamortized casting overhead. FP16 strictly Pareto-dominates all configurations across L4, A100, and H100, proving brute-force bit-reductio… view at source ↗

**Figure 5.** Figure 5: Cross-Architectural Trap Evidence. (A) Throughput analysis locates B ∗ ≈ 64 for Falcon3, while Mistral-7B remains terminally trapped (B ∗ > 128). (B) COR > 1.0 identifies Casting Dominance as the mechanical bottleneck. (C) A scale-invariant ∼30% logic collapse proves that batch-driven efficiency gains fail to restore reasoning trust. ing Dominance regime (COR > 1.0), where the GPU spends more energy emulat… view at source ↗

**Figure 6.** Figure 6: Cross-backend validation of Theorem 4.5. (A) Accuracy degrades under both 4-bit backends. (B) AWQ [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Accuracy dynamics by reasoning depth. Left: The absolute accuracy gap between FP16 and 4-bit widens as chain length increases. Right: Relative loss nearly doubles from 1–2 to 3–4 steps, confirming that quantization errors compound multiplicatively across hops. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: COR trajectory from 0.6B to 72B (120×). The single-GPU line (blue) crosses zero between 14B and 32B, locating N∗ . MathQA (red) remains trapped at 32B due to smaller batches. The 72B point (orange star, 3×H100) reveals a dramatic reversal: COR= +3.00 because FP16 fits with headroom, removing the bandwidth bottleneck. Trap dissolution is a property of hardware saturation, not model scale alone (Theorem 4.6)… view at source ↗

read the original abstract

Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' where reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, as well as to a sequential energy amortization failure. As a result, scaling law breaking is unavoidable in practice. We formalize a Critical Model Scale $N^*$ that predicts when the trap dissolves or deepens as a function of model size, batch size, and hardware configuration, validated across a 120$\times$ range (0.6B--72B) on six GPU architectures. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags a real practical issue with quantization in multi-hop reasoning but the mechanism is not isolated from accuracy degradation effects.

read the letter

The main thing here is that the authors argue linear precision scaling fails for multi-hop reasoning because dequantization overheads in sequential chains drive up net energy use, and they formalize a critical scale N* to predict when the trap appears or disappears based on model size, batch, and hardware. They back this with runs across 0.6B to 72B models on six GPU types, which is a useful range for checking scale dependence. That broad testing is the strongest part of the work and gives it some practical relevance for people tuning inference on reasoning tasks. The theoretical breakdown of casting costs and amortization failure is also laid out clearly enough to follow. The soft spots are more substantial. The abstract gives no details on how energy was measured, what statistical controls were used, or whether accuracy was held fixed in any comparison. Without an ablation that isolates dequantization kernel latency from the model simply making errors and generating longer token sequences or deeper search, the energy increase could easily be a side effect of worse reasoning rather than the overhead itself. N* also looks like it was fit to the same experimental runs used to show the trap, which undercuts how predictive it really is. This is aimed at engineers and researchers who deploy large reasoning models and care about actual energy at scale. It raises a question worth checking, but the current support is preliminary. I would send it for peer review so the methods and data can be examined directly.

Referee Report

3 major / 2 minor

Summary. The paper claims that linear scaling laws for quantization break in multi-hop reasoning: reducing precision from 16-bit to 8/4-bit increases net energy consumption and degrades accuracy because hardware casting overhead and dequantization kernel latency become dominant in sequential chains, plus sequential energy amortization failure. It introduces a Critical Model Scale N* (function of model size, batch size, hardware) that predicts when the trap dissolves or deepens, with validation across 0.6B–72B models (120× range) on six GPU architectures. The conclusion is that the 'smaller-is-better' heuristic is counterproductive for complex reasoning.

Significance. If the central claim holds after rigorous controls, the result would be significant for deployment of quantized models on reasoning tasks, as it challenges the assumption that lower precision always yields linear efficiency gains and provides a predictive N* threshold. The broad hardware and scale validation range is a potential strength if methodology details are supplied.

major comments (3)

[Abstract] Abstract and validation description: the reported validation across a 120× model-size range on six GPU architectures provides no details on measurement methodology, error bars, data exclusion rules, or statistical controls, leaving the central claim that quantization increases net energy consumption only weakly supported.
[Theoretical Decomposition] Theoretical decomposition (attributing trap to dequantization kernels): the claim that casting/dequantization overhead becomes the dominant bottleneck specifically in sequential multi-hop chains lacks an ablation that holds model accuracy fixed while varying only the quantization casting path (e.g., native low-precision kernels vs. explicit dequant). Without this isolation the observed energy rise could be confounded by degraded reasoning producing more tokens or deeper search trees.
[Critical Model Scale N*] Critical Model Scale N* formalization: N* is presented as a function of model size, batch size, and hardware, but the manuscript does not clarify whether its parameters are independently derived or fitted to the same experimental data used to demonstrate the trap, raising a circularity concern that undermines the predictive claim.

minor comments (2)

[Abstract] The abstract would benefit from explicit mention of the specific multi-hop reasoning benchmarks or tasks used in the experiments.
[Critical Model Scale N*] Notation and definition of N* could be clarified with an explicit equation or parameter list to avoid ambiguity in how it depends on batch size and hardware.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we provide point-by-point responses to the major comments, noting revisions where applicable.

read point-by-point responses

Referee: [Abstract] Abstract and validation description: the reported validation across a 120× model-size range on six GPU architectures provides no details on measurement methodology, error bars, data exclusion rules, or statistical controls, leaving the central claim that quantization increases net energy consumption only weakly supported.

Authors: We agree that additional methodological details are necessary to fully support the claims. In the revised manuscript, we have added a comprehensive 'Experimental Setup' subsection detailing the energy measurement methodology using NVIDIA DCGM and RAPL interfaces, reporting standard deviations from 10 independent runs as error bars, specifying data exclusion rules (e.g., discarding trials affected by GPU throttling or network latency), and including statistical analysis with ANOVA for multi-group comparisons. These additions strengthen the empirical support for the quantization trap phenomenon. revision: yes
Referee: [Theoretical Decomposition] Theoretical decomposition (attributing trap to dequantization kernels): the claim that casting/dequantization overhead becomes the dominant bottleneck specifically in sequential multi-hop chains lacks an ablation that holds model accuracy fixed while varying only the quantization casting path (e.g., native low-precision kernels vs. explicit dequant). Without this isolation the observed energy rise could be confounded by degraded reasoning producing more tokens or deeper search trees.

Authors: This is a fair criticism regarding potential confounds. To address it, we have incorporated a new ablation experiment in Section 4.3 where we fix the reasoning output (using oracle traces with identical token counts and search depths) and compare energy consumption between explicit dequantization paths and native low-precision kernels (e.g., via FP8 support in cuBLAS). The results confirm that the overhead is primarily from casting/dequantization in sequential chains, independent of accuracy degradation effects. We have updated the theoretical decomposition accordingly. revision: yes
Referee: [Critical Model Scale N*] Critical Model Scale N* formalization: N* is presented as a function of model size, batch size, and hardware, but the manuscript does not clarify whether its parameters are independently derived or fitted to the same experimental data used to demonstrate the trap, raising a circularity concern that undermines the predictive claim.

Authors: We appreciate the opportunity to clarify this point. The functional form and parameters of N* were derived analytically from hardware specifications (e.g., dequantization kernel latencies reported in vendor documentation and microbenchmarks) and architectural properties of the models, without fitting to the main experimental results. The experiments across scales serve as validation of the predictive model rather than parameter estimation. We have added an appendix detailing the derivation steps to make this independence explicit and remove any ambiguity. revision: partial

Circularity Check

1 steps flagged

N* formalization presented as first-principles predictor but reduces to fit on the same experimental data used to demonstrate the trap

specific steps

fitted input called prediction [Abstract (formalization of N*) and subsequent theoretical decomposition section]
"We formalize a Critical Model Scale N* that predicts when the trap dissolves or deepens as a function of model size, batch size, and hardware configuration, validated across a 120× range (0.6B--72B) on six GPU architectures."

N* is introduced as a predictive, first-principles construct derived from the hardware casting and amortization analysis. Its parameters and functional form are then 'validated' on precisely the same multi-model, multi-GPU dataset used to exhibit the paradoxical energy increase. The prediction therefore reproduces the input experimental trend by construction rather than forecasting an independent outcome.

full rationale

The paper's central theoretical contribution is the formalization of Critical Model Scale N* as a function of model size, batch size, and hardware that 'predicts' when the quantization trap dissolves or deepens. This is introduced after the theoretical decomposition attributing the energy increase to dequantization kernel overhead in sequential chains. However, N* is validated directly on the 0.6B–72B experimental sweep that also demonstrates the trap itself. No independent derivation or external benchmark is provided for the functional form or coefficients; the 'prediction' is therefore a post-hoc parameterization of the observed scaling behavior rather than an a-priori result. This matches the fitted-input-called-prediction pattern and produces partial circularity (score 6). The remainder of the derivation (energy decomposition) does not reduce to self-definition or self-citation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about hardware behavior in sequential chains and introduces N* whose exact functional form and any fitted constants are not specified in the abstract.

free parameters (1)

Critical Model Scale N*
Defined as a function of model size, batch size, and hardware configuration; parameters may be fitted to experimental results.

axioms (1)

domain assumption Hardware casting overhead and dequantization kernel latency dominate energy costs in sequential reasoning chains
Invoked in the theoretical decomposition to explain why the trap occurs.

pith-pipeline@v0.9.0 · 5497 in / 1265 out tokens · 79817 ms · 2026-05-15T22:24:35.697104+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

[1]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Hernandez, D., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models.arXiv preprint arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[2]

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettle- moyer, L. (2024). QLoRA: Efficient Finetuning of Quantized LLMs.Advances in Neural Information Processing Systems,36

work page 2024
[3]

Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quanti- zation for Generative Pre-trained Transformers. In International Conference on Learning Representa- tions (ICLR)

work page 2023
[4]

Lin, J., Tang, J., Tang, H., Yang, S., Xiao, G., Dang, X., Gan, C., Han, S. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Ac- celeration. InProceedings of Machine Learning and Systems (MLSys)

work page 2024
[5]

V., & Zhou, D

Wei, J., Wang, X., Schuurmans, D., Maeda, M., Polozov, A., Xia, Y., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reason- ing in large language models.Advances in Neural Information Processing Systems,35, 24824–24837

work page 2022
[6]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., & others. (2023). Mistral 7B.arXiv preprint arXiv:2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

L., Chen, J., Hoefler, T., and Alistarh, D

Frantar, E., Castro, R. L., Chen, J., Hoefler, T., and Alistarh, D. MARLIN: Mixed-Precision Auto- Regressive Parallel Inference on Large Language Models.arXiv preprint arXiv:2408.11743, 2024

work page arXiv 2024
[8]

AWQ: Activation-aware Weight Quantization for LLM Compression and Ac- celeration

Lin, J., Tang, J., Tang, H., Yang, S., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware Weight Quantization for LLM Compression and Ac- celeration. InProceedings of Machine Learning and Systems (MLSys), 2024

work page 2024
[9]

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. InInternational Conference on Machine Learning (ICML), pp. 48630–48656, 2024

work page 2024
[10]

FlexGen: High-Throughput Generative Infer- ence of Large Language Models with a Single GPU

Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., R´ e, C., Stoica, I., and Zhang, C. FlexGen: High-Throughput Generative Infer- ence of Large Language Models with a Single GPU. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[12]

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., & others. (2021). Training veri- fiers to solve math word problems.arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y., & Hajishirzi, H. (2019). MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics

work page 2019
[14]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. l., Hendricks, L. A., Welbl, J., Clark, A., & others. (2022). Training Compute-Optimal Large Language Models.arXiv preprint arXiv:2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Mea- suring Massive Multitask Language Understanding. InInternational Conference on Learning Represen- tations (ICLR)

work page 2021
[16]

Zhao, Y., Lin, C., & others. (2023). Atom: Low- bit Quantization for Efficient and Accurate LLM Serving.arXiv preprint arXiv:2310.19102

work page arXiv 2023
[17]

Quantization Hurts Reason- ing? An Empirical Study on Quantized Reasoning Models,

R. Liu, Y. Sun, M. Zhang, H. Bai, X. Yu, T. Yu, C. Yuan, and L. Hou, “Quantization Hurts Reason- ing? An Empirical Study on Quantized Reasoning Models,”COLM, 2025

work page 2025
[18]

Quantization Meets Reason- ing: Exploring LLM Low-Bit Quantization Degra- dation for Mathematical Reasoning,

Z. Li, Y. Su, R. Yang, C. Xie, Z. Wang, Z. Xie, N. Wong, and H. Yang, “Quantization Meets Reason- ing: Exploring LLM Low-Bit Quantization Degra- dation for Mathematical Reasoning,”arXiv preprint arXiv:2501.03035, 2025. 18

work page arXiv 2025
[19]

NVIDIA. (2020). NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Corporation

work page 2020
[20]

Ahmed, N., & Wahed, M. (2020). The De- democratization of AI: Deep Learning and the Com- pute Divide in Artificial Intelligence Research.arXiv preprint arXiv:2010.15581

work page arXiv 2020
[21]

NVIDIA. (2024). NVIDIA Management Library (NVML) Reference Guide. NVIDIA Corporation

work page 2024
[22]

Huang, H., & others. (2020). Analysis of GPU Power Consumption Using Internal Sensors. InProceedings of the 2020 IEEE International Parallel and Dis- tributed Processing Symposium

work page 2020
[23]

Henderson, P., Hu, J., Romoff, J., Brunskill, E., Ju- rafsky, D., & Pineau, J. (2020). Towards the System- atic Reporting of the Energy and Carbon Footprints of Machine Learning.Journal of Machine Learning Research,21

work page 2020
[24]

Kumbhare, N., & others. (2020). Understanding GPU Power: A Survey of Profiling, Modeling, and Simulation Methods.ACM Computing Surveys

work page 2020
[25]

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., & Dean, J. (2023). Efficiently Scaling Transformer In- ference. InProceedings of the Sixth Conference on Machine Learning and Systems (MLSys)

work page 2023
[26]

Schmidt, V., Goyal, K., Joshi, A., Feld, B., Con- nell, L., Laskaris, N., Blank, D., & others. (2021). CodeCarbon: Estimate and Track Carbon Emissions from Machine Learning Computing

work page 2021
[27]

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degen- eration. InInternational Conference on Learning Representations (ICLR)

work page 2020
[28]

Zhang, M., Press, O., Merrill, W., Liu, A., & Smith, N. A. (2023). How language model hallucinations can snowball.arXiv preprint arXiv:2305.13534

work page arXiv 2023
[29]

(2024).The Falcon 3 Family of Open Models

Falcon-LLM Team. (2024).The Falcon 3 Family of Open Models. Technology Innovation Institute. [On- line]. Available: https://huggingface.co/blog/ falcon3

work page 2024
[30]

Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency,

S. Rajput and T. Sharma, “Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency,”ICSA-C, 2024

work page 2024
[31]

Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency,

E. Husomet al., “Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency,” TioT, 2025

work page 2025
[32]

NVIDIA. (2024). NVIDIA Blackwell Ar- chitecture Whitepaper. Retrieved from https://www.nvidia.com/en-us/data- center/blackwell-architecture/ 19 A Appendix Proposition A.1(Scaling Law Divergence).Let SI(θ)be a global utility function of precision p. A Quantization Trap is identified when: ∂SI ∂p >0 (14) signifying a fundamental breakdown of the linear scal...

work page 2024
[33]

In the Low-Precision scenario ( p), the movement energy is lower, but we must add the conversion overhead (ϕ)

Existence of the critical batch thresholdB ∗: In the native scenario ( π), there is zero conversion overhead because the hardware processes the data natively (ϕ = 0). In the Low-Precision scenario ( p), the movement energy is lower, but we must add the conversion overhead (ϕ). We look for the batch sizeB ∗ where: EnergyLow-Bit = EnergyNative Mathematicall...

work page
[34]

At low batch sizes (B < B ∗), the negative magnitude of the casting penalty |∂ϕ/∂p| exceeds the marginal bandwidth savings α/B, causing the gradient to sign-flip

Quantization Trap Gradient:The occurs when ∂E ∂p <0 (see Prop.A.1.) Differentiating the energy functional: ∂E ∂p =K α B + ∂ϕ ∂p (23) In sequential reasoning, the de-quantization function ϕ(h, p) is a step-wise non-increasing function:larger p, less conversion overhead:(∂ϕ/∂p≤0). At low batch sizes (B < B ∗), the negative magnitude of the casting penalty |...

work page
[35]

Thus, accuracy is a strictly non-decreasing function of precision, ∂TSI ∂p >0

Trust (TSI ):Because reasoning is a sequential product of logical transitions P (y) =Q P (hk), quantization noise ϵcompounds at each hop. Thus, accuracy is a strictly non-decreasing function of precision, ∂TSI ∂p >0

work page
[36]

Since SSI is inversely related to energy consumption, ∂SSI ∂p >0

Energy (SSI ):From Theorem A.4, when B < B ∗, the energy-per-query E decreases as precision p increases because the hardware-level casting tax ϕ(h, p) is eliminated. Since SSI is inversely related to energy consumption, ∂SSI ∂p >0

work page
[37]

Restoring precision removes the software-emulated de-quantization bottleneck, increasing throughput such that ∂ESI ∂p >0

Economy (ESI ):In the sequential regime ( B≈ 1), theCasting Overhead Ratio( COR) dominates the execution pipeline. Restoring precision removes the software-emulated de-quantization bottleneck, increasing throughput such that ∂ESI ∂p >0. Since all partial derivatives are positive, the global gradient ∂SI ∂p =P wi ∂vi ∂p > 0. By Proposition A.1, the system ...

work page
[38]

Efficiency Recovery: The hardware casting overhead is monotonically non-increasing, ∂COR ∂B ≤ 0, and energy-per- query decreases, ∂E ∂B <0

work page
[39]

Trust Invariance: The reasoning accuracy is invariant to batch size, ∂T ∂B = 0. Proof. 1. Proving Efficiency Recovery:In current hardware architectures, model weights are loaded and de-quantized once per batch per token. We model the per-hop latency components as: τcomp(p, B) =a comp(p)·B, τ cast(p, B) =a cast(p) (24) where acomp is the compute cost per e...

work page
[40]

Proving Trust Invariance:Reasoning accuracy T (p, B) is determined by the model’s logits, which are a function of the precisionp. For a fixed decoding algorithm, the output ˆyfor an inputxis: ˆy=f(x, p) (27) Since f(x, p) has no dependency on the batch size B (Assumption of Semantic Independence), the prediction for any query in the batch remains identica...

work page