Pith · machine review for the scientific record

arxiv: 2605.10673 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 Lean theorem links

Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords: zeroth-order optimization · quantized adaptation · compander · query geometry · low-bit models · NF4 · LLM fine-tuning · gradient-free optimization

The pith

Aligning zeroth-order queries to the compander grid makes query-time residuals exactly zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantized zeroth-order optimization faces a grid-span mismatch because low-precision rounding distorts the query endpoints in nonuniform codebooks. The paper models this with a compander function φ that transforms the problem to uniform quantization, then constructs aligned one-grid-step stencils in the transformed space. This alignment eliminates the endpoint-rounding residual at query time, whereas generic queries retain a Δ²/μ² term in their convergence bounds. A reader would care because it enables more reliable memory-efficient fine-tuning of quantized large language models without increasing the evaluation budget.
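
A minimal sketch of the quantizer model, assuming a μ-law compander and a 15-level grid (both illustrative stand-ins; the paper's φ and step Δ are not specified in the material above):

```python
import numpy as np

# Sketch of the paper's quantizer model Q = phi^{-1} o U o phi.
# The mu-law compander and the 15-level grid are illustrative stand-ins;
# any strictly monotone phi fits the same construction.
MU = 255.0
DELTA = 2.0 / 15  # uniform grid step in the companded (z) domain

def phi(x):
    # mu-law compander: strictly monotone, finer resolution near zero
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def phi_inv(z):
    # exact inverse of phi
    return np.sign(z) * np.expm1(np.abs(z) * np.log1p(MU)) / MU

def U(z):
    # uniform rounding in the companded domain
    return DELTA * np.round(z / DELTA)

def Q(x):
    # nonuniform quantizer as the composition phi^{-1}(U(phi(x)))
    return phi_inv(U(phi(x)))

# Codewords are inverse images of the uniform grid, so grid-aligned
# queries round to themselves: the property CAQ-ZO builds on.
grid = DELTA * np.arange(-7, 8)
codewords = phi_inv(grid)
assert np.allclose(Q(codewords), codewords)
```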

Core claim

The central discovery is that for a quantizer Q = φ^{-1} ∘ U ∘ φ, forming Rademacher stencils z ± Δr with z = φ(x) and mapping back to x-space via φ^{-1} removes the grid-span mismatch. Theory decomposes the estimator residuals and proves stationarity bounds free of the residual channel that generic off-grid queries exhibit. Experiments on synthetic functions isolate the channel and confirm its absence under CAQ-ZO, while practical NF4 fine-tuning of Qwen and Llama models yields better performance than the unaligned baseline.

What carries the argument

Compander-aligned query (CAQ) geometry: one-grid-step Rademacher stencils built in the uniform transformed domain before inverse companding.
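
The construction is compact enough to sketch. Below is a hedged NumPy rendering of a CAQ-style two-point step as the abstract describes it (snap to the grid, perturb by one grid step, evaluate, update in z); the μ-law φ, step size, and toy loss are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

MU, DELTA = 255.0, 2.0 / 15  # illustrative mu-law compander and grid step

def phi(x):
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def phi_inv(z):
    return np.sign(z) * np.expm1(np.abs(z) * np.log1p(MU)) / MU

def caq_zo_step(f, x, eta, rng):
    # One-grid-step Rademacher stencil in the companded domain.
    z = DELTA * np.round(phi(x) / DELTA)           # snap to the uniform grid
    r = rng.choice([-1.0, 1.0], size=x.shape)      # Rademacher direction
    x_plus = phi_inv(z + DELTA * r)                # both endpoints are exact
    x_minus = phi_inv(z - DELTA * r)               # codewords of Q by construction
    g = (f(x_plus) - f(x_minus)) / (2.0 * DELTA)   # two-point SPSA-style estimate
    return phi_inv(z - eta * g * r)                # update in z, map back to x

rng = np.random.default_rng(0)
f = lambda v: float(np.sum(v ** 2))                # toy smooth loss (assumed)
x = np.full(4, 0.3)
for _ in range(200):
    x = caq_zo_step(f, x, eta=0.05, rng=rng)
print(f"final loss: {f(x):.4f}")
```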

If this is right

  • Generic off-grid queries retain a Δ²/μ² residual channel in stationarity bounds.
  • CAQ-ZO achieves exactly zero query-time residual for the same nonuniform quantizer.
  • The approach improves fine-tuning results for NF4-quantized Qwen and Llama under fixed budget.
  • Query geometry is the key to predicting and controlling ZO behavior in quantized settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This alignment technique may apply to other low-precision derivative-free methods.
  • It underscores the need to match query design to quantization geometry in hardware-constrained optimization.
  • Scalability to larger models and different quantizers remains to be explored in follow-up work.

Load-bearing premise

Nonuniform quantization can be exactly represented as the composition Q = φ^{-1} ∘ U ∘ φ, with the stationarity bounds holding for the NF4 quantizer in the experiments.
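
One hedged way to see why this premise is mild for scalar codebooks: a piecewise-linear φ routed through the codewords reproduces any nearest-codeword quantizer as φ^{-1} ∘ U ∘ φ, since z-space midpoint boundaries map back to x-space midpoints. The NF4-like codebook below is illustrative, not the actual NF4 table:

```python
import numpy as np

# Constructive check of the premise for scalar codebooks: route a
# piecewise-linear phi through (codeword_k, k*Delta); then
# phi^{-1} o U o phi reproduces nearest-codeword quantization exactly.
# The NF4-like codebook below is illustrative, not the actual NF4 table.
codebook = np.tanh(np.linspace(-2.0, 2.0, 16))  # nonuniform, denser near 0
DELTA = 1.0
grid = DELTA * np.arange(len(codebook))

phi = lambda x: np.interp(x, codebook, grid)     # piecewise-linear compander
phi_inv = lambda z: np.interp(z, grid, codebook)
Q = lambda x: phi_inv(DELTA * np.round(phi(x) / DELTA))

def nearest_codeword(x):
    # round-to-nearest against the codebook, midpoint decision boundaries
    return codebook[np.argmin(np.abs(codebook - x[:, None]), axis=1)]

x = np.random.default_rng(1).uniform(codebook[0], codebook[-1], 1000)
assert np.allclose(Q(x), nearest_codeword(x))
```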

What would settle it

A direct measurement of the estimator residual or stationarity gap on a controlled quantized problem, expecting the predicted nonzero channel for off-grid queries and zero for CAQ-ZO.
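
A hedged sketch of that settling experiment under the same illustrative μ-law model: measure the endpoint-rounding residual |Q(y) − y| for generic fixed-radius queries and for compander-aligned ones (the compander, grid step, and radius μ are assumptions):

```python
import numpy as np

# Sketch of the settling experiment: endpoint-rounding residual |Q(y) - y|
# for generic fixed-radius queries versus compander-aligned ones.
# The mu-law phi, grid step, and radius mu are illustrative assumptions.
MU, DELTA, mu = 255.0, 2.0 / 15, 0.02

phi = lambda x: np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
phi_inv = lambda z: np.sign(z) * np.expm1(np.abs(z) * np.log1p(MU)) / MU
Q = lambda x: phi_inv(DELTA * np.round(phi(x) / DELTA))

rng = np.random.default_rng(0)
x = rng.uniform(-0.9, 0.9, 10_000)
r = rng.choice([-1.0, 1.0], size=x.shape)

y_generic = x + mu * r                            # fixed weight-space radius
z = DELTA * np.round(phi(x) / DELTA)
y_caq = phi_inv(z + DELTA * r)                    # one grid step in z-space

print(f"generic mean residual: {np.abs(Q(y_generic) - y_generic).mean():.3e}")
print(f"CAQ mean residual:     {np.abs(Q(y_caq) - y_caq).mean():.3e}")
# Expected: the generic residual is on the order of the local cell width,
# while the CAQ residual is zero up to floating-point error.
```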

Figures

Figures reproduced from arXiv: 2605.10673 by Yao Shu, Zilin Zhu.

Figure 1. How grid-span mismatch becomes a ZO measurement distortion.

Figure 2. Start-matched synthetic convergence under matched nonuniform low-bit forward evaluation.

Figure 3. Query-time estimator residual under the shared synthetic setting.
Original abstract

Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as $Q = \phi^{-1} \circ U \circ \phi$. CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils $z \pm \Delta r$ in $z = \phi(x)$, maps endpoints back through $\phi^{-1}$, and updates in $z$. Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a $\Delta^2/\mu^2$ residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.
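
The abstract's grid-span mismatch is easy to exhibit numerically. Under an illustrative μ-law compander (an assumption, not the paper's φ), the number of companded grid cells spanned by a fixed weight-space radius μ varies sharply with the base point:

```python
import numpy as np

# The abstract's grid-span mismatch, numerically: how many companded grid
# cells a fixed weight-space radius mu covers at different base points.
# The mu-law phi and step DELTA are illustrative stand-ins for the paper's.
MU, DELTA, mu = 255.0, 2.0 / 15, 0.02
phi = lambda x: np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

for x0 in (0.01, 0.1, 0.5):
    span = abs(phi(x0 + mu) - phi(x0 - mu)) / DELTA
    print(f"x0 = {x0:>4}: interval [x0-mu, x0+mu] spans {span:4.1f} grid cells")
# The same mu covers several cells at one base point and under one cell at
# another, so rounding can erase or redirect the perturbation entirely.
```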

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CAQ-ZO for quantized zeroth-order optimization. It models scalar nonuniform quantization exactly as the composition Q = φ^{-1} ∘ U ∘ φ, identifies a grid-span mismatch in query geometry, decomposes endpoint-rounding residuals in the finite-difference estimator, and derives stationarity bounds showing that generic off-grid queries retain a Δ²/μ² residual channel while CAQ-ZO (one-grid-step Rademacher stencils in the companded domain) makes the query-time residual exactly zero. Synthetic experiments isolate the channel and matched NF4 fine-tuning on Qwen/Llama models reports gains over the quantized baseline under fixed quantizer and evaluation budget.

Significance. If the central claims hold, the work supplies a principled, low-overhead correction for quantization-induced bias in ZO gradient estimates that is directly applicable to memory-efficient adaptation of large models. The explicit decomposition of residuals and the parameter-free zero-residual guarantee under the stated model are technically clean contributions; the real-model NF4 experiments add practical weight. The approach could inform the design of future quantized ZO and related low-precision optimizers.

major comments (2)
  1. [Theory (stationarity bounds derivation)] The stationarity bounds and the claim that CAQ-ZO achieves exactly zero query-time residual are derived under the exact representation Q = φ^{-1} ∘ U ∘ φ. Practical NF4 (with block scaling, clipping, and non-ideal rounding) may introduce additional unmodeled terms in the finite-difference estimator; the manuscript must either prove that these terms remain negligible or bound their effect on the residual channel, as this assumption is load-bearing for the 'exactly zero' result.
  2. [Experiments] The experimental section reports that synthetic runs isolate the residual channel and that NF4 Qwen/Llama fine-tuning shows gains, yet the provided description lacks explicit error-bar statistics, number of independent runs, and data-exclusion criteria. Without these, it is impossible to confirm that the observed improvements are statistically robust and not sensitive to post-hoc choices.
minor comments (1)
  1. [Abstract and §2] Notation for the compander φ and the grid step Δ should be introduced with a single forward reference to the model equation to avoid repeated re-definition.
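
To make the first major comment concrete, here is a minimal sketch of the block-scaling step that sits outside the scalar model; the codebook and block size are illustrative, not the exact NF4 specification:

```python
import numpy as np

# Minimal sketch of the block-scaling step that practical NF4-style
# quantizers apply outside the scalar model Q = phi^{-1} o U o phi.
# Codebook and block size are illustrative, not the NF4 specification.
codebook = np.tanh(np.linspace(-2.0, 2.0, 16))

def block_quantize(w, block_size=64):
    out = np.empty_like(w)
    for i in range(0, len(w), block_size):
        block = w[i:i + block_size]
        scale = np.abs(block).max() or 1.0  # per-block absmax scale
        idx = np.argmin(np.abs(codebook - (block / scale)[:, None]), axis=1)
        out[i:i + block_size] = scale * codebook[idx]
    return out

# Perturbing one coordinate can change its block's absmax and thereby move
# every quantized value in the block: a coupling the scalar model omits.
w = np.random.default_rng(2).normal(0.0, 0.1, 256)
w_pert = w.copy()
w_pert[0] += 0.5
changed = np.sum(block_quantize(w) != block_quantize(w_pert))
print(f"{changed} coordinates changed after perturbing one")  # typically > 1
```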

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of the theoretical assumptions and experimental reporting. We address each major comment below, indicating the revisions we plan to incorporate.

point-by-point responses
  1. Referee: [Theory (stationarity bounds derivation)] The stationarity bounds and the claim that CAQ-ZO achieves exactly zero query-time residual are derived under the exact representation Q = φ^{-1} ∘ U ∘ φ. Practical NF4 (with block scaling, clipping, and non-ideal rounding) may introduce additional unmodeled terms in the finite-difference estimator; the manuscript must either prove that these terms remain negligible or bound their effect on the residual channel, as this assumption is load-bearing for the 'exactly zero' result.

    Authors: We agree that the stationarity bounds and the exact-zero residual guarantee are derived under the idealized model Q = φ^{-1} ∘ U ∘ φ, which captures the core nonuniform companding behavior. The manuscript already notes that this is an exact representation for the scalar quantizer without block scaling. For practical NF4, block-wise scaling, clipping, and non-ideal rounding introduce secondary perturbations. Our synthetic experiments isolate the grid-span mismatch residual under the model, while the NF4 fine-tuning results on Qwen and Llama demonstrate that CAQ-ZO still yields measurable gains over the quantized baseline under identical quantizer and budget. In the revision we will add a new subsection that (i) explicitly states the scope of the idealized model, (ii) derives a first-order bound showing that the additional residual terms from block scaling and clipping contribute at most O(Δ) to the estimator (rather than inflating the Δ²/μ² channel), and (iii) reports an empirical ablation on a small model confirming that these terms remain small relative to the compander-induced residual for typical NF4 block sizes. This addresses the load-bearing nature of the assumption without overstating the guarantee. revision: partial

  2. Referee: [Experiments] The experimental section reports that synthetic runs isolate the residual channel and that NF4 Qwen/Llama fine-tuning shows gains, yet the provided description lacks explicit error-bar statistics, number of independent runs, and data-exclusion criteria. Without these, it is impossible to confirm that the observed improvements are statistically robust and not sensitive to post-hoc choices.

    Authors: We acknowledge that the current manuscript does not report error bars, the number of independent runs, or data-exclusion criteria, which limits assessment of statistical robustness. In the revised version we will expand the experimental section to include: results averaged over 5 independent runs with standard-error bars for both synthetic and NF4 fine-tuning experiments; explicit statement that no data points or runs were excluded; and the random seeds used for reproducibility. The synthetic isolation experiments will additionally report variance across multiple random quantization grids. These additions will make the statistical claims verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation follows directly from explicit model and standard assumptions

full rationale

The paper adopts the quantization representation Q = φ^{-1} ∘ U ∘ φ as an explicit modeling assumption and derives the grid-span mismatch, residual decomposition, and stationarity bounds (including the Δ²/μ² channel for off-grid queries and exact zero for CAQ-ZO) from this model combined with standard ZO finite-difference analysis. CAQ-ZO is defined to place stencils on the uniform grid in the companded space z = φ(x), so the zero-residual property holds by direct substitution into the model rather than by fitting or self-referential closure. No load-bearing self-citations, no parameters fitted to data then relabeled as predictions, and no uniqueness theorems imported from prior author work. Synthetic experiments isolate the modeled channel while NF4 runs use the same quantizer family as the assumption; the chain is self-contained against the stated model.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; the central modeling choice is the compander representation of quantization, treated here as a domain assumption rather than a fitted parameter.

axioms (1)
  • domain assumption Nonuniform quantization is exactly representable as Q = φ^{-1} ∘ U ∘ φ for some compander φ
    Invoked to derive the grid-span mismatch and residual decomposition for off-grid versus aligned queries.

pith-pipeline@v0.9.0 · 5571 in / 1322 out tokens · 44603 ms · 2026-05-12T03:48:05.307728+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1]

    ZOQO: Zero-order quantized optimization

    Noga Bar and Raja Giryes. ZOQO: Zero-order quantized optimization. In Proc. ICASSP, 2025

  2. [2]

    Low-rank quantization-aware training for LLMs

    Yelysei Bondarenko, Riccardo Del Chiaro, and Markus Nagel. Low-rank quantization-aware training for LLMs. arXiv:2406.06385, arXiv, 2024

  3. [3]

    EfficientQAT: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. In Proc. ACL, 2025

  4. [4]

    Test-time model adaptation for quantized neural networks

    Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, and Mingkui Tan. Test-time model adaptation for quantized neural networks. In Proc. ACM MM, 2025

  5. [5]

    QLoRA: Efficient finetuning of quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Proc. NeurIPS, 2023

  6. [6]

    Optimal rates for zero-order convex optimization: The power of two function evaluations

    John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Trans. Inf. Theory, 61(5):2788–2806, 2015

  7. [7]

    Stepping forward on the last mile

    Chen Feng, Shaojie Zhuo, Xiaopeng Zhang, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, and Andrew Zou Li. Stepping forward on the last mile. In Proc. NeurIPS, 2024

  8. [8]

    Stochastic zeroth-order gradient and Hessian estimators: Variance reduction and refined bias bounds

    Yasong Feng and Tianyu Wang. Stochastic zeroth-order gradient and Hessian estimators: Variance reduction and refined bias bounds. Inf. Inference, 12(3):1514–1545, 2023

  9. [9]

    OPTQ: Accurate quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. In Proc. ICLR, 2023

  10. [10]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013

  11. [11]

    Quantization

    Robert M. Gray and David L. Neuhoff. Quantization. IEEE Trans. Inf. Theory, 44(6):2325–2383, 1998

  12. [12]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. ICLR, 2022

  13. [13]

    G.711: Pulse code modulation (PCM) of voice frequencies

    ITU-T. G.711: Pulse code modulation (PCM) of voice frequencies. Recommendation ITU-T G.711, International Telecommunication Union, 1988

  14. [14]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. CVPR, 2018

  15. [15]

    Digital Coding of Waveforms: Principles and Applications to Speech and Video

    N. S. Jayant and Peter Noll. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, 1984

  16. [16]

    LoftQ: LoRA-fine-tuning-aware quantization for large language models

    Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, and Tuo Zhao. LoftQ: LoRA-fine-tuning-aware quantization for large language models. In Proc. ICLR, 2024

  17. [17]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proc. MLSys, 2024

  18. [18]

    Least squares quantization in PCM

    Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–137, 1982

  19. [19]

    Fine-tuning language models with just forward passes

    Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Proc. NeurIPS, 2023

  20. [20]

    Quantizing for minimum distortion

    Joel Max. Quantizing for minimum distortion. IRE Trans. Inf. Theory, 6(1):7–12, 1960

  21. [21]

    Random gradient-free minimization of convex functions

    Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, 2017

  22. [22]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report. arXiv:2412.15115, arXiv, 2025

  23. [23]

    An optimal algorithm for bandit and zero-order convex optimization with two-point feedback

    Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. JMLR, 18(52):1–11, 2017

  24. [24]

    Fine-tuning quantized neural networks with zeroth-order optimization

    Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, and Kaiyang Zhou. Fine-tuning quantized neural networks with zeroth-order optimization. In Proc. ICLR, 2026

  25. [25]

    Multivariate stochastic approximation using a simultaneous perturbation gradient approximation

    James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Control, 37(3):332–341, 1992

  26. [26]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  27. [27]

    Quantized evolution strategies: High-precision fine-tuning of quantized LLMs at low-precision cost

    Yinggan Xu, Risto Miikkulainen, and Xin Qiu. Quantized evolution strategies: High-precision fine-tuning of quantized LLMs at low-precision cost. arXiv:2602.03120, arXiv, 2026

  28. [28]

    Poor man’s training on MCUs: A memory-efficient quantized back-propagation-free approach

    Yequan Zhao, Hai Li, Ian Young, and Zheng Zhang. Poor man’s training on MCUs: A memory-efficient quantized back-propagation-free approach. arXiv:2411.05873, arXiv, 2024

  29. [29]

    QuZO: Quantized zeroth-order fine-tuning for large language models

    Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, and Zheng Zhang. QuZO: Quantized zeroth-order fine-tuning for large language models. In Proc. EMNLP, 2025