Adaptive directional gradients for parameterised quantum circuits

Brian Coyle; El Amine Cherrat; Elham Kashefi; Snehal Raj; Virag Umathe

arxiv: 2606.09734 · v1 · pith:JCSONW3Tnew · submitted 2026-06-08 · 🪐 quant-ph · cs.LG

Pith reviewed 2026-06-27 16:29 UTC · model grok-4.3

keywords: parameterised quantum circuits · forward gradients · gradient estimation · variational quantum algorithms · measurement cost · SPSA · parameter-shift rule · adaptive optimization

The pith

Forward-mode directional derivatives yield unbiased gradient estimates for parameterised quantum circuits at tunable measurement cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that an unbiased estimator of the gradient for parameterised quantum circuits can be obtained by averaging any number of random directional derivatives computed via the forward mode. This construction requires no ancilla qubits or controlled gates and recovers the parameter-shift rule, SPSA, and random coordinate descent as special cases when the number of directions is varied. A sympathetic reader would care because the dominant cost in hardware training of such circuits is the number of measurements needed for gradients, and the new estimator allows that cost to be traded against variance in a controlled way. From the same second-moment analysis the authors derive an adaptive optimiser, QUIVER, whose shot allocation follows from a closed-form minimum-cost rule. Numerical experiments then show that circuits with up to 1770 parameters on 60 qubits can be trained orders of magnitude more efficiently than with the parameter-shift rule alone.

Core claim

A framework of forward gradient estimators for PQCs, based on the forward mode of automatic differentiation, yields an unbiased estimator of the gradient by averaging a freely tunable number of random directional derivatives and recovers SPSA, random coordinate descent, and the parameter-shift rule as limiting cases, with no ancilla qubits or controlled-gate overhead. Stochastic quantum forward gradient descent converges under standard assumptions, with an explicit second-moment expansion that interpolates between the single-direction extreme of SPSA and the full-gradient extreme of parameter-shift. Within this framework the authors derive QUIVER, an adaptive optimiser whose update rule follows from a closed-form minimum measurement-cost allocation.

What carries the argument

The stochastic forward gradient estimator obtained by averaging a tunable number of random directional derivatives of the circuit output expectation value.
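As a rough illustration of the estimator described here, the sketch below averages `num_directions` random directional derivatives of a toy cost. The cost `toy_cost`, the choice of Rademacher (±1) directions, and the central finite difference with step `eps` are assumptions made for illustration and are not taken from the paper; on hardware each evaluation of `f` would be a measured circuit expectation value.

```python
import numpy as np

def toy_cost(theta):
    # Hypothetical stand-in for a circuit expectation value (not from the paper).
    return float(np.sum(np.cos(theta)))

def directional_derivative(f, theta, v, eps=1e-4):
    # Central finite difference along v: two evaluations of f per direction.
    return (f(theta + eps * v) - f(theta - eps * v)) / (2.0 * eps)

def forward_gradient(f, theta, num_directions, rng):
    # Average of (v . grad f) * v over random Rademacher directions v.
    estimate = np.zeros_like(theta)
    for _ in range(num_directions):
        v = rng.choice([-1.0, 1.0], size=theta.shape)
        estimate += directional_derivative(f, theta, v) * v
    return estimate / num_directions

rng = np.random.default_rng(0)
theta = np.array([0.3, -1.2, 0.7, 2.0])
exact = -np.sin(theta)  # analytic gradient of toy_cost
estimate = forward_gradient(toy_cost, theta, num_directions=20000, rng=rng)
print(np.round(estimate - exact, 3))
```

With a single direction this reduces to an SPSA-style update; increasing `num_directions` lowers the variance at the price of more evaluations of `f`.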

If this is right

Stochastic forward gradient descent converges under the same assumptions used for classical SGD.
The variance of the estimator interpolates continuously between the SPSA and parameter-shift extremes.
QUIVER's closed-form shot allocation minimises total measurement cost for a target variance.
Circuits with 60 qubits and 1770 parameters train orders of magnitude faster than with the parameter-shift rule.
QUIVER outperforms iCANS and gCANS on QAOA and VQE benchmark problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same directional-derivative construction could be applied to any variational quantum algorithm whose cost function is an expectation value.
Because the method is ancilla-free it may combine directly with existing error-mitigation protocols without increasing circuit depth.
At large parameter counts the optimal number of directions may become a hyper-parameter that itself needs adaptive tuning.
If the variance model holds, similar cost-optimal allocation rules could be derived for other stochastic estimators used in quantum machine learning.

Load-bearing premise

That the second-moment expansion of the directional-derivative estimator correctly predicts variance under the measurement-cost model used to derive QUIVER's allocation rule.

What would settle it

Compute the empirical bias of the averaged directional-derivative estimator on a single-parameter circuit whose analytic gradient is known; the bias should be consistent with zero, within statistical error, for any finite number of directions.
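A check of this kind could look roughly like the sketch below. It assumes a single-qubit circuit RX(θ) applied to |0⟩ and measured in Z, for which the expectation value is cos θ and the gradient is −sin θ. Because this cost has a single frequency, the sketch uses an exact two-term shift along the random direction instead of a small-step finite difference; that choice, along with every function name and shot count, is an illustrative assumption and not the paper's protocol.

```python
import numpy as np

def sample_expectation(theta, shots, rng):
    # <Z> after RX(theta) on |0> equals cos(theta); outcomes are +1 or -1.
    p_plus = 0.5 * (1.0 + np.cos(theta))
    n_plus = rng.binomial(shots, p_plus)
    return (2.0 * n_plus - shots) / shots

def noisy_forward_gradient(theta, num_directions, shots, rng):
    # Average over random signs v of (shot-noisy directional derivative) * v.
    total = 0.0
    for _ in range(num_directions):
        v = rng.choice([-1.0, 1.0])
        plus = sample_expectation(theta + 0.5 * np.pi * v, shots, rng)
        minus = sample_expectation(theta - 0.5 * np.pi * v, shots, rng)
        total += 0.5 * (plus - minus) * v
    return total / num_directions

def empirical_bias(theta, num_directions, shots, trials, rng):
    # Mean estimate over many independent trials minus the analytic gradient.
    estimates = [noisy_forward_gradient(theta, num_directions, shots, rng)
                 for _ in range(trials)]
    return float(np.mean(estimates) + np.sin(theta))

rng = np.random.default_rng(1)
biases = {V: empirical_bias(0.8, V, shots=50, trials=4000, rng=rng)
          for V in (1, 2, 8)}
print(biases)
```

In one dimension the random direction is only a sign, so this toy case mainly probes the shot-noise part of the estimator; a multi-parameter circuit would be needed to exercise the averaging over directions.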

Figures

Figures reproduced from arXiv: 2606.09734 by Brian Coyle, El Amine Cherrat, Elham Kashefi, Snehal Raj, Virag Umathe.

Figures 2, 3, 5, 6, 7, 8, 10, 11, 12 and 13 appear in the source as images only; no caption text is available here.

Original abstract

Training parameterised quantum circuits (PQCs) on quantum hardware is bottlenecked by the measurement cost of gradient estimation, which under the parameter-shift rule scales linearly in the number of trainable parameters and dominates the total shot budget of training at scale. In this work, we propose a framework of forward gradient estimators for PQCs, based on the forward mode of automatic differentiation, that yields an unbiased estimator of the gradient by averaging a freely tunable number of random directional derivatives and recovers SPSA, random coordinate descent, and the parameter-shift rule as limiting cases, with no ancilla qubits or controlled-gate overhead. We prove that stochastic quantum forward gradient descent converges under standard assumptions, with an explicit second-moment expansion that interpolates between the single-direction extreme of SPSA and the full-gradient extreme of parameter-shift. Within this framework we derive QUIVER (Quantum Iterative V-adaptive Estimator Rule), an adaptive optimiser for parameterised circuits whose update rule follows from a closed-form minimum measurement-cost allocation. We show numerically that forward gradients train Hamming-weight-preserving orthogonal quantum neural networks with up to 60 qubits and 1770 parameters on the ECG5000 and MNIST datasets orders of magnitude more efficiently than the parameter-shift rule. We also demonstrate that our proposed QUIVER optimiser can outperform iCANS and gCANS measurement-frugal optimisers on optimisation problems using the quantum approximate optimisation algorithm and quantum simulation with the variational quantum eigensolver.
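To make the scaling in the abstract concrete, here is a small back-of-the-envelope sketch. It assumes the two-term parameter-shift rule (two shifted circuits per parameter) and two circuit evaluations per random direction for the forward gradient. Both counts are illustrative assumptions, the function name is hypothetical, and shots per evaluation and estimator variance are ignored.

```python
def circuit_evaluations_per_gradient(num_parameters, num_directions=None):
    # Assumed two-term parameter shift: two shifted circuits per parameter.
    if num_directions is None:
        return 2 * num_parameters
    # Assumed forward gradient cost: two evaluations per random direction.
    return 2 * num_directions

N = 1770  # parameter count of the largest circuit reported in the abstract
parameter_shift = circuit_evaluations_per_gradient(N)
forward_v10 = circuit_evaluations_per_gradient(N, num_directions=10)
print(parameter_shift, forward_v10)
```

Fewer directions mean fewer distinct circuits per step but a noisier gradient estimate, which is the trade-off the adaptive rule is meant to manage.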

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The forward-mode random-direction estimator is a clean unification that recovers SPSA and parameter-shift as limits, but QUIVER's efficiency edge rests on a variance model whose accuracy for general PQCs is not fully checked.

The paper's core contribution is a forward-mode estimator that averages a tunable number of random directional derivatives to obtain an unbiased gradient estimate for PQCs. It needs no ancilla qubits or controlled operations and recovers the usual methods as special cases. The authors also give a convergence proof for stochastic forward gradient descent under standard assumptions and an explicit second-moment expansion.

The new piece is the derivation of QUIVER, an adaptive rule that chooses how many directions to sample by minimizing a closed-form measurement-cost expression. The 60-qubit experiments on ECG5000 and MNIST, plus the QAOA and VQE tests against iCANS and gCANS, show clear practical gains over plain parameter-shift.

The unbiased estimator and the convergence result stand on their own. The numerics are large enough to be worth attention.

The soft spot is the measurement-cost model used to derive QUIVER. It assumes a specific interpolation in the second-moment expansion between the single-direction and full-basis extremes. If real circuit noise, parameter correlations, or the multi-frequency behavior of the cost function deviate from that scaling, the allocation rule is no longer guaranteed to be optimal. The paper does not appear to include a direct check of how sensitive the reported savings are to violations of that model.

This work is aimed at people training PQCs at scale who care about shot budgets. A reader focused on variational optimization methods would find the framework and the scale of the tests useful. The combination of a proof, an explicit adaptive rule, and 60-qubit results is enough to merit a serious referee, even if the variance-model dependence needs closer scrutiny in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a framework of forward gradient estimators for parameterised quantum circuits based on forward-mode automatic differentiation. It constructs an unbiased gradient estimator by averaging a tunable number of random directional derivatives, recovering SPSA, random coordinate descent, and the parameter-shift rule as limiting cases without ancilla qubits or controlled gates. The authors prove convergence of stochastic forward gradient descent under standard assumptions, supply an explicit second-moment expansion of the estimator, and derive the QUIVER adaptive optimizer from a closed-form minimum-measurement-cost allocation rule. Large-scale numerical results are presented for training up to 60-qubit, 1770-parameter Hamming-weight-preserving orthogonal quantum neural networks on ECG5000 and MNIST, as well as comparisons on QAOA and VQE problems against iCANS and gCANS.

Significance. If the central claims hold, the work provides a tunable, ancilla-free alternative to the parameter-shift rule that can substantially reduce measurement overhead for large PQCs. The explicit convergence proof for stochastic quantum forward gradient descent and the large-scale numerical demonstrations on circuits with 1770 parameters constitute clear strengths. The QUIVER rule offers a principled adaptive strategy whose practical advantage, however, is tied to the validity of the underlying variance model.

major comments (1)

[section deriving the QUIVER allocation rule and second-moment expansion] The second-moment expansion used to derive the closed-form QUIVER allocation rule assumes a specific measurement-cost model under which the variance interpolates between the SPSA (single-direction) and parameter-shift (full-basis) extremes. For general PQCs this scaling may be violated by circuit-specific correlations, non-independent shot noise, or the multi-frequency dependence of f(θ + t v) when v is non-coordinate; in that case the derived allocation ceases to be optimal and the headline measurement-efficiency claims for QUIVER no longer follow. This assumption is load-bearing for the adaptive optimizer and the numerical advantage reported in the experiments.

minor comments (1)

[Abstract and numerical experiments] The abstract and experimental sections report large efficiency gains but omit error bars, dataset splits, and explicit variance-model parameters; adding these would strengthen verifiability of the comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the key assumptions in the QUIVER derivation. We respond to the major comment below.

Referee: The second-moment expansion used to derive the closed-form QUIVER allocation rule assumes a specific measurement-cost model under which the variance interpolates between the SPSA (single-direction) and parameter-shift (full-basis) extremes. For general PQCs this scaling may be violated by circuit-specific correlations, non-independent shot noise, or the multi-frequency dependence of f(θ + t v) when v is non-coordinate; in that case the derived allocation ceases to be optimal and the headline measurement-efficiency claims for QUIVER no longer follow. This assumption is load-bearing for the adaptive optimizer and the numerical advantage reported in the experiments.

Authors: The second-moment expansion is derived under the explicit assumption of independent additive shot noise with variance scaling as 1/M per direction. This produces the stated interpolation and the closed-form allocation. We agree that circuit-specific correlations, non-independent noise, or multi-frequency effects in non-coordinate directions can violate the model, rendering the allocation suboptimal in those cases. The estimator itself remains unbiased for any choice of directions. The reported numerical advantages are observed on the specific circuits tested (Hamming-weight-preserving QNNs, QAOA, VQE). We will revise the manuscript to state the variance-model assumptions more prominently, add a limitations paragraph discussing potential violations, and qualify the optimality claims for general PQCs. Revision status: partial.
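The 1/M scaling invoked in the rebuttal can be illustrated for the simplest case of independent ±1 measurement outcomes. The sketch below is a toy simulation under that assumption only; the expectation value 0.3 and the function name are hypothetical, and it says nothing about correlated or device noise.

```python
import numpy as np

def shot_noise_variance(expectation, shots, trials, rng):
    # Variance of the mean of `shots` independent +/-1 outcomes.
    p_plus = 0.5 * (1.0 + expectation)
    n_plus = rng.binomial(shots, p_plus, size=trials)
    means = (2.0 * n_plus - shots) / shots
    return float(np.var(means))

rng = np.random.default_rng(3)
# M * Var should stay near 1 - expectation**2 if the variance scales as 1/M.
scaled = {M: M * shot_noise_variance(0.3, M, trials=200000, rng=rng)
          for M in (10, 100, 1000)}
print(scaled)
```

If the rescaled values drifted with M on real data, that would be one sign that the independence assumption behind the allocation rule is being violated.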

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The core estimator is obtained directly from forward-mode automatic differentiation and is unbiased by construction. The second-moment expansion is stated to be explicit and derived from the estimator itself, interpolating between known limits. QUIVER follows from a closed-form allocation rule under an explicitly stated measurement-cost model; this is a derivation under assumptions rather than a reduction of the result to its inputs by definition or by fitting. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are merely renamed. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework relies on standard stochastic-gradient convergence assumptions and an explicit measurement-cost model whose parameters enter the QUIVER allocation; no new physical entities are introduced.

free parameters (1)

number of random directions
Tunable integer that controls the variance of the gradient estimator, which is unbiased for any number of directions, and directly affects total measurement cost.

axioms (2)

domain assumption: Standard assumptions for convergence of stochastic gradient descent
Invoked to prove convergence of stochastic quantum forward gradient descent with the stated second-moment expansion.
domain assumption: Measurement cost is linear in the number of directional derivative estimates and independent of circuit depth
Underlies the closed-form minimum-cost allocation rule for QUIVER.

Appendix A: Unbiasedness of the forward gradient estimator

We prove that the $V$-direction, $M$-shot forward gradient estimator, eq. (A2), is unbiased in the $\varepsilon \to 0$ limit, adapting the cla...

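If $\mathbb{E}[\|g^{(t)}(\theta)\|^2] \le \gamma^2$ for all $\theta, t$ and $\eta \in [0, 1/(2\mu)]$, then

$$\mathbb{E}[f(\theta^{(T)})] - f(\theta^\star) \le (1 - 2\mu\eta)^T \left( f(\theta^{(0)}) - f(\theta^\star) \right) + \frac{L \eta \gamma^2}{4\mu}. \tag{D5}$$
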
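If $\mathbb{E}[\|g^{(t)}(\theta)\|^2] \le \beta^2 \|\nabla f(\theta)\|^2$ for all $\theta, t$ and $\eta = 1/(L\beta^2)$, then

$$\mathbb{E}[f(\theta^{(T)})] - f(\theta^\star) \le \left( 1 - \frac{\mu}{L\beta^2} \right)^T \left( f(\theta^{(0)}) - f(\theta^\star) \right). \tag{D6}$$

Proof of Proposition 4. Part (i). By App. A, $\mathbb{E}[\tilde{g}_F(\theta)] = \nabla f(\theta)$, so the estimator is unbiased. By Lemma 4 with $\kappa = 1$ (Rademacher),

$$\mathbb{E}\,\|\tilde{g}_F\|^2 = \frac{N + V - 1}{V} \|\nabla f\|^2 =: \beta^2 \|\nabla f\|^2.$$

This is the bounded relative second moment condition of part 2 of Lemma 5. Set...
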
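The second-moment identity quoted for Rademacher directions ($\kappa = 1$) can be checked numerically in the noiseless case. The following Monte Carlo sketch assumes exact directional derivatives $v \cdot \nabla f$ (no shot noise) and an arbitrary, randomly drawn gradient vector; the function name and all numerical values are hypothetical.

```python
import numpy as np

def second_moment_ratio(grad, num_directions, trials, rng):
    n = grad.size
    # Rademacher directions, shape (trials, V, N).
    v = rng.choice([-1.0, 1.0], size=(trials, num_directions, n))
    # Exact directional derivatives v . grad, shape (trials, V).
    d = v @ grad
    # Forward gradient estimates (1/V) * sum_l d_l v_l, shape (trials, N).
    est = np.einsum("tv,tvn->tn", d, v) / num_directions
    # Monte Carlo estimate of E||est||^2 divided by ||grad||^2.
    return float(np.mean(np.sum(est**2, axis=1)) / np.sum(grad**2))

rng = np.random.default_rng(2)
grad = rng.normal(size=6)  # stand-in gradient, N = 6
N = grad.size
ratios = {V: second_moment_ratio(grad, V, trials=50000, rng=rng)
          for V in (1, 3, 12)}
predicted = {V: (N + V - 1) / V for V in (1, 3, 12)}
print(ratios, predicted)
```

The ratio equals N at V = 1 and approaches 1 as V grows, which is the interpolation between the single-direction and many-direction regimes that the text describes.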
Variance decomposition over measurements

To express the gain in a form where each random direction contributes a separate signal and noise term, we decompose the measurement-side expectation of the directional-derivative variance.

Lemma 6 (Variance-with-measurement decomposition). With unbiased single-shot estimators $\mathbb{E}_m[\tilde{\nabla}_{v_\ell} L_m] = \nabla_{v_\ell} L$ and i.i.d. measurement trials, $\mathbb{E}_m[\mathrm{Var}_v$...

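Per-direction gain and learning-rate criterion

Substituting Lemma 1 and Lemma 6 into eq. (E1):

$$\begin{aligned}
\mathbb{E}[G_F] &= \eta \|\nabla L\|^2 - \frac{L\eta^2}{2}\, \mathbb{E}\left[ \|\tilde{\nabla}_F L\|^2 \right] \\
&\approx \eta \|\nabla L\|^2 - \frac{L\eta^2}{2} \cdot \frac{N + V + \kappa - 2}{V} \cdot \frac{1}{V} \sum_{\ell=1}^{V} \left[ (\nabla_{v_\ell} L)^2 + \frac{\mathrm{Var}_m[\tilde{\nabla}_{v_\ell} L_m]}{M} \right] \\
&= \frac{1}{V} \sum_{\ell=1}^{V} \underbrace{\left\{ \eta \|\nabla L\|^2 - \frac{L\eta^2}{2}\, \frac{N + V + \kappa - 2}{V} \left[ (\nabla_{v_\ell} L)^2 + \frac{\mathrm{Var}_m[\tilde{\nabla}_{v_\ell} L_m]}{M} \right] \right\}}_{=:\, \gamma_{v_\ell}},
\end{aligned}$$

where the second line uses Lemma 1 for the second-moment term and Lemma ...
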
Optimal per-direction shot allocation

Allowing the number of shots to depend on the direction, $M \to M_\ell$, the per-direction gain $\gamma_{v_\ell}$ from eq. (E3) becomes a function of $M_\ell$ alone. Maximising the gain-per-shot $\gamma_{v_\ell} / M_\ell$ over $M_\ell$ and rearranging yields the optimal per-direction allocation referenced from Section VII A.

Lemma 7 (Optimal per-direction shot allocation). Le...

Fixed-$M$ optimal $V$

Under Assumption 1 the per-direction measurement variance concentrates, $\mathrm{Var}_m[\tilde{\nabla}_{v_\ell} L_m] \approx \bar{\sigma}^2_\nabla$ for all $\ell$. For isotropic zero-mean unit-variance directions, $\mathbb{E}_v[(\nabla_{v_\ell} L)^2] = \|\nabla L\|^2$. Taking this expectation in the per-direction gain eq. (E3) and averaging over the $V$ directions:

$$\mathbb{E}[G_F] \approx \eta \|\nabla L\|^2 - \frac{L\eta^2}{2}\, \frac{N + V + \kappa - 2}{V} \left( \|\nabla L\|^2 + \frac{\bar{\sigma}^2_\nabla}{M} \right). \tag{E6}$$

With $M$ fixed, the ga...


Proof of Theorem 1: optimal allocation

Setup. The MSE of the Rademacher forward gradient estimator with $V$ directions and $M$ shots per direction (eq. (35)) is
$$\mathrm{MSE}(V, M) = \frac{(N-1)\|g\|^2}{V} + \frac{N\bar\sigma_\nabla^2}{VM}, \tag{I4}$$
and the cost-minimisation problem is
$$\min_{V,\,M>0}\; 2VM \quad \text{s.t.} \quad \mathrm{MSE}(V, M) \le \tau^2,\quad M \ge M_{\min}. \tag{I5}$$
The proof has five steps.

Proof. 1. Eliminate $V$. The MSE constraint in eq. (I5) ...
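The structure of problem (I5) can be explored numerically before following the proof. The sketch below is a continuous relaxation written for illustration, not a reproduction of the five-step argument: it ignores integrality of $V$ and $M$ and any upper bound on $V$, the function names are hypothetical, and the numerical values are made up. Solving the MSE constraint of eq. (I4) with equality gives $V = [(N-1)\|g\|^2 + N\bar\sigma_\nabla^2/M]/\tau^2$, so the cost $2VM$ becomes a function of $M$ alone.

```python
def mse(V, M, N, grad_sq, sigma_sq):
    """Eq. (I4): MSE of the Rademacher forward gradient estimator."""
    return (N - 1) * grad_sq / V + N * sigma_sq / (V * M)


def directions_needed(M, N, grad_sq, sigma_sq, tau):
    """Real-valued V that meets MSE(V, M) = tau^2 exactly (relaxation)."""
    return ((N - 1) * grad_sq + N * sigma_sq / M) / tau ** 2


def total_cost(M, N, grad_sq, sigma_sq, tau):
    """Objective 2 V M of eq. (I5) after eliminating V."""
    return 2.0 * directions_needed(M, N, grad_sq, sigma_sq, tau) * M


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the paper).
    N, grad_sq, sigma_sq, tau, M_min = 1770, 1.0, 4.0, 0.5, 10
    for M in (M_min, 2 * M_min, 5 * M_min, 10 * M_min):
        V = directions_needed(M, N, grad_sq, sigma_sq, tau)
        print(M, round(V, 1), total_cost(M, N, grad_sq, sigma_sq, tau))
```

Under this relaxation the eliminated cost is $2[(N-1)\|g\|^2 M + N\bar\sigma_\nabla^2]/\tau^2$, which is increasing in $M$, so a grid scan places the minimum at $M = M_{\min}$; whether additional constraints in the paper modify this conclusion cannot be determined from the excerpt.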

Proof of Corollary 1: CRB-level optimality

The proof has two parts. First we establish the Cramér–Rao lower bound eq. (40) on the MSE of any unbiased estimator of $g$ that queries the shot-noise oracle a total of $B$ times. Second we show that the forward-gradient estimator at the optimal allocation of Theorem 1 attains this bound up to a constant that vanishes a...

[1] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, and P. J. Coles, Variational quantum algorithms, Nat Rev Phys 3, 625 (2021)

[2] K. Bharti et al., Noisy intermediate-scale quantum algorithms, Rev. Mod. Phys. 94, 015004 (2022)

[3] M. Larocca, N. Ju, D. García-Martín, P. J. Coles, and M. Cerezo, Theory of overparametrization in quantum neural networks, Nat Comput Sci 3, 542 (2023)

[4] A. Delgado, F. Rios, and K. E. Hamilton, Identifying overparameterization in Quantum Circuit Born Machines (2023), arXiv:2307.03292

[5] D. García-Martín, M. Larocca, and M. Cerezo, Effects of noise on the overparametrization of quantum neural networks, Phys. Rev. Res. 6, 013295 (2024)

[6] Z. Holmes, K. Sharma, M. Cerezo, and P. J. Coles, Connecting ansatz expressibility to gradient magnitudes and barren plateaus, PRX Quantum 3, 010313 (2022)

[7] M. Schuld and N. Killoran, Is quantum advantage the right goal for quantum machine learning?, PRX Quantum 3, 030101 (2022)

[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature 323, 533 (1986)

[9] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, Automatic Differentiation in Machine Learning: a Survey, Journal of Machine Learning Research 18, 1 (2018)

[10] A. Abbas, R. King, H.-Y. Huang, W. J. Huggins, R. Movassagh, D. Gilboa, and J. R. McClean, On quantum backpropagation, information reuse, and cheating measurement collapse, in Advances in Neural Information Processing Systems, Vol. 36 (2023), arXiv:2305.13362

[11] J. Bowles, D. Wierichs, and C.-Y. Park, Backpropagation scaling in parameterised quantum circuits, Quantum 9, 1873 (2025)

[12] B. Coyle, S. Raj, N. Mathur, E. A. Cherrat, N. Jain, S. Kazdaghli, and I. Kerenidis, Training-efficient density quantum machine learning, npj Quantum Inf. 11, 172 (2025), arXiv:2405.20237

[13] K. Chinzei, S. Yamano, Q. H. Tran, Y. Endo, and H. Oshima, Trade-off between Gradient Measurement Efficiency and Expressivity in Deep Quantum Neural Networks, npj Quantum Inf. 11, 79 (2025)

[14] J. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Transactions on Automatic Control 37, 332 (1992)

[15] Z. Ding, T. Ko, J. Yao, L. Lin, and X. Li, Random coordinate descent: A simple alternative for optimizing parameterized quantum circuits, Phys. Rev. Res. 6, 033029 (2024)

[16] A. G. Baydin, B. A. Pearlmutter, D. Syme, F. Wood, and P. Torr, Gradients without Backpropagation (2022), arXiv:2202.08587

[17] D. Silver, A. Goyal, I. Danihelka, M. Hessel, and H. v. Hasselt, Learning by Directional Gradient Descent, in International Conference on Learning Representations (2022)

[18] F. Hanzely, K. Mishchenko, and P. Richtarik, SEGA: Variance Reduction via Gradient Sketching, in Advances in Neural Information Processing Systems, Vol. 31 (2018), arXiv:1809.03054

[19] G. Hinton, The Forward-Forward Algorithm: Some Preliminary Investigations (2022), arXiv:2212.13345

[20] L. Fournier, S. Rivaud, E. Belilovsky, M. Eickenberg, and E. Oyallon, Can Forward Gradient Match Backpropagation?, in Fortieth International Conference on Machine Learning (2023), arXiv:2306.06968

[21] M. Ren, S. Kornblith, R. Liao, and G. Hinton, Scaling Forward Gradient With Local Losses, in International Conference on Learning Representations (2023), arXiv:2210.03310

[22] L. Balles, J. Romero, and P. Hennig, Coupling Adaptive Batch Sizes with Learning Rates, in Uncertainty in Artificial Intelligence (2017), arXiv:1612.05086

[23] J. M. Kübler, A. Arrasmith, L. Cincio, and P. J. Coles, An Adaptive Optimizer for Measurement-Frugal Variational Algorithms, Quantum 4, 263 (2020)

[24] A. Gu, A. Lowe, P. A. Dub, P. J. Coles, and A. Arrasmith, Adaptive shot allocation for fast convergence in variational quantum algorithms (2021), arXiv:2108.10434

[25] J. Landman, N. Mathur, Y. Y. Li, M. Strahm, S. Kazdaghli, A. Prakash, and I. Kerenidis, Quantum Methods for Neural Networks and Application to Medical Image Classification, Quantum 6, 881 (2022)

[26] L. Monbroussou, J. Landman, A. B. Grilo, R. Kukla, and E. Kashefi, Trainability and Expressivity of Hamming-Weight Preserving Quantum Circuits for Machine Learning, Quantum 9, 1745 (2025)

[27] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in International Conference on Learning Representations (2015), arXiv:1412.6980

[28] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, JAX: composable transformations of Python+NumPy programs (2018)

[29] A. Paszke et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library (2019), arXiv:1912.01703

[30] Martín Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems (2015), software available from tensorflow.org

[31] A. Griewank, K. Kulshreshtha, and A. Walther, On the numerical stability of algorithmic differentiation, Computing 94, 125 (2012)

[32] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61, 85 (2015)

[33] A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, Data re-uploading for a universal quantum classifier, Quantum 4, 226 (2020)

[34] J. Romero, R. Babbush, J. R. McClean, C. Hempel, P. J. Love, and A. Aspuru-Guzik, Strategies for quantum computing molecular energies using the unitary coupled cluster ansatz, Quantum Sci. Technol. 4, 014008 (2018)

[35] E. Farhi and H. Neven, Classification with Quantum Neural Networks on Near Term Processors (2018), arXiv:1802.06002

[36] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, Quantum circuit learning, Phys. Rev. A 98, 032309 (2018)

[37] D. Wierichs, J. Izaac, C. Wang, and C. Y.-Y. Lin, General parameter-shift rules for quantum gradients, Quantum 6, 677 (2022)

[38] O. Kyriienko and V. E. Elfving, Generalized quantum circuit differentiation rules, Phys. Rev. A 104, 052417 (2021)

[39] G.-L. R. Anselmetti, D. Wierichs, C. Gogolin, and R. M. Parrish, Local, expressive, quantum-number-preserving VQE ansätze for fermionic systems, New J. Phys. 23, 113010 (2021)

[40] R. Sweke, F. Wilde, J. Meyer, M. Schuld, P. K. Faehrmann, B. Meynard-Piganeau, and J. Eisert, Stochastic gradient descent for hybrid quantum-classical optimization, Quantum 4, 314 (2020)

[41] C. Moussa, M. H. Gordon, M. Baczyk, M. Cerezo, L. Cincio, and P. J. Coles, Resource frugal optimizer for quantum machine learning, Quantum Sci. Technol. 8, 045019 (2023)

[42] J. C. Spall, A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates, in 1987 American Control Conference (1987) pp. 1161–1167

[43] S. Bhatnagar, H. Prasad, and L. Prashanth, Stochastic Approximation Algorithms, in Stochastic Recursive Algorithms for Optimization (Springer, 2013) pp. 17–28

[44] C. Cade, L. Mineh, A. Montanaro, and S. Stanisic, Strategies for solving the Fermi-Hubbard model on near-term quantum computers, Phys. Rev. B 102, 235122 (2020)

[45] J. Gacon, C. Zoufal, G. Carleo, and S. Woerner, Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information, Quantum 5, 567 (2021)

[46] N. Jain, B. Coyle, E. Kashefi, and N. Kumar, Graph neural network initialisation of quantum approximate optimisation, Quantum 6, 861 (2022)

[47] F. Sauvage and F. Mintert, Optimal quantum control with poor statistics, PRX Quantum 1, 020322 (2020)

[48] X. Bonet-Monroig, H. Wang, D. Vermetten, B. Senjean, C. Moussa, T. Bäck, V. Dunjko, and T. E. O’Brien, Performance comparison of optimization methods on variational quantum algorithms, Physical Review A 107, 032407 (2023), arXiv:2111.13454 [quant-ph]

[49] Y. Nesterov, Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, SIAM J. Optim. 22, 341 (2012)

[50] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Math. Program. 144, 1 (2014)

[51] A. Arrasmith, L. Cincio, R. D. Somma, and P. J. Coles, Operator Sampling for Shot-frugal Optimization in Variational Algorithms (2020), arXiv:2004.06252

[52] B. van Straaten and B. Koczor, Measurement cost of metric-aware variational quantum algorithms, PRX Quantum 2, 030324 (2021)

[53] G. Boyd and B. Koczor, Training variational quantum circuits with CoVaR: Covariance root finding with classical shadows, Phys. Rev. X 12, 041022 (2022)

[54] G. García-Pérez, M. A. C. Rossi, B. Sokolov, F. Tacchino, P. K. Barkoutsos, G. Mazzola, I. Tavernelli, and S. Maniscalco, Learning to measure: Adaptive informationally complete generalized measurements for quantum algorithms, PRX Quantum 2, 040342 (2021)

[55] S. Pramanik and M. G. Chandra, Stochastic Shadow Descent: Training Parametrized Quantum Circuits with Shadows of Gradients (2025), arXiv:2511.12168

[56] K. Flügel, D. Coquelin, M. Götz, and C. Debus, Beyond Backpropagation: Optimization with Multi-Tangent Forward Gradients (2024), arXiv:2410.17764

[57] T. Bos and J. Schmidt-Hieber, Convergence guarantees for forward gradient descent in the linear regression model, Journal of Statistical Planning and Inference 233, 106174 (2024)

[58] N. Dexheimer and J. Schmidt-Hieber, Improving the Convergence Rates of Forward Gradient Descent with Repeated Sampling (2024), arXiv:2411.17567

[59] U. Singhal, B. Cheung, K. Chandra, J. Ragan-Kelley, J. B. Tenenbaum, T. A. Poggio, and S. X. Yu, How to guess a gradient (2023), arXiv:2312.04709

[60] Z. Wang, S. Markou, and A. Campbell, Towards Scalable Backpropagation-Free Gradient Estimation (2025), arXiv:2511.03110

[61] K. Panchal, S. Choudhary, Y. Brun, and H. Guan, The Cost of Avoiding Backpropagation (2025), arXiv:2506.21833

[62] A. D. Cobb, A. G. Baydin, B. A. Pearlmutter, and S. Jha, Second-Order Forward-Mode Automatic Differentiation for Optimization, in International Conference on Learning Representations (2025), arXiv:2408.10419

[63] Y. Yu, R. Xia, Q. Ma, M. Lengyel, and G. Hennequin, Second-Order Forward-Mode Optimization of Recurrent Neural Networks for Neuroscience, in Advances in Neural Information Processing Systems, Vol. 37 (2024)

[64] J. Stokes, J. Izaac, N. Killoran, and G. Carleo, Quantum Natural Gradient, Quantum 4, 269 (2020)

[65] A. Mari, T. R. Bromley, and N. Killoran, Estimating the gradient and higher-order derivatives on quantum hardware, Physical Review A 103, 012405 (2021)

[66] R. M. Parrish, G.-L. R. Anselmetti, and C. Gogolin, Analytical Ground- and Excited-State Gradients for Molecular Electronic Structure Theory from Hybrid Quantum/Classical Methods (2021), arXiv:2110.05040

[67] M. M. Wolf, Mathematical Foundations of Supervised Learning (Lecture notes, Technical University of Munich, 2023)

[68] M. Talagrand, Concentration of measure and isoperimetric inequalities in product spaces, Publications Mathématiques de l’IHÉS 81, 73 (1995)

[69] M. Cerezo, A. Sone, T. Volkoff, L. Cincio, and P. J. Coles, Cost function dependent barren plateaus in shallow parametrized quantum circuits, Nature Communications 12, 1791 (2021)

[70] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets, Nature 549, 242 (2017)

[71] E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm (2014), arXiv:1411.4028 [quant-ph]

[72] R. Herrman, P. C. Lotshaw, J. Ostrowski, T. S. Humble, and G. Siopsis, Multi-angle quantum approximate optimization algorithm, Scientific Reports 12, 6781 (2022), arXiv:2109.11455

Appendix A: Unbiasedness of the forward gradient estimator

We prove that the $V$-direction, $M$-shot forward gradient estimator eq. (A2) is unbiased in the $\varepsilon \to 0$ limit, adapting the cla...

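If $\mathbb{E}[\|g^{(t)}(\theta)\|^2] \le \gamma^2$ for all $\theta, t$ and $\eta \in [0, 1/(2\mu)]$, then
$$\mathbb{E}[f(\theta^{(T)})] - f(\theta^\star) \le (1 - 2\mu\eta)^{T} \left(f(\theta^{(0)}) - f(\theta^\star)\right) + \frac{L\eta\gamma^2}{4\mu}. \tag{D5}$$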
