arXiv: 2604.18846 · v1 · submitted 2026-04-20 · 🪐 quant-ph · cs.LG

Trainability Beyond Linearity in Variational Quantum Objectives

Pith reviewed 2026-05-10 04:04 UTC · model grok-4.3

classification 🪐 quant-ph cs.LG
keywords variational quantum algorithms · barren plateaus · trainability · affine losses · gradient suppression · loss functions · quantum objectives · polynomial width

The pith

Variational quantum objectives admit a fixed-observable representation exactly when the loss is affine in the measured statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that barren-plateau proofs based on concentration apply to a variational quantum objective only when that objective can be rewritten as the expectation of one fixed observable. This rewriting is possible if and only if the classical loss function is affine with respect to the bit-string probabilities returned by the quantum circuit. For all other losses the reduction is unavailable, so the standard proof template does not automatically cover them. A gradient decomposition into model responsivity, loss-side signal and transmittance then separates losses that must inherit exponential suppression from those that can in principle amplify signal. When measurements are coarse-grained at polynomial width rather than exponential, the second class can produce gradients orders of magnitude larger than the affine baseline, as demonstrated on a charge-conserving system.
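
The boundary described above can be checked numerically on a toy example. The sketch below is an illustration, not the paper's construction: it assumes computational-basis statistics, uses arbitrary random density matrices rather than ansatz states, and picks a negative log-likelihood (NLL) against a random target as the non-affine loss. All helper names are hypothetical. It relies on one fact: any function of the form Tr(Hρ) + c is affine in ρ, so it must satisfy the midpoint identity on a 50/50 mixture of two states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_qubits = 3
dim = 2 ** n_qubits


def random_density_matrix(dim, rng):
    """Random full-rank density matrix (illustrative helper, not from the paper)."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real


def bitstring_probs(rho):
    """Computational-basis statistics p_x = <x|rho|x>."""
    return np.real(np.diag(rho))


# Affine loss f(p) = a.p + c corresponds to the fixed observable H = diag(a).
a = rng.normal(size=dim)
c = 0.3
H = np.diag(a)


def affine_loss(p):
    return float(a @ p + c)


# Non-affine loss: NLL against a fixed, strictly positive target q (assumed example).
q = rng.dirichlet(np.ones(dim))


def nll_loss(p):
    return float(-(q @ np.log(p)))


rho1 = random_density_matrix(dim, rng)
rho2 = random_density_matrix(dim, rng)
rho_mid = 0.5 * (rho1 + rho2)

# Check 1: the affine loss equals Tr(H rho) + c.
affine_gap = abs(affine_loss(bitstring_probs(rho1)) - (np.trace(H @ rho1).real + c))


def midpoint_defect(loss):
    """loss(mixture) minus the average of the losses; zero for anything affine in rho."""
    lhs = loss(bitstring_probs(rho_mid))
    rhs = 0.5 * (loss(bitstring_probs(rho1)) + loss(bitstring_probs(rho2)))
    return lhs - rhs


affine_defect = midpoint_defect(affine_loss)
nll_defect = midpoint_defect(nll_loss)

print("affine loss vs Tr(H rho) + c gap:", affine_gap)
print("midpoint defect, affine loss:", affine_defect)
print("midpoint defect, NLL loss:", nll_defect)
```

The affine loss passes both checks to numerical precision, while the NLL shows a non-zero midpoint defect, so no single pair (H, c) can reproduce it on these states. The paper's statement is formulated over the statistics explored by the ansatz, which this toy check does not model.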

Core claim

The objective itself admits a fixed-observable representation if and only if the loss is affine in the measured statistics, thereby identifying the exact boundary of the standard concentration-based proof template. Existing transfer results for non-affine losses achieve this reduction under additional assumptions; the characterization implies that such a reduction is not structurally available for a class of non-affine objectives. Beyond the affine regime, a chain-rule decomposition reveals three governing factors (model responsivity, loss-side signal, and transmittance) and induces a loss-class dichotomy: bounded-gradient losses inherit suppression, while amplification-capable losses can in principle counteract it.

What carries the argument

The fixed-observable representation of the variational objective, which is available precisely when the loss function is affine in the measured statistics and which allows direct application of concentration bounds.
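
A short derivation makes the easy direction of this equivalence explicit. The notation is assumed here for illustration: interface observables $O_1,\dots,O_m$, measured statistics $F_j(\rho) = \operatorname{Tr}(O_j\rho)$, and objective $L(\theta) = f(F(\rho(\theta)))$.

```latex
\begin{align*}
f(F) = a^{\top}F + c
\;&\Longrightarrow\;
L(\theta) = \sum_{j=1}^{m} a_j \operatorname{Tr}\!\big(O_j\,\rho(\theta)\big) + c
          = \operatorname{Tr}\!\big(H\,\rho(\theta)\big) + c,
\qquad H := \sum_{j=1}^{m} a_j O_j, \\[4pt]
L(\theta) = \operatorname{Tr}\!\big(H\,\rho(\theta)\big) + c,\;
H = \sum_{j=1}^{m} a_j O_j
\;&\Longrightarrow\;
f\big(F(\rho(\theta))\big) = a^{\top}F\big(\rho(\theta)\big) + c
\quad\text{on the explored statistics.}
\end{align*}
```

The first line uses only linearity of the trace. The second line shows why a non-affine loss cannot be absorbed into a fixed observable drawn from the interface: such a representation would force $f$ to coincide with an affine function wherever the ansatz explores the statistics.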

If this is right

  • Non-affine losses lie outside the automatic reach of existing concentration proofs for barren plateaus.
  • Bounded-gradient losses inherit the same exponential suppression shown for affine cases.
  • Amplification-capable losses can in principle produce non-vanishing gradients by counteracting suppression.
  • At exponential width both classes fail to train, yet for structurally distinct reasons.
  • At polynomial width the exponential obstruction relaxes and the loss-class distinction becomes decisive for trainability.
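
The dichotomy in the bullets above can be illustrated by comparing how the loss-side gradient norm scales for two representative losses. This is a minimal sketch under assumptions made here: a squared-error loss stands in for the bounded-gradient class and a negative log-likelihood (NLL) for the amplification-capable class; the model distribution is uniform over N = 2^n outcomes and the target is uniform on a support of size s. The page does not state which losses the paper assigns to each class beyond naming an NLL estimator in the Figure 4 caption.

```python
import numpy as np


def squared_error_grad(p, q):
    """Gradient of 0.5 * ||p - q||^2 with respect to p (bounded on the simplex)."""
    return p - q


def nll_grad(p, q):
    """Gradient of -sum_x q_x log p_x with respect to p (grows as p_x shrinks)."""
    return -q / p


def uniform_on_support(N, s):
    """Target distribution uniform on the first s outcomes (illustrative choice)."""
    q = np.zeros(N)
    q[:s] = 1.0 / s
    return q


results = []
for n in range(2, 11, 2):
    N = 2 ** n
    s = max(1, N // 4)                  # hypothetical support size
    p = np.full(N, 1.0 / N)             # model statistics near uniform
    q = uniform_on_support(N, s)
    g_bounded = np.linalg.norm(squared_error_grad(p, q))
    g_nll = np.linalg.norm(nll_grad(p, q))
    results.append((n, g_bounded, g_nll, N / np.sqrt(s)))

for n, g_bounded, g_nll, predicted in results:
    print(f"n={n:2d}  squared-error={g_bounded:.4f}  NLL={g_nll:.2f}  N/sqrt(s)={predicted:.2f}")
```

In this setting the NLL gradient norm equals N/√s exactly, while the squared-error gradient norm never exceeds √2. Whether that extra signal survives to the parameter gradient depends on the other two factors, which this sketch does not model.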

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithm designers should test amplification-capable losses whenever the measurement interface can be restricted to polynomial width.
  • New barren-plateau proofs will be required for non-affine objectives rather than relying on reductions to fixed observables.
  • The same loss-class distinction may apply to other variational problems that output probability vectors rather than single expectations.
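
A polynomial-width interface can be sketched as a linear coarse-graining of the bitstring distribution. The example below bins outcomes by Hamming weight, which yields n + 1 statistics instead of 2^n. This binning is an assumption chosen for illustration (it corresponds to total charge in a charge-conserving setting); the paper's own block interface may be defined differently.

```python
import numpy as np


def hamming_weight_coarse_graining(n):
    """Return C with shape (n + 1, 2**n), where C[w, x] = 1 if bitstring x has weight w."""
    N = 2 ** n
    weights = np.array([bin(x).count("1") for x in range(N)])
    C = np.zeros((n + 1, N))
    C[weights, np.arange(N)] = 1.0
    return C


n = 6
C = hamming_weight_coarse_graining(n)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(2 ** n))   # stand-in for circuit output probabilities
F = C @ p                            # coarse-grained statistics, one per weight sector

print("fine-grained dimension:", p.size)
print("coarse-grained dimension:", F.size)
print("coarse statistics sum to:", F.sum())
```

Because the map p ↦ Cp is linear, each coarse statistic is still the expectation of a fixed projector, so the affine versus non-affine distinction applies to losses on F in the same way as to losses on p.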

Load-bearing premise

The chain-rule decomposition fully captures the governing factors without additional quantum-specific constraints or post-selection effects that could alter the loss-class dichotomy or the polynomial-width relaxation.
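
One reading of the three-factor decomposition can be checked with plain linear algebra. The sketch assumes the following definitions, which may differ in detail from the paper's theorem: responsivity is the largest singular value of the Jacobian J of the statistics with respect to the parameters, loss-side signal is the norm of the loss gradient g in statistics space, and transmittance is the overlap between the normalized g and the leading left singular vector of J. Under those assumptions the chain rule gives ∇θL = Jᵀg, and the product of the three factors is a lower bound on its norm. The matrices are random stand-ins, not circuit data.

```python
import numpy as np

rng = np.random.default_rng(2)
m, P = 12, 5                      # hypothetical sizes: m statistics, P parameters

J = rng.normal(size=(m, P))       # stand-in for the Jacobian dF/dtheta
g = rng.normal(size=m)            # stand-in for the loss gradient df/dF

# Chain rule: gradient of the composite objective with respect to theta.
grad_theta = J.T @ g
grad_norm = np.linalg.norm(grad_theta)

# Three factors under the assumed definitions.
U, S, Vt = np.linalg.svd(J, full_matrices=False)
responsivity = S[0]                               # sigma_max(J)
signal = np.linalg.norm(g)                        # ||g||
transmittance = abs(U[:, 0] @ g) / signal         # |<u_max, g/||g||>|

lower_bound = responsivity * signal * transmittance

print("||grad_theta L||      :", grad_norm)
print("three-factor product  :", lower_bound)
print("transmittance in [0,1]:", transmittance)
```

The bound holds because the squared gradient norm is a sum over singular directions of σᵢ²⟨uᵢ, g⟩², and the product keeps only the leading term. The referee's concern about post-selection amounts to asking whether J and g remain the right objects once the statistics are conditioned.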

What would settle it

A concrete non-affine loss that nonetheless reduces to a single fixed observable, or an amplification-capable objective at polynomial width whose gradient scaling remains statistically indistinguishable from the exponential decay of affine baselines at equal shot budgets.
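
The second test above comes down to telling an exponential trend from a polynomial one over a finite range of system sizes. The sketch below shows one simple way to frame that comparison, by fitting log-gradient against n and against log n and comparing residuals. It is not the paper's statistical procedure, which this page does not describe, and the two input series are synthetic placeholders rather than measured gradients.

```python
import numpy as np


def fit_rss(x, y):
    """Least-squares line y ~ a*x + b; return (slope, residual sum of squares)."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(coef[0]), float(resid @ resid)


def compare_trends(n_values, grad_norms):
    """Compare an exponential model (log g linear in n) with a power law (log g linear in log n)."""
    n = np.asarray(n_values, dtype=float)
    log_g = np.log(np.asarray(grad_norms, dtype=float))
    exp_slope, exp_rss = fit_rss(n, log_g)
    pow_slope, pow_rss = fit_rss(np.log(n), log_g)
    preferred = "exponential" if exp_rss < pow_rss else "power-law"
    return {"exp_slope": exp_slope, "exp_rss": exp_rss,
            "pow_slope": pow_slope, "pow_rss": pow_rss,
            "preferred": preferred}


# Synthetic placeholder series (NOT data from the paper).
n_values = np.arange(4, 13).astype(float)
synthetic_exponential = 2.0 ** (-n_values)
synthetic_power_law = n_values ** (-2.0)

print(compare_trends(n_values, synthetic_exponential)["preferred"])
print(compare_trends(n_values, synthetic_power_law)["preferred"])
```

With real finite-shot estimates the comparison would also need per-point uncertainties and a proper model-selection criterion, since a short interval of n can make the two trends hard to separate.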

Figures

Figures reproduced from arXiv: 2604.18846 by Gordon Ma, Xiufan Li.

Figure 1. Setup of the numerical demonstration. (a) A domain-wall initial state on …
Figure 2. At fixed block-interface width …
Figure 3. In the chain-rule decomposition of Theorem …
Figure 4. Ridgeline view of the per-circuit finite-shot NLL theorem-estimator distribution on the canonical 60-circuit …
Figure 5. Accepted shot frontiers …
Figure 6. Comparison of the charge-conserving joint-block results for block counts …
Original abstract

Barren-plateau results have established exponential gradient suppression as a widely cited obstacle to the scalability of variational quantum algorithms. When and whether these results extend to a given objective has been addressed through loss-specific arguments, but a general structural characterization has remained open. We show that the objective itself admits a fixed-observable representation if and only if the loss is affine in the measured statistics, thereby identifying the exact boundary of the standard concentration-based proof template. Existing transfer results for non-affine losses achieve this reduction under additional assumptions; our characterization implies that such a reduction is not structurally available for a class of non-affine objectives, placing them outside the automatic reach of the existing proof template. Beyond the affine regime, a chain-rule decomposition reveals three governing factors -- model responsivity, loss-side signal, and transmittance -- and induces a loss-class dichotomy: bounded-gradient losses inherit suppression, while amplification-capable losses can in principle counteract it. In the exponentially wide setting, both classes fail, but for different structural reasons. When the interface is instead designed at polynomial width -- exposing coarse-grained statistics rather than individual bitstring probabilities -- the exponential-dimensional obstruction is relaxed and the dichotomy plays a genuine role. In a numerical demonstration on a charge-conserving quantum system, the amplification-capable objective produces resolved gradients several orders of magnitude larger than affine and inheriting baselines at comparable shot budgets. Over the tested interval, its scaling trend is statistically distinguished from the exponential trend of both alternatives. The boundary is affine; what lies beyond it is a representation-design problem.
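
The role of interface width in the abstract can be made concrete with a simple null model. Assume, purely for illustration, that the loss-gradient direction is isotropically random relative to the model's most responsive direction in an m-dimensional statistics space. The overlap of two independent random unit vectors then has typical size of order 1/√m, which is exponentially small when m = 2^n but only polynomially small when m = poly(n). The Monte Carlo sketch below estimates that overlap; the isotropy assumption and the specific widths are choices made here, not results taken from the paper.

```python
import numpy as np


def mean_overlap(m, trials, rng):
    """Monte Carlo estimate of E|<u, v>| for independent uniform unit vectors in R^m."""
    u = rng.normal(size=(trials, m))
    v = rng.normal(size=(trials, m))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.abs(np.sum(u * v, axis=1))))


rng = np.random.default_rng(3)
n = 10
widths = {"exponential width 2**n": 2 ** n, "polynomial width n**2": n ** 2}

overlaps = {}
for label, m in widths.items():
    overlaps[label] = mean_overlap(m, trials=2000, rng=rng)
    scaled = overlaps[label] * np.sqrt(m)
    print(f"{label:24s} m={m:5d}  mean overlap={overlaps[label]:.4f}  times sqrt(m)={scaled:.3f}")
```

The rescaled overlap stays close to √(2/π) ≈ 0.80 for both widths, so in this null model the overlap itself shrinks only because m grows. A structured interface need not be isotropic, in which case the effective dimension can be smaller than m.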

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that variational quantum objectives admit a fixed-observable representation if and only if the loss is affine in the measured statistics, marking the exact boundary for standard concentration-based barren-plateau proofs. For non-affine losses, a chain-rule decomposition of the gradient into model responsivity, loss-side signal, and transmittance induces a dichotomy: bounded-gradient losses inherit exponential suppression while amplification-capable losses can in principle counteract it. Both classes fail in the exponentially wide regime for different reasons, but polynomial-width coarse-graining relaxes the dimensional obstruction. Numerical results on a charge-conserving system show the amplification-capable objective yielding orders-of-magnitude larger resolved gradients and a statistically distinct scaling trend from affine and inheriting baselines at fixed shot budgets.

Significance. If the iff characterization and decomposition hold, the work supplies a general structural criterion for when existing barren-plateau templates apply, replacing loss-by-loss arguments with a clean boundary condition. The polynomial-width relaxation and the three-factor decomposition offer concrete guidance for objective design. The numerical demonstration, with its reported statistical distinction in scaling and gradient magnitude improvement, supplies falsifiable evidence that the dichotomy is operative once the exponential interface is removed. Credit is due for the parameter-free structural claim and the reproducible charge-conserving example.

major comments (2)
  1. The chain-rule decomposition (main text, following the fixed-observable characterization) assumes the loss acts directly on raw measurement statistics. Post-selection on outcomes or entanglement-induced conditioning can introduce effective non-linearities or additional transmittance factors not captured by the three-term split; the charge-conserving numerical example does not test these cases, so the claimed loss-class dichotomy may not be robust in general variational settings. A concrete extension or counter-example addressing post-selection would be required to confirm the decomposition governs all relevant quantum constraints.
  2. Section on polynomial-width coarse-graining: the claim that this interface relaxes the exponential obstruction relies on the dichotomy surviving once individual bit-string probabilities are replaced by coarse statistics. If post-selection or entanglement alters the effective loss class, the relaxation argument would need re-derivation; the current numerical support is limited to one system and does not vary the coarse-graining width explicitly.
minor comments (2)
  1. The final sentence of the abstract is information-dense; splitting it would improve readability without changing content.
  2. Notation for the three governing factors (responsivity, signal, transmittance) should be introduced with a single equation reference to aid cross-referencing.
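
The post-selection concern in major comment 1 can be stated in one line. The notation is assumed here: $p_x$ are the raw outcome probabilities, $A$ is the accepted outcome set, and $\tilde p$ is the post-selected distribution.

```latex
\begin{align*}
\tilde p_x &= \frac{p_x}{\sum_{y \in A} p_y}, \qquad x \in A, \\[4pt]
f(\tilde p) = a^{\top}\tilde p + c
\;&\Longrightarrow\;
L = \frac{\sum_{x \in A} a_x\, p_x}{\sum_{y \in A} p_y} + c .
\end{align*}
```

Even a loss that is affine in the post-selected statistics becomes a ratio of two linear functionals of the raw statistics. That ratio is not affine in $p$ unless the acceptance probability $\sum_{y \in A} p_y$ is constant over the explored parameters or $a$ is constant on $A$, which is why conditioning can move an objective across the affine boundary.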

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and constructive comments, which help clarify the scope of our claims. We respond to each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: The chain-rule decomposition (main text, following the fixed-observable characterization) assumes the loss acts directly on raw measurement statistics. Post-selection on outcomes or entanglement-induced conditioning can introduce effective non-linearities or additional transmittance factors not captured by the three-term split; the charge-conserving numerical example does not test these cases, so the claimed loss-class dichotomy may not be robust in general variational settings. A concrete extension or counter-example addressing post-selection would be required to confirm the decomposition governs all relevant quantum constraints.

    Authors: We agree that the derivation of the chain-rule decomposition assumes the loss acts directly on raw measurement statistics. Post-selection and entanglement-induced conditioning can introduce effective nonlinearities or additional factors not captured by the three-term split. The charge-conserving numerical example does not incorporate post-selection and therefore does not test these cases. A full extension to conditioned statistics would require new theoretical machinery and lies outside the scope of the present work. We will revise the manuscript to explicitly state the assumptions under which the decomposition holds and to discuss its limitations in the presence of post-selection, thereby preventing any implication of universality for the loss-class dichotomy.
    Revision: partial

  2. Referee: Section on polynomial-width coarse-graining: the claim that this interface relaxes the exponential obstruction relies on the dichotomy surviving once individual bit-string probabilities are replaced by coarse statistics. If post-selection or entanglement alters the effective loss class, the relaxation argument would need re-derivation; the current numerical support is limited to one system and does not vary the coarse-graining width explicitly.

    Authors: We concur that the relaxation argument at polynomial widths presupposes that the loss-class dichotomy continues to apply to the coarse-grained statistics. Post-selection or entanglement could alter the effective loss class, necessitating re-derivation in those settings. The numerical demonstration is restricted to a single charge-conserving system and does not explicitly vary the coarse-graining width. We will revise the polynomial-width section to clarify the conditions required for the relaxation to hold and to note the dependence on the loss class remaining unaltered by conditioning. We will also acknowledge the limited scope of the numerical evidence.
    Revision: partial

standing simulated objections not resolved
  • Providing a concrete extension or counter-example for the decomposition under post-selection, which would require substantial new theoretical development beyond the current manuscript.

Circularity Check

0 steps flagged

Derivation is a self-contained structural characterization with no reduction to inputs by construction

Full rationale

The central claim is an if-and-only-if characterization: the objective admits a fixed-observable representation precisely when the loss is affine in the measured statistics. This follows directly from algebraic expansion of the objective definition and the chain-rule gradient decomposition into responsivity, signal, and transmittance factors. No parameter is fitted and then renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the polynomial-width relaxation is introduced as an explicit design choice rather than smuggled via prior work. The numerical charge-conserving example is presented only as illustration of the dichotomy, not as the proof of the boundary. The derivation therefore remains independent of its own outputs and satisfies the default expectation of non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on differentiability of the loss with respect to measurement statistics and on the validity of the chain-rule decomposition under standard quantum measurement models; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: The loss function is differentiable with respect to the measured statistics.
    Invoked to apply the chain-rule decomposition that separates responsivity, loss-side signal, and transmittance.

pith-pipeline@v0.9.0 · 5571 in / 1240 out tokens · 39296 ms · 2026-05-10T04:04:10.710719+00:00 · methodology

