
arxiv: 2504.19239 · v2 · submitted 2025-04-27 · 🪐 quant-ph · cs.LG

The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks

Pith reviewed 2026-05-22 18:49 UTC · model grok-4.3

keywords: quantum neural networks · loss landscapes · Hessian analysis · distributed patches · optimization stability · barren plateaus · classical data

The pith

Increasing the number of local patches reduces the largest Hessian eigenvalue at minima in distributed quantum neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how parameter count and the number of overlapping local feature patches shape the loss landscape when classical data is split across independent quantum neural networks whose outputs are aggregated. It finds that more parameters produce deeper and sharper minima, while more patches lower the dominant Hessian eigenvalue at those minima. The patch effect is derived theoretically and confirmed empirically, indicating that distributing patches supplies implicit structural regularization that improves optimization stability. The full Hessian spectrum shows a bulk of near-zero eigenvalues plus distinct outlier spikes whose count matches the number of classes, a pattern also seen in classical deep learning. The results suggest the distributed-patch design can help make quantum models for classical data more trainable as system size grows.

Core claim

Increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima, derived from the aggregation of independent patch outputs and verified through Hessian analysis and loss-landscape visualization.
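
A rough way to see how the patch count can enter the curvature is sketched here. It assumes, purely for illustration, a scalar prediction formed by averaging the P patch outputs, a convex twice-differentiable per-sample loss ℓ, and a separate parameter vector θ_p for each patch. The symbols g_p, J_p, and c are introduced for this sketch only, and the paper's own derivation may differ.

```latex
\[
\bar{y}(\theta) = \frac{1}{P}\sum_{p=1}^{P} g_p(\theta_p),
\qquad
L(\theta) = \ell\big(\bar{y}(\theta)\big),
\qquad
J_p = \frac{\partial g_p}{\partial \theta_p} \in \mathbb{R}^{1 \times n_p}
\]
\[
\frac{\partial^2 L}{\partial \theta_p \,\partial \theta_q}
  = \frac{\ell''(\bar{y})}{P^2}\, J_p^{\top} J_q
  + \delta_{pq}\, \frac{\ell'(\bar{y})}{P}\, \nabla^2_{\theta_p} g_p
\]
% At a minimum where the residual term vanishes, \ell'(\bar{y}) \approx 0:
\[
H \approx \frac{\ell''(\bar{y})}{P^2}\, J^{\top} J,
\quad J = [\,J_1 \;\cdots\; J_P\,],
\qquad
\lambda_{\max}(H) \approx \frac{\ell''(\bar{y})}{P^2} \sum_{p=1}^{P} \lVert J_p \rVert^2
\;\le\; \frac{\ell''(\bar{y})\, c}{P}
\quad \text{if } \lVert J_p \rVert^2 \le c \text{ for all } p .
\]
```

In this special case the cross-patch blocks are included exactly, because J^T J has rank one. With vector-valued outputs, non-vanishing residuals, or a different aggregation rule, the same bound does not follow directly.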

What carries the argument

The distributed architecture that processes overlapping local patches with separate quantum neural networks and aggregates their outputs for the final prediction.
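
The data flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `patch_model` is a hypothetical classical stand-in for one quantum neural network, mean aggregation is an assumption, and the image size, patch size, and stride are arbitrary.

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Return square patches; a stride smaller than patch_size gives overlap."""
    h, w = image.shape
    patches = []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(image[i:i + patch_size, j:j + patch_size])
    return np.stack(patches)

def patch_model(patch, weights):
    """Hypothetical stand-in for one QNN: a bounded score per class."""
    return np.tanh(weights @ patch.ravel())

def distributed_forward(image, all_weights, patch_size, stride):
    """Run one independent model per patch, then aggregate (assumed: mean)."""
    patches = extract_patches(image, patch_size, stride)
    outputs = [patch_model(p, w) for p, w in zip(patches, all_weights)]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
image = rng.random((8, 8))                                 # toy input, not real data
patches = extract_patches(image, patch_size=4, stride=2)   # 3 x 3 grid of patches
all_weights = rng.normal(size=(len(patches), 3, 16))       # 3 classes, 16 pixels per patch
scores = distributed_forward(image, all_weights, patch_size=4, stride=2)
```

Each patch has its own weight array, so the parameter vectors are disjoint even though neighbouring patches share input pixels.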

If this is right

  • More parameters produce deeper and sharper loss landscapes.
  • Higher patch counts lower the dominant Hessian eigenvalue and promote optimization stability.
  • The Hessian eigenspectrum consists of a bulk of near-zero values plus outlier spikes matching the number of classes.
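
One simple way to check the bulk-plus-spikes claim on a computed spectrum is to split the sorted eigenvalues at their largest gap and count what lies above it. The sketch below uses a synthetic spectrum made up for illustration; the splitting rule is an assumption and not a procedure taken from the paper.

```python
import numpy as np

def split_bulk_and_spikes(eigenvalues):
    """Split a spectrum at the largest gap between consecutive sorted eigenvalues."""
    eigs = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    gaps = eigs[:-1] - eigs[1:]
    k = int(np.argmax(gaps)) + 1  # number of eigenvalues above the largest gap
    return eigs[:k], eigs[k:]

# Synthetic spectrum (not data from the paper): 3 spikes over a near-zero bulk.
rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 1e-3, size=97)
true_spikes = np.array([2.0, 1.5, 1.2])
spikes, rest = split_bulk_and_spikes(np.concatenate([bulk, true_spikes]))
```

If the claim holds, the number of entries in `spikes` should match the number of classes of the task.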

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same patch-count effect might appear in other quantum models that split input data across multiple circuits.
  • Hardware experiments could test whether noise alters the observed reduction in largest eigenvalue.
  • The structural similarity to classical neural-network Hessians suggests classical regularization ideas could transfer to quantum settings.

Load-bearing premise

Aggregating outputs from independent patches preserves the essential curvature properties without new cross-patch correlations that would change the Hessian spectrum.

What would settle it

Measure the largest Hessian eigenvalue at converged minima for the same task while varying only the number of patches and check whether it decreases as patch count rises.
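
The measurement proposed above could be sketched with power iteration on finite-difference Hessian-vector products. This is a generic numerical recipe under stated assumptions, not the tooling used in the paper: `loss` is any scalar function of a flat parameter vector, and the step sizes are illustrative defaults.

```python
import numpy as np

def grad_fd(loss, theta, eps=1e-5):
    """Central finite-difference gradient of a scalar loss."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def hvp(loss, theta, v, eps=1e-4):
    """Hessian-vector product from a central difference of gradients."""
    return (grad_fd(loss, theta + eps * v) - grad_fd(loss, theta - eps * v)) / (2 * eps)

def largest_hessian_eigenvalue(loss, theta, iters=100, seed=0):
    """Power iteration; converges to the eigenvalue of largest magnitude."""
    v = np.random.default_rng(seed).normal(size=theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(loss, theta, v)
        lam = float(v @ hv)          # Rayleigh quotient for the current direction
        v = hv / np.linalg.norm(hv)
    return lam

# Sanity check on a quadratic whose Hessian is diag(3, 1, 0.5).
A = np.diag([3.0, 1.0, 0.5])

def quadratic(t):
    return 0.5 * t @ A @ t

lam_max = largest_hessian_eigenvalue(quadratic, np.zeros(3))
```

At a minimum the Hessian is positive semidefinite, so the largest-magnitude eigenvalue returned here is also the largest eigenvalue. Repeating the estimate at converged parameters for several patch counts, with everything else fixed, gives the comparison described above.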

Original abstract

Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for classical data, we distribute overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via theoretical and empirical Hessian analyses and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we theoretically derive and empirically demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. Furthermore, our analysis of the full Hessian eigenspectrum reveals a structure consisting of a bulk of near-zero eigenvalues and distinct outlier spikes corresponding to the number of classes, similar to classical deep learning models. These findings suggest that our distributed patch approach acts as a form of implicit structural regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and scalable quantum machine learning models for classical data tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the loss landscape geometry of distributed quantum neural networks that process classical data by splitting it into overlapping local patches, each handled by an independent QNN whose outputs are aggregated. It claims that increasing the number of parameters produces deeper and sharper landscapes, while increasing the number of patches reduces the largest Hessian eigenvalue at minima (theoretically derived and empirically confirmed), acting as implicit structural regularization; the full Hessian eigenspectrum exhibits a bulk of near-zero eigenvalues plus outlier spikes whose count matches the number of classes, mirroring classical deep networks.

Significance. If the central derivation and empirical results hold, the work supplies concrete architectural guidance for improving trainability and stability in quantum machine learning on classical data. The explicit link between patch count and Hessian curvature, together with the reported spectral structure, offers a falsifiable prediction and a bridge to classical deep-learning analyses of loss landscapes; the provision of both theoretical derivation and Hessian-based empirical checks is a strength.

major comments (2)
  1. [Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.
  2. [Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).
minor comments (2)
  1. [Figures] Figure captions should explicitly state the color scale and normalization used for the loss-landscape visualizations.
  2. [Notation] Notation for the aggregated loss function should be introduced once and used consistently; the current text occasionally switches between summation and concatenation symbols.
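
The empirical check requested in major comment 1 could look roughly like the following: partition a full Hessian into per-patch parameter blocks and compare the norms of the diagonal and off-diagonal blocks. The matrix below is a toy example invented for illustration, and the choice of the Frobenius norm is an assumption.

```python
import numpy as np

def block_norm_summary(hessian, block_sizes):
    """Largest Frobenius norm among intra-patch and among cross-patch blocks."""
    edges = np.concatenate([[0], np.cumsum(block_sizes)])
    intra, cross = [], []
    for p in range(len(block_sizes)):
        for q in range(len(block_sizes)):
            block = hessian[edges[p]:edges[p + 1], edges[q]:edges[q + 1]]
            (intra if p == q else cross).append(np.linalg.norm(block))
    max_cross = max(cross) if cross else 0.0  # a single patch has no cross blocks
    return max(intra), max_cross

# Toy Hessian (not from the paper): two patches with two parameters each.
H = np.array([[2.00, 0.10, 0.01, 0.00],
              [0.10, 1.00, 0.00, 0.02],
              [0.01, 0.00, 1.50, 0.20],
              [0.00, 0.02, 0.20, 0.50]])
max_intra, max_cross = block_norm_summary(H, [2, 2])
```

A small ratio `max_cross / max_intra` across patch counts would support treating the cross terms as negligible; a ratio near one would not.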

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have addressed each of the major comments point by point below, and we believe these revisions will enhance the clarity and robustness of our findings.

Point-by-point responses
  1. Referee: [Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.

    Authors: We appreciate the referee pointing out the potential impact of cross-patch second-derivative terms due to overlapping patches. Our theoretical derivation primarily considers the contribution from individual patches and the effect of aggregation in reducing the curvature. While cross terms exist, for the mean-squared error or cross-entropy loss functions used, these terms are proportional to the product of gradients from different patches and tend to average out or remain smaller in magnitude as P increases. To strengthen the manuscript, we will add an empirical demonstration in the revised version by computing the norm of the cross-block Hessians and showing they are significantly smaller than the intra-patch blocks, thus not altering the inverse scaling with P. This addresses the concern without requiring a full analytical bound, which would be complex given the quantum circuit specifics. revision: partial

  2. Referee: [Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).

    Authors: We agree that providing statistical details is essential for validating the empirical findings. In the revised manuscript, we will update the results section to include error bars indicating the standard deviation over 10 independent optimization runs with different random seeds for each patch count. We will also explicitly state the optimization procedure, including that minima were located using the Adam optimizer with a fixed learning rate and that runs failing to reach a loss below a threshold of 0.1 were excluded from the Hessian analysis (affecting less than 5% of runs). The reported values will be the mean largest eigenvalue with standard deviations, confirming the consistent reduction with increasing P. revision: yes
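
The reporting protocol promised in the second response can be sketched as below. The loss threshold of 0.1 and the ten seeds come from the response; the numbers in the example are made up for illustration and are not results from the paper.

```python
import numpy as np

def summarize_runs(final_losses, top_eigenvalues, loss_threshold=0.1):
    """Mean and sample std of the top eigenvalue over runs below the loss threshold."""
    losses = np.asarray(final_losses, dtype=float)
    eigs = np.asarray(top_eigenvalues, dtype=float)
    keep = losses < loss_threshold          # exclude runs that did not converge
    kept = eigs[keep]
    return {
        "n_kept": int(keep.sum()),
        "excluded_fraction": float(1.0 - keep.mean()),
        "mean": float(kept.mean()),
        "std": float(kept.std(ddof=1)),     # sample standard deviation
    }

# Made-up values for 10 seeds at one patch count; one run misses the threshold.
losses = [0.05, 0.04, 0.06, 0.30, 0.05, 0.07, 0.03, 0.05, 0.06, 0.04]
eigs = [1.0, 1.2, 0.9, 5.0, 1.1, 1.0, 0.8, 1.0, 1.1, 0.9]
summary = summarize_runs(losses, eigs)
```

Reporting `excluded_fraction` next to the mean and standard deviation makes it visible whether the exclusion rule itself could be shaping the trend.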

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation of patch-count effect on Hessian spectrum

Full rationale

The paper presents an explicit theoretical derivation that the aggregated loss L = f(∑_p QNN_p(patch_p)) yields a largest Hessian eigenvalue that decreases with patch count P, together with an empirical Hessian analysis and loss-landscape visualization. This chain is constructed from the architecture definition and standard second-derivative expansion; it does not reduce to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation whose content is itself unverified. The eigenspectrum comparison to classical models is presented as an observed structural similarity rather than a renaming or smuggling of an ansatz. Because the central claim rests on an independent analytic step whose assumptions are stated and whose outputs are checked against direct computation, the derivation remains self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard quantum circuit assumptions and the validity of the Hessian as a local curvature descriptor; no new particles or forces are introduced.

free parameters (2)
  • number of patches
    Varied experimentally to observe effect on Hessian; treated as a controllable architectural hyperparameter rather than fitted constant.
  • number of parameters per QNN
    Varied to study depth of loss landscape; chosen as independent variable.
axioms (2)
  • domain assumption: The loss landscape of a variational quantum circuit can be meaningfully characterized by its Hessian at critical points.
    Invoked when performing theoretical and empirical Hessian analyses.
  • domain assumption: Aggregation of independent patch outputs does not introduce dominant cross-term correlations that invalidate the single-patch curvature analysis.
    Required for the claim that increasing patches reduces the global largest eigenvalue.

