The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks
Pith reviewed 2026-05-22 18:49 UTC · model grok-4.3
The pith
Increasing the number of local patches reduces the largest Hessian eigenvalue at minima in distributed quantum neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima, derived from the aggregation of independent patch outputs and verified through Hessian analysis and loss-landscape visualization.
What carries the argument
The distributed architecture that processes overlapping local patches with separate quantum neural networks and aggregates their outputs for the final prediction.
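The architecture described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the square patch shape, the `size`/`stride` parameters, the mean aggregator, and `toy_patch_model`, which is a hypothetical classical stand-in for a single quantum neural network rather than the paper's circuit.

```python
import math
import random


def extract_patches(image, size, stride):
    """Return flattened square patches; stride < size makes neighbours overlap."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            patches.append(
                [image[r + i][c + j] for i in range(size) for j in range(size)]
            )
    return patches


def toy_patch_model(params, patch):
    """Hypothetical stand-in for one QNN: one bounded score per class."""
    return [
        math.cos(sum(w * x for w, x in zip(class_weights, patch)))
        for class_weights in params
    ]


def distributed_forward(all_params, image, size, stride):
    """Run one independent model per patch and average the class scores."""
    patches = extract_patches(image, size, stride)
    if len(patches) != len(all_params):
        raise ValueError("need exactly one parameter set per patch")
    outputs = [toy_patch_model(p, patch) for p, patch in zip(all_params, patches)]
    n_classes = len(outputs[0])
    # ASSUMPTION: the aggregator is a plain mean over patch outputs.
    return [sum(o[c] for o in outputs) / len(outputs) for c in range(n_classes)]


# Usage: a 4x4 input, 3x3 patches with stride 1 gives 4 overlapping patches.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
rng = random.Random(0)
n_classes = 2
all_params = [
    [[rng.uniform(-0.1, 0.1) for _ in range(9)] for _ in range(n_classes)]
    for _ in range(4)
]
scores = distributed_forward(all_params, image, size=3, stride=1)
print(scores)
```

Each patch has its own parameter set, so the models share input pixels where patches overlap but never share trainable parameters.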
If this is right
- More parameters produce deeper and sharper loss landscapes.
- Higher patch counts lower the dominant Hessian eigenvalue and promote optimization stability.
- The Hessian eigenspectrum consists of a bulk of near-zero values plus outlier spikes matching the number of classes.
Where Pith is reading between the lines
- The same patch-count effect might appear in other quantum models that split input data across multiple circuits.
- Hardware experiments could test whether noise alters the observed reduction in largest eigenvalue.
- The structural similarity to classical neural-network Hessians suggests classical regularization ideas could transfer to quantum settings.
Load-bearing premise
Aggregating outputs from independent patches preserves the essential curvature properties without new cross-patch correlations that would change the Hessian spectrum.
What would settle it
Measure the largest Hessian eigenvalue at converged minima for the same task while varying only the number of patches and check whether it decreases as patch count rises.
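A toy version of this measurement might look like the sketch below. It is not the paper's experiment: each patch model is replaced by `cos(theta + x)` (the Z expectation of a single qubit after an RY rotation), the aggregator is assumed to be a mean, the loss is a squared error on one sample, and the target is chosen so that the sampled parameters sit at an exact zero-loss minimum. The Hessian is built by central finite differences and its top eigenvalue is estimated by power iteration.

```python
import math
import random


def patch_output(theta, x):
    # Stand-in for one patch model: cos(theta + x).
    return math.cos(theta + x)


def loss(thetas, xs, y):
    # ASSUMPTION: outputs are averaged and scored with a squared error.
    f = sum(patch_output(t, x) for t, x in zip(thetas, xs)) / len(thetas)
    return 0.5 * (f - y) ** 2


def hessian_fd(fn, point, h=1e-3):
    # Dense Hessian by central finite differences.
    n = len(point)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(si, sj):
                p = list(point)
                p[i] += si * h
                p[j] += sj * h
                return fn(p)
            H[i][j] = H[j][i] = (
                shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)
            ) / (4 * h * h)
    return H


def top_eigenvalue(H, iters=200, seed=0):
    # Power iteration; returns the Rayleigh quotient of the final vector.
    # It targets the eigenvalue of largest magnitude, which is the largest
    # eigenvalue when the Hessian is positive semidefinite (as at a minimum).
    rng = random.Random(seed)
    n = len(H)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    Hv = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, Hv))


def measure(P, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(P)]
    thetas = [rng.uniform(0.5, 2.5) for _ in range(P)]
    # Pick the target so that thetas is an exact zero-loss minimum.
    y = sum(patch_output(t, x) for t, x in zip(thetas, xs)) / P
    H = hessian_fd(lambda p: loss(p, xs, y), thetas)
    # Closed form for this toy model only: (1 / P^2) * sum_p sin^2(theta_p + x_p).
    analytic = sum(math.sin(t + x) ** 2 for t, x in zip(thetas, xs)) / P ** 2
    return top_eigenvalue(H), analytic


for P in (1, 2, 4, 8):
    numeric, analytic = measure(P)
    print(P, round(numeric, 6), round(analytic, 6))
```

In this toy setting the top eigenvalue is bounded by 1/P, which mirrors the claimed trend but depends entirely on the assumed mean aggregator. A real test would replace `patch_output` with the trained circuits and, for larger models, use Hessian-vector products instead of a dense finite-difference Hessian.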
Original abstract
Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for classical data, we distribute overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via theoretical and empirical Hessian analyses and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we theoretically derive and empirically demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. Furthermore, our analysis of the full Hessian eigenspectrum reveals a structure consisting of a bulk of near-zero eigenvalues and distinct outlier spikes corresponding to the number of classes, similar to classical deep learning models. These findings suggest that our distributed patch approach acts as a form of implicit structural regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and scalable quantum machine learning models for classical data tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the loss landscape geometry of distributed quantum neural networks that process classical data by splitting it into overlapping local patches, each handled by an independent QNN whose outputs are aggregated. It claims that increasing the number of parameters produces deeper and sharper landscapes, while increasing the number of patches reduces the largest Hessian eigenvalue at minima (theoretically derived and empirically confirmed), acting as implicit structural regularization; the full Hessian eigenspectrum exhibits a bulk of near-zero eigenvalues plus outlier spikes whose count matches the number of classes, mirroring classical deep networks.
Significance. If the central derivation and empirical results hold, the work supplies concrete architectural guidance for improving trainability and stability in quantum machine learning on classical data. The explicit link between patch count and Hessian curvature, together with the reported spectral structure, offers a falsifiable prediction and a bridge to classical deep-learning analyses of loss landscapes; the provision of both theoretical derivation and Hessian-based empirical checks is a strength.
major comments (2)
- [Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.
- [Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).
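The empirical check requested in the first major comment could be sketched as follows: split a full Hessian into per-patch blocks and compare the Frobenius norms of the cross-patch blocks with those of the intra-patch blocks. The function names, the block layout (parameters ordered patch by patch), and the example matrix are all illustrative assumptions, not values from the manuscript.

```python
import math


def block_slices(block_sizes):
    """Turn per-patch parameter counts into (start, stop) index ranges."""
    start, out = 0, []
    for size in block_sizes:
        out.append((start, start + size))
        start += size
    return out


def block_frobenius_norms(H, block_sizes):
    """Frobenius norm of every (p, q) block of a square matrix H."""
    slices = block_slices(block_sizes)
    norms = [[0.0] * len(slices) for _ in slices]
    for p, (r0, r1) in enumerate(slices):
        for q, (c0, c1) in enumerate(slices):
            norms[p][q] = math.sqrt(
                sum(H[i][j] ** 2 for i in range(r0, r1) for j in range(c0, c1))
            )
    return norms


def cross_to_intra_ratio(H, block_sizes):
    """Largest cross-patch block norm divided by largest intra-patch block norm."""
    norms = block_frobenius_norms(H, block_sizes)
    n = len(norms)
    intra = max(norms[p][p] for p in range(n))
    cross = max(
        (norms[p][q] for p in range(n) for q in range(n) if p != q), default=0.0
    )
    return cross / intra


# Illustrative 4x4 matrix with two patches of two parameters each.
H_example = [
    [2.0, 0.0, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.1, 0.0, 3.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
ratio = cross_to_intra_ratio(H_example, [2, 2])
print(ratio)
```

A ratio well below one at every reported minimum would support treating the cross terms as negligible; a ratio near one would mean the per-patch analysis does not control the largest eigenvalue.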
minor comments (2)
- [Figures] Figure captions should explicitly state the color scale and normalization used for the loss-landscape visualizations.
- [Notation] Notation for the aggregated loss function should be introduced once and used consistently; the current text occasionally switches between summation and concatenation symbols.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have addressed each of the major comments point by point below, and we believe these revisions will enhance the clarity and robustness of our findings.
point-by-point responses
- Referee: [Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.
  Authors: We appreciate the referee pointing out the potential impact of cross-patch second-derivative terms due to overlapping patches. Our theoretical derivation primarily considers the contribution from individual patches and the effect of aggregation in reducing the curvature. While cross terms exist, for the mean-squared error or cross-entropy loss functions used, these terms are proportional to the product of gradients from different patches and tend to average out or remain smaller in magnitude as P increases. To strengthen the manuscript, we will add an empirical demonstration in the revised version by computing the norm of the cross-block Hessians and showing they are significantly smaller than the intra-patch blocks, thus not altering the inverse scaling with P. This addresses the concern without requiring a full analytical bound, which would be complex given the quantum circuit specifics.
  Revision: partial
- Referee: [Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).
  Authors: We agree that providing statistical details is essential for validating the empirical findings. In the revised manuscript, we will update the results section to include error bars indicating the standard deviation over 10 independent optimization runs with different random seeds for each patch count. We will also explicitly state the optimization procedure, including that minima were located using the Adam optimizer with a fixed learning rate and that runs failing to reach a loss below a threshold of 0.1 were excluded from the Hessian analysis (affecting less than 5% of runs). The reported values will be the mean largest eigenvalue with standard deviations, confirming the consistent reduction with increasing P.
  Revision: yes
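The reporting procedure described in the second response could be sketched as below. The 0.1 loss threshold comes from the rebuttal; the function name, the data layout (a list of `(final_loss, largest_eigenvalue)` pairs per patch count), the use of the sample standard deviation, and the example numbers are assumptions. The numbers are made-up placeholders, not results from the paper.

```python
import statistics

LOSS_THRESHOLD = 0.1  # runs that do not get below this final loss are excluded


def summarize(runs_by_patch_count, threshold=LOSS_THRESHOLD):
    """Mean and sample standard deviation of the top eigenvalue per patch count."""
    summary = {}
    for patch_count, runs in sorted(runs_by_patch_count.items()):
        kept = [eig for final_loss, eig in runs if final_loss < threshold]
        summary[patch_count] = {
            "n_total": len(runs),
            "n_kept": len(kept),
            "mean": statistics.mean(kept) if kept else None,
            # stdev needs at least two data points.
            "std": statistics.stdev(kept) if len(kept) > 1 else None,
        }
    return summary


# Placeholder data for illustration only: (final_loss, largest_eigenvalue).
example_runs = {
    4: [(0.05, 1.0), (0.08, 1.2), (0.30, 5.0)],
    9: [(0.04, 0.5), (0.06, 0.7)],
}
summary = summarize(example_runs)
for patch_count, stats in summary.items():
    print(patch_count, stats)
```

Reporting `n_total` next to `n_kept` makes the exclusion rate visible for every patch count, so a reader can check that the filter is not removing different fractions of runs at different P.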
Circularity Check
No significant circularity in theoretical derivation of patch-count effect on Hessian spectrum
full rationale
The paper presents an explicit theoretical derivation that the aggregated loss L = f(∑_p QNN_p(patch_p)) yields a largest Hessian eigenvalue that decreases with patch count P, together with an empirical Hessian analysis and loss-landscape visualization. This chain is constructed from the architecture definition and standard second-derivative expansion; it does not reduce to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation whose content is itself unverified. The eigenspectrum comparison to classical models is presented as an observed structural similarity rather than a renaming or smuggling of an ansatz. Because the central claim rests on an independent analytic step whose assumptions are stated and whose outputs are checked against direct computation, the derivation remains self-contained.
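The "standard second-derivative expansion" mentioned above can be written out in a simplified setting. The following is a sketch under assumptions that are not taken from the paper: a scalar model output, a single training sample, a separate parameter vector \theta_p for each patch, and q_p standing for QNN_p(patch_p).

```latex
% Sum aggregation, matching L = f(\sum_p QNN_p(patch_p)):
s(\theta) = \sum_{p=1}^{P} q_p(\theta_p), \qquad L(\theta) = f\bigl(s(\theta)\bigr)

% Gradient block for patch p (chain rule):
\nabla_{\theta_p} L = f'(s)\, \nabla q_p

% Hessian block for patches p and q:
\nabla_{\theta_p} \nabla_{\theta_q}^{\top} L
  = f''(s)\, \nabla q_p \, \nabla q_q^{\top}
  + \delta_{pq}\, f'(s)\, \nabla^2 q_p

% At a minimum with f'(s) = 0 the Hessian is rank one:
H = f''(s)\, J J^{\top}, \quad J = (\nabla q_1; \dots; \nabla q_P), \qquad
\lambda_{\max}(H) = f''(s) \sum_{p=1}^{P} \lVert \nabla q_p \rVert^2

% ASSUMPTION: mean aggregator, s = \frac{1}{P} \sum_p q_p. Then
\lambda_{\max}(H) = \frac{f''(s)}{P^2} \sum_{p=1}^{P} \lVert \nabla q_p \rVert^2
  \le \frac{f''(s)}{P} \max_p \lVert \nabla q_p \rVert^2
```

Two features of this sketch are worth noting. The cross-patch blocks are outer products of per-patch gradients, so they are generally nonzero even when patches do not overlap. The 1/P bound also relies on the assumed normalization: with an unnormalized sum, the same expression can grow with P.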
Axiom & Free-Parameter Ledger
free parameters (2)
- number of patches
- number of parameters per QNN
axioms (2)
- domain assumption: The loss landscape of a variational quantum circuit can be meaningfully characterized by its Hessian at critical points.
- domain assumption: Aggregation of independent patch outputs does not introduce dominant cross-term correlations that invalidate the single-patch curvature analysis.