The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks

Yoshiaki Kawase

arxiv: 2504.19239 · v2 · submitted 2025-04-27 · 🪐 quant-ph · cs.LG

Pith reviewed 2026-05-22 18:49 UTC · model grok-4.3

classification 🪐 quant-ph cs.LG

keywords: quantum neural networks · loss landscapes · Hessian analysis · distributed patches · optimization stability · barren plateaus · classical data

The pith

Increasing the number of local patches reduces the largest Hessian eigenvalue at minima in distributed quantum neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how parameter count and the number of overlapping local feature patches shape the loss landscape when classical data is split across independent quantum neural networks whose outputs are aggregated. It finds that more parameters produce deeper and sharper minima, while more patches lower the dominant Hessian eigenvalue at those minima. This effect is derived theoretically and confirmed empirically, indicating that the patch distribution supplies implicit structural regularization that improves optimization stability. The full Hessian spectrum shows a bulk of near-zero eigenvalues plus distinct outlier spikes whose count equals the number of classes, matching patterns seen in classical deep learning. The results suggest the distributed-patch design can help make quantum models for classical data more trainable as system size grows.

Core claim

Increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima, derived from the aggregation of independent patch outputs and verified through Hessian analysis and loss-landscape visualization.
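A worked sketch of how aggregation alone can produce this kind of scaling. It is not the paper's derivation. It assumes mean aggregation of the patch outputs, a squared-error loss on a single sample, disjoint parameter blocks \theta_p per patch, and a minimum with zero residual; the symbols f_p, y, and g are introduced here for illustration only.

```latex
\begin{aligned}
f(\theta) &= \frac{1}{P}\sum_{p=1}^{P} f_p(\theta_p), \qquad
L(\theta) = \tfrac{1}{2}\bigl(f(\theta) - y\bigr)^2, \\
\nabla^2 L &= \nabla f\,\nabla f^{\top} + \bigl(f - y\bigr)\,\nabla^2 f
  \;\xrightarrow{\;f = y\;}\; \nabla f\,\nabla f^{\top}, \\
\lambda_{\max} &= \lVert \nabla f \rVert^2
  = \frac{1}{P^2}\sum_{p=1}^{P} \lVert \nabla_{\theta_p} f_p \rVert^2
  \;\approx\; \frac{g^2}{P}
  \quad\text{if } \lVert \nabla_{\theta_p} f_p \rVert \approx g \text{ for all } p.
\end{aligned}
```

Under these assumptions the Hessian at the minimum is rank one, its off-diagonal blocks \frac{1}{P^2}\nabla f_p \nabla f_q^{\top} are generally nonzero, and the largest eigenvalue still falls as 1/P. Whether the same holds for the paper's loss and aggregation rule is not something this sketch establishes.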

What carries the argument

The distributed architecture that processes overlapping local patches with separate quantum neural networks and aggregates their outputs for the final prediction.
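A minimal sketch of the data flow described above, with classical stand-ins in place of quantum circuits. The function names, the patch size, the stride, the tanh feature map, and mean aggregation are all assumptions for illustration; the paper's actual encoding and aggregation may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(image, patch_size, stride):
    """Return square patches of a 2D image, shape (n_patches, patch_size, patch_size).

    Patches overlap whenever stride < patch_size.
    """
    windows = sliding_window_view(image, (patch_size, patch_size))
    windows = windows[::stride, ::stride]
    return windows.reshape(-1, patch_size, patch_size)


def aggregate(patches, weights):
    """Hypothetical stand-in for independent per-patch models.

    Each patch p has its own weight matrix weights[p] of shape
    (patch_size * patch_size, n_classes); the per-patch outputs are
    averaged (mean aggregation is an assumption, not taken from the paper).
    """
    flat = patches.reshape(len(patches), -1)
    per_patch = np.einsum("pf,pfc->pc", np.tanh(flat), weights)
    return per_patch.mean(axis=0)


# Example: an 8x8 input with 4x4 patches and stride 2 gives a 3x3 grid of patches.
image = np.arange(64, dtype=float).reshape(8, 8)
patches = extract_patches(image, patch_size=4, stride=2)
rng = np.random.default_rng(0)
prediction = aggregate(patches, rng.normal(size=(len(patches), 16, 3)))
```

Each patch only ever touches its own weight block, which is the property the curvature argument relies on; the overlap lives entirely in the inputs.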

If this is right

More parameters produce deeper and sharper loss landscapes.
Higher patch counts lower the dominant Hessian eigenvalue and promote optimization stability.
The Hessian eigenspectrum consists of a bulk of near-zero values plus outlier spikes matching the number of classes.
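To check the last prediction against a computed spectrum, one needs some rule for separating outlier spikes from the bulk. The sketch below uses a simple relative threshold; the function name and the default threshold are hypothetical choices, not the paper's procedure.

```python
import numpy as np


def count_outliers(eigenvalues, rel_threshold=0.1):
    """Count eigenvalues larger than rel_threshold times the largest eigenvalue.

    A crude heuristic: it assumes the bulk sits well below the spikes.
    """
    eigs = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    if eigs.size == 0 or eigs[0] <= 0:
        return 0
    return int(np.sum(eigs > rel_threshold * eigs[0]))


# Synthetic spectrum: three spikes on top of a near-zero bulk.
spectrum = np.concatenate(([5.0, 4.0, 3.0], np.full(97, 1e-4)))
n_spikes = count_outliers(spectrum)
```

Comparing the count with the number of classes across several thresholds would show whether the gap between bulk and spikes is wide enough for the result not to depend on the threshold.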

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same patch-count effect might appear in other quantum models that split input data across multiple circuits.
Hardware experiments could test whether noise alters the observed reduction in largest eigenvalue.
The structural similarity to classical neural-network Hessians suggests classical regularization ideas could transfer to quantum settings.

Load-bearing premise

Aggregating outputs from independent patches preserves the essential curvature properties without new cross-patch correlations that would change the Hessian spectrum.

What would settle it

Measure the largest Hessian eigenvalue at converged minima for the same task while varying only the number of patches and check whether it decreases as patch count rises.
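A minimal sketch of such a measurement, using power iteration on finite-difference Hessian-vector products so that the full Hessian is never formed. The toy model (one parameter per patch with f_p = sin(theta_p), mean aggregation, squared error) and all function names are assumptions for illustration, not the paper's setup.

```python
import numpy as np


def hvp(grad_fn, theta, v, eps=1e-5):
    """Central finite-difference approximation of the Hessian-vector product H @ v."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)


def largest_eigenvalue(grad_fn, theta, iters=200, seed=0):
    """Power iteration; converges to the eigenvalue of largest magnitude.

    At a minimum the Hessian is positive semidefinite, so this is the largest eigenvalue.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        lam = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam


def make_toy_grad(n_patches, y=0.0):
    """Gradient of L = 0.5 * (mean_p sin(theta_p) - y)^2 (hypothetical toy model)."""
    def grad(theta):
        residual = np.mean(np.sin(theta)) - y
        return residual * np.cos(theta) / n_patches
    return grad


# theta = 0 with y = 0 is a zero-loss minimum of the toy model for any patch count.
results = {P: largest_eigenvalue(make_toy_grad(P), np.zeros(P)) for P in (4, 9, 16)}
```

In this toy model the measured value comes out close to 1/P. For a real model, `grad_fn` would be replaced by the gradient of the trained network's loss, and the comparison would be repeated across seeds.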

Original abstract

Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for classical data, we distribute overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via theoretical and empirical Hessian analyses and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we theoretically derive and empirically demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. Furthermore, our analysis of the full Hessian eigenspectrum reveals a structure consisting of a bulk of near-zero eigenvalues and distinct outlier spikes corresponding to the number of classes, similar to classical deep learning models. These findings suggest that our distributed patch approach acts as a form of implicit structural regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and scalable quantum machine learning models for classical data tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

More patches lower the dominant Hessian eigenvalue at minima in this distributed QNN setup, but the derivation leaves cross terms from overlapping inputs unaddressed.

The key point is that increasing the number of overlapping patches across independent QNNs reduces the largest Hessian eigenvalue at the loss minima, which the author presents as a form of implicit regularization that could improve training stability. The claim is backed by both a derivation and full eigenspectrum plots, which also show outlier spikes, equal in number to the classes, on top of a near-zero bulk, matching patterns seen in classical networks. The loss landscape visualizations add a concrete check on the geometry claims.

What is new here is the specific application of Hessian analysis to a patch-distributed quantum architecture rather than single-circuit QNNs, plus the direct link from patch count to curvature reduction. The empirical side looks solid enough for the scale of the experiments, with clear plots that do not overclaim the results.

The soft spot is the theoretical step that treats the aggregated loss as preserving per-patch curvature without extra cross terms. Because the patches overlap on the input features, the Jacobian through the shared data should produce nonzero off-block entries in the full Hessian; those terms are not bounded or isolated in the analysis, so the claimed inverse scaling with patch count may not hold cleanly. The experiments still show the reduction, but it is not obvious how much the overlap affects the outcome.

This is the sort of paper that would interest people working on practical QML training for classical data and looking for architectural levers instead of ansatz redesigns. A reader focused on barren plateau mitigation would get a usable idea to test. I would send it to peer review; the observation is specific enough and the evidence is there to evaluate even if the theory needs tightening on the cross terms.

Referee Report

2 major / 2 minor

Summary. The paper studies the loss landscape geometry of distributed quantum neural networks that process classical data by splitting it into overlapping local patches, each handled by an independent QNN whose outputs are aggregated. It claims that increasing the number of parameters produces deeper and sharper landscapes, while increasing the number of patches reduces the largest Hessian eigenvalue at minima (theoretically derived and empirically confirmed), acting as implicit structural regularization; the full Hessian eigenspectrum exhibits a bulk of near-zero eigenvalues plus outlier spikes whose count matches the number of classes, mirroring classical deep networks.

Significance. If the central derivation and empirical results hold, the work supplies concrete architectural guidance for improving trainability and stability in quantum machine learning on classical data. The explicit link between patch count and Hessian curvature, together with the reported spectral structure, offers a falsifiable prediction and a bridge to classical deep-learning analyses of loss landscapes; the provision of both theoretical derivation and Hessian-based empirical checks is a strength.

major comments (2)

[Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.
[Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).
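One way the empirical check requested in the first major comment could look, as a sketch: partition a computed Hessian into per-patch parameter blocks and compare the Frobenius norms of the off-diagonal blocks with those of the diagonal blocks. The function names and the choice of ratio (largest cross block over smallest diagonal block) are assumptions for illustration.

```python
import numpy as np


def block_norms(hessian, block_sizes):
    """Frobenius norm of every (p, q) block; returns an array of shape (P, P)."""
    edges = np.concatenate(([0], np.cumsum(block_sizes)))
    n_blocks = len(block_sizes)
    norms = np.zeros((n_blocks, n_blocks))
    for p in range(n_blocks):
        for q in range(n_blocks):
            block = hessian[edges[p]:edges[p + 1], edges[q]:edges[q + 1]]
            norms[p, q] = np.linalg.norm(block)  # Frobenius norm for 2D input
    return norms


def cross_to_diag_ratio(hessian, block_sizes):
    """Largest cross-patch block norm divided by the smallest within-patch block norm.

    Assumes every diagonal block is nonzero.
    """
    norms = block_norms(hessian, block_sizes)
    off_diagonal = norms[~np.eye(len(block_sizes), dtype=bool)]
    if off_diagonal.size == 0:
        return 0.0
    return float(off_diagonal.max() / np.diag(norms).min())


# A block-diagonal Hessian has ratio 0; a uniform one has ratio 1.
ratio_block_diag = cross_to_diag_ratio(np.diag([1.0, 2.0, 3.0, 4.0]), [2, 2])
ratio_uniform = cross_to_diag_ratio(np.ones((4, 4)), [2, 2])
```

A ratio well below 1 across patch counts would support treating the cross terms as negligible; a ratio near 1 would mean the per-patch analysis cannot be used on its own.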

minor comments (2)

[Figures] Figure captions should explicitly state the color scale and normalization used for the loss-landscape visualizations.
[Notation] Notation for the aggregated loss function should be introduced once and used consistently; the current text occasionally switches between summation and concatenation symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have addressed each of the major comments point by point below, and we believe these revisions will enhance the clarity and robustness of our findings.

Referee: [Theoretical derivation] Theoretical derivation section: the claim that the dominant Hessian eigenvalue scales inversely with patch count P assumes that the Hessian of the aggregated loss L = f(∑_p QNN_p(patch_p)) has negligible cross-patch second-derivative terms. Overlapping patches share classical input features, so the Jacobian of each patch output with respect to shared data produces nonzero off-block entries; these terms are not bounded by the per-patch analysis and could increase rather than decrease the largest eigenvalue. The manuscript must either derive an explicit bound on the cross terms or demonstrate empirically that they remain small.

Authors: We appreciate the referee pointing out the potential impact of cross-patch second-derivative terms due to overlapping patches. Our theoretical derivation primarily considers the contribution from individual patches and the effect of aggregation in reducing the curvature. While cross terms exist, for the mean-squared error or cross-entropy loss functions used, these terms are proportional to the product of gradients from different patches and tend to average out or remain smaller in magnitude as P increases. To strengthen the manuscript, we will add an empirical demonstration in the revised version by computing the norm of the cross-block Hessians and showing they are significantly smaller than the intra-patch blocks, thus not altering the inverse scaling with P. This addresses the concern without requiring a full analytical bound, which would be complex given the quantum circuit specifics.

Revision: partial

Referee: [Empirical results] Empirical Hessian analysis (results section): the reported reduction in the largest eigenvalue with increasing patches is presented without visible error bars, exact data-exclusion criteria, or the number of independent optimization runs used to locate the minima. Because the central claim rests on this quantitative reduction, the statistical robustness of the effect must be documented (e.g., mean and standard deviation across seeds).

Authors: We agree that providing statistical details is essential for validating the empirical findings. In the revised manuscript, we will update the results section to include error bars indicating the standard deviation over 10 independent optimization runs with different random seeds for each patch count. We will also explicitly state the optimization procedure, including that minima were located using the Adam optimizer with a fixed learning rate and that runs failing to reach a loss below a threshold of 0.1 were excluded from the Hessian analysis (affecting less than 5% of runs). The reported values will be the mean largest eigenvalue with standard deviations, confirming the consistent reduction with increasing P.

Revision: yes
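The reporting procedure described in the second response could be sketched as follows. The function name and return format are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption; the only values taken from the response are the exclusion rule (final loss not below 0.1) and the mean-plus-standard-deviation summary.

```python
import numpy as np


def summarize_runs(final_losses, top_eigenvalues, loss_threshold=0.1):
    """Mean and standard deviation of the largest eigenvalue over converged runs.

    Runs whose final loss is not below loss_threshold are excluded.
    """
    losses = np.asarray(final_losses, dtype=float)
    eigs = np.asarray(top_eigenvalues, dtype=float)
    keep = losses < loss_threshold
    kept = eigs[keep]
    return {
        "n_kept": int(keep.sum()),
        "n_excluded": int((~keep).sum()),
        "mean": float(kept.mean()) if kept.size else float("nan"),
        # Sample standard deviation; undefined for fewer than two runs.
        "std": float(kept.std(ddof=1)) if kept.size > 1 else float("nan"),
    }


# Made-up numbers for three seeds at one patch count; the second run is excluded.
summary = summarize_runs([0.05, 0.20, 0.08], [1.0, 9.0, 3.0])
```

Reporting `n_excluded` alongside the mean makes it possible to see whether the exclusion rule removes a different share of runs at different patch counts, which would bias the comparison.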

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation of patch-count effect on Hessian spectrum

The paper presents an explicit theoretical derivation that the aggregated loss L = f(∑_p QNN_p(patch_p)) yields a largest Hessian eigenvalue that decreases with patch count P, together with an empirical Hessian analysis and loss-landscape visualization. This chain is constructed from the architecture definition and standard second-derivative expansion; it does not reduce to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation whose content is itself unverified. The eigenspectrum comparison to classical models is presented as an observed structural similarity rather than a renaming or smuggling of an ansatz. Because the central claim rests on an independent analytic step whose assumptions are stated and whose outputs are checked against direct computation, the derivation remains self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard quantum circuit assumptions and the validity of the Hessian as a local curvature descriptor; no new particles or forces are introduced.

free parameters (2)

number of patches
Varied experimentally to observe effect on Hessian; treated as a controllable architectural hyperparameter rather than fitted constant.
number of parameters per QNN
Varied to study depth of loss landscape; chosen as independent variable.

axioms (2)

domain assumption: The loss landscape of a variational quantum circuit can be meaningfully characterized by its Hessian at critical points.
Invoked when performing theoretical and empirical Hessian analyses.
domain assumption: Aggregation of independent patch outputs does not introduce dominant cross-term correlations that invalidate the single-patch curvature analysis.
Required for the claim that increasing patches reduces the global largest eigenvalue.

pith-pipeline@v0.9.0 · 5786 in / 1409 out tokens · 32678 ms · 2026-05-22T18:49:26.496180+00:00 · methodology
