arxiv: 2509.15113 · v2 · submitted 2025-09-18 · 💻 cs.LG

Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers

Pith reviewed 2026-05-18 15:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords low-rank surrogate modeling · stochastic zero-order optimization · hybrid neural networks · black-box physical layers · projector-splitting integrator · end-to-end training · hardware-aware deep learning

The pith

A dynamic low-rank surrogate and stochastic zero-order optimization enable end-to-end training of hybrid networks that include non-differentiable physical black-box layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that hybrid neural networks mixing standard digital layers with physical components that act as non-differentiable black boxes can still be trained effectively from start to finish. This would matter because physical hardware such as photonic devices promises lower energy use and faster inference, yet it has been hard to include in gradient-based learning pipelines. The method pairs direct stochastic zero-order updates on the physical parameters with a lightweight low-rank model that stands in for the black box during backpropagation. The surrogate is refreshed after each forward pass by an implicit projector-splitting integrator that requires only a few hardware queries instead of full matrix reconstruction. When the approach succeeds, the hybrid models reach accuracy levels close to fully digital baselines on computer vision, audio classification, and language modeling tasks.
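The zero-order half of the method can be illustrated with a small sketch. It uses a two-point random-direction estimator, which is one common choice and is an assumption here, not necessarily the estimator the authors use. The names `zo_gradient` and `black_box_loss`, the toy quadratic loss, and all hyperparameters are hypothetical.

```python
import numpy as np

def zo_gradient(loss, theta, num_dirs=8, mu=1e-3, rng=None):
    """Two-point randomized zero-order estimate of d(loss)/d(theta).

    Only evaluations of `loss` are used, never its derivative.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.standard_normal(theta.shape)            # random probe direction
        delta = loss(theta + mu * u) - loss(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u                # directional slope times direction
    return grad / num_dirs

# Toy stand-in for a physical layer: the loss is observable, its gradient is not.
target = np.array([1.0, -2.0, 0.5])

def black_box_loss(theta):
    return float(np.sum((theta - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(3)
initial_loss = black_box_loss(theta)
for _ in range(300):
    theta -= 0.05 * zo_gradient(black_box_loss, theta, rng=rng)
final_loss = black_box_loss(theta)
```

Each estimate costs `2 * num_dirs` loss evaluations, which is why this kind of update is usually reserved for the comparatively few parameters inside the physical layer.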

Core claim

The authors state that stochastic zeroth-order optimization handles updates to the internal parameters of the physical layer while a dynamic low-rank surrogate model, refreshed after each forward pass by the implicit projector-splitting integrator, supplies the gradients needed to train the digital layers. This combination permits reliable end-to-end training of hybrid architectures that incorporate various non-differentiable physical components and produces accuracy comparable to digital baselines across vision, audio, and language tasks.
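A minimal sketch of how a low-rank surrogate can supply gradients for the digital layers: the forward pass uses the real black-box output, while the backward pass uses the surrogate's Jacobian. The factor names `U`, `S`, `V`, the helper `surrogate_backward`, and the linear toy device are assumptions made for illustration.

```python
import numpy as np

def surrogate_backward(U, S, V, grad_y):
    """Pull grad_y back through the surrogate map x -> U @ S @ V.T @ x.

    The cost is O(r * (m + n)) for rank r, instead of O(m * n) for a dense matrix.
    """
    return V @ (S.T @ (U.T @ grad_y))

rng = np.random.default_rng(0)
m, n, r = 12, 10, 3

# Hidden device matrix (exactly rank r here, so the surrogate can be exact).
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
black_box = lambda x: A @ x

# Surrogate factors; in this toy they come from an SVD of the hidden matrix.
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U, S, V = U_full[:, :r], np.diag(s[:r]), Vt[:r].T

x = rng.standard_normal(n)
y = black_box(x)                 # forward pass: query the device
grad_y = 2.0 * y                 # gradient of the toy loss ||y||^2 with respect to y
grad_x = surrogate_backward(U, S, V, grad_y)   # backward pass: surrogate only
```

In this toy case the surrogate is exact, so `grad_x` equals the true gradient `A.T @ grad_y`; with an approximate surrogate the gradient is only as good as the approximation.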

What carries the argument

The implicit projector-splitting integrator that updates the low-rank surrogate model after each forward pass to approximate the input-output behavior of the black-box physical layer.
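For orientation, here is a sketch of one step of the classical projector-splitting (KSL) integrator of Lubich and Oseledets, on which the paper's implicit variant builds. This is the textbook explicit step, not the authors' exact algorithm. The function name `ksl_step` and the callables `apply_delta` and `apply_delta_T` are hypothetical, and the sketch assumes products with the transposed increment are available, which a real device may not offer directly.

```python
import numpy as np

def ksl_step(U0, S0, V0, apply_delta, apply_delta_T):
    """One KSL step updating Y0 = U0 @ S0 @ V0.T after the tracked matrix changes by dA.

    apply_delta(X) must return dA @ X and apply_delta_T(X) must return dA.T @ X,
    so dA itself is never formed.
    """
    dA_V0 = apply_delta(V0)                    # r matrix-vector products
    # K-step: update the column basis.
    U1, S_hat = np.linalg.qr(U0 @ S0 + dA_V0)
    # S-step: subtract the increment projected onto the new column and old row bases.
    S_tilde = S_hat - U1.T @ dA_V0
    # L-step: update the row basis.
    V1, S1_T = np.linalg.qr(V0 @ S_tilde.T + apply_delta_T(U1))
    return U1, S1_T.T, V1

rng = np.random.default_rng(0)
m, n, r = 20, 15, 3
A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # old rank-r matrix
A1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # new rank-r matrix
dA = A1 - A0

U_full, s, Vt = np.linalg.svd(A0, full_matrices=False)
U0, S0, V0 = U_full[:, :r], np.diag(s[:r]), Vt[:r].T

U1, S1, V1 = ksl_step(U0, S0, V0, lambda X: dA @ X, lambda X: dA.T @ X)
Y1 = U1 @ S1 @ V1.T
```

The increment enters only through `r` products with `dA` and `r` products with its transpose, which is the sense in which such an update avoids full matrix reconstruction. When the old and new matrices both have rank `r` and the initial factors are exact, the step reproduces the new matrix exactly, which the toy example relies on.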

If this is right

  • End-to-end training becomes possible for networks that contain spatial light modulators, microring resonators, or Mach-Zehnder interferometers.
  • Only a small number of hardware queries per iteration are needed to keep the surrogate current.
  • The same framework works for computer vision, audio classification, and language modeling while staying close to digital baseline accuracy.
  • Gradient-free optimization and hardware-aware training are combined into one practical pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surrogate-plus-zero-order pattern could be tried on other non-differentiable black-box modules such as certain analog circuits or quantum simulators.
  • Controlling the rank of the surrogate offers a direct dial for trading approximation quality against extra compute during the update step.
  • One could measure whether the reduced number of hardware queries actually lowers total energy cost compared with methods that query the physical layer more heavily.
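The rank dial mentioned above can be made concrete with a small experiment on a synthetic matrix. The matrix, its geometric singular-value decay, and the rank grid are all assumptions chosen for illustration. They say nothing about the spectra of real photonic devices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 48

# Synthetic device response with geometrically decaying singular values 0.8**k.
Q1, _ = np.linalg.qr(rng.standard_normal((m, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(0.8 ** np.arange(n)) @ Q2.T

U, s, Vt = np.linalg.svd(A, full_matrices=False)

ranks = [2, 4, 8, 16]
rel_errors = []      # relative Frobenius error of the best rank-r approximation
param_counts = []    # numbers stored by the factors U (m x r), S (r x r), V (n x r)
for r in ranks:
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    rel_errors.append(np.linalg.norm(A - A_r) / np.linalg.norm(A))
    param_counts.append(r * (m + n + r))
```

The error falls as the rank grows while the factor storage, and with it the cost of each surrogate update, rises. How fast the error falls depends entirely on the spectrum of the layer being approximated.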

Load-bearing premise

The low-rank surrogate must stay accurate enough to the black-box physical layer's true input-output mapping that the gradients it supplies remain useful for updating the rest of the network.

What would settle it

Training the same hybrid models on the reported tasks but replacing the dynamic surrogate with either a fixed random approximation or no approximation at all, then checking whether accuracy falls well below the digital baseline or training fails to converge.

Figures

Figures reproduced from arXiv: 2509.15113 by Andrei Chertkov, Artem Basharin, Evgeny Frolov, Ivan Oseledets, Mikhail Saygin, Stanislav Straupe.

Figure 1: Schematic representation of the proposed method
Figure 2: Illustration of the physical layers simulated in this work: a) MRR weight
Figure 3: Accuracy results averaged over five independent runs for CIFAR-10 image clas
Original abstract

The growing demand for energy-efficient, high-performance AI systems has led to increased attention on alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including: computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a framework for end-to-end training of hybrid neural networks that combine digital layers with non-differentiable black-box physical components (spatial light modulators, microring resonators, Mach-Zehnder interferometers). It integrates stochastic zeroth-order optimization to update physical-layer parameters together with a dynamic low-rank surrogate model that is refreshed after each forward pass by an implicit projector-splitting integrator; the surrogate is intended to furnish gradients through the physical layer while requiring only minimal hardware queries. Experiments are reported on computer vision, audio classification, and language modeling tasks, with the claim that near-digital baseline accuracy is attained across modalities.

Significance. If the surrogate approximation remains sufficiently accurate to support stable gradient estimates, the work would offer a practical route for incorporating energy-efficient physical hardware into trainable deep-learning pipelines. Credit is due for the multi-modal experimental scope and for the dynamic update mechanism that avoids repeated full-matrix reconstructions. The integration of zeroth-order optimization with an online low-rank surrogate constitutes a concrete contribution at the intersection of hardware-aware learning and gradient-free methods.

major comments (3)
  1. Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.
  2. Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.
  3. Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.
minor comments (2)
  1. Notation for the low-rank factors (U, S, V) and the precise update equations of the integrator should be collected in a single displayed block for clarity.
  2. Figure captions could explicitly state the number of hardware queries used per surrogate refresh to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the significance of the work at the intersection of hardware-aware learning and gradient-free methods is recognized. We address each of the major comments below and will incorporate revisions as indicated.

  1. Referee: Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.

    Authors: We agree that the abstract would benefit from more quantitative details to support the performance claim. In the revised manuscript, we will update the abstract to include specific numerical accuracy values (e.g., top-1 accuracy percentages), standard deviations from repeated experiments, and explicit comparisons to the digital baselines for the vision, audio, and language modeling tasks. This will provide a clearer assessment of the surrogate's effectiveness in enabling usable gradients.
    revision: yes

  2. Referee: Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.

    Authors: We acknowledge that quantifying the surrogate approximation error is crucial for validating the approach. The current manuscript emphasizes the practical implementation and end-to-end results, but we will add empirical measurements of the approximation error, such as the Frobenius norm difference between the surrogate and true mappings over the course of training, in a new subsection or figure in the revised version. For theoretical error bounds and convergence analysis, developing rigorous guarantees for the stochastic setting with black-box layers is non-trivial and beyond the scope of the current work; however, we will include a discussion on the observed stability and empirical convergence rates to address this concern.
    revision: partial

  3. Referee: Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.

    Authors: We agree that ablations on key hyperparameters would strengthen the experimental section. In the revision, we will add results or analysis on the impact of surrogate rank (e.g., varying rank from 10 to 100) and query budget per update on the final accuracy. Regarding sensitivity to input dimensionality, we will highlight the language modeling experiments, which use high-dimensional token embeddings, and demonstrate that the method maintains performance with a fixed small query budget. If feasible within page limits, a dedicated ablation study will be included.
    revision: yes
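The empirical error measurement promised in the second response could, for a linear layer, be estimated from a handful of queries without reconstructing the full matrix. The sketch below is one possible way to do that with Gaussian probes, under the assumption that the black box accepts a batch of input columns. The helper name `probe_relative_error` and the toy matrix are hypothetical, and nothing here is taken from the manuscript.

```python
import numpy as np

def probe_relative_error(black_box, U, S, V, num_probes=64, rng=None):
    """Estimate ||A - U S V^T||_F / ||A||_F from num_probes black-box queries.

    For Gaussian probes Z with k columns, E||M Z||_F^2 = k * ||M||_F^2,
    so the ratio of the two probe norms estimates the ratio of Frobenius norms.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal((V.shape[0], num_probes))
    Y_true = black_box(Z)                    # the only device queries
    Y_surrogate = U @ (S @ (V.T @ Z))
    return np.linalg.norm(Y_true - Y_surrogate) / np.linalg.norm(Y_true)

rng = np.random.default_rng(0)
m, n, r = 30, 20, 5
A = rng.standard_normal((m, n))              # hidden full-rank toy layer
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U, S, V = U_full[:, :r], np.diag(s[:r]), Vt[:r].T   # rank-5 surrogate

estimated = probe_relative_error(lambda Z: A @ Z, U, S, V, rng=rng)
exact = np.linalg.norm(A - U @ S @ V.T) / np.linalg.norm(A)
```

Logging `estimated` during training would give the kind of error curve the response describes, at a cost of `num_probes` extra queries per measurement.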

Circularity Check

0 steps flagged

No circularity: framework integrates external optimization and hardware measurements with empirical validation

The paper proposes a hybrid training framework that combines stochastic zeroth-order optimization for physical-layer parameters with a dynamic low-rank surrogate model updated via the implicit projector-splitting integrator after each forward pass. The central performance claim of near-digital baseline accuracy is supported by direct empirical results across computer vision, audio classification, and language modeling tasks, obtained by querying physical layers that the Figure 2 caption describes as simulated, such as spatial light modulators and microring resonators. No derivation step reduces a prediction or result to a quantity defined inside the paper by construction, and the approach relies on standard external primitives rather than self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that physical black-box layers admit useful low-rank approximations and that zero-order perturbations can be used to optimize their internal parameters without explicit gradients.

free parameters (1)
  • surrogate rank
    The rank of the low-rank model is a modeling choice that controls approximation quality versus query cost.
axioms (1)
  • domain assumption: Physical device response can be approximated by a low-rank linear map for the purpose of gradient flow
    Invoked to justify the surrogate model that enables backpropagation through the black-box layer.
invented entities (1)
  • dynamic low-rank surrogate model (no independent evidence)
    purpose: Approximates the non-differentiable physical layer to permit gradient-based updates in the digital part of the network
    New modeling construct introduced to bridge the differentiability gap.
