Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers

Andrei Chertkov; Artem Basharin; Evgeny Frolov; Ivan Oseledets; Mikhail Saygin; Stanislav Straupe

arXiv: 2509.15113 · v2 · submitted 2025-09-18 · 💻 cs.LG

Pith reviewed 2026-05-18 15:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords: low-rank surrogate modeling · stochastic zero-order optimization · hybrid neural networks · black-box physical layers · projector-splitting integrator · end-to-end training · hardware-aware deep learning

The pith

A dynamic low-rank surrogate and stochastic zero-order optimization enable end-to-end training of hybrid networks that include non-differentiable physical black-box layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that hybrid neural networks mixing standard digital layers with physical components that act as non-differentiable black boxes can still be trained effectively from start to finish. This would matter because physical hardware such as photonic devices promises lower energy use and faster inference, yet it has been hard to include in gradient-based learning pipelines. The method pairs direct stochastic zero-order updates on the physical parameters with a lightweight low-rank model that stands in for the black box during backpropagation. The surrogate is refreshed after each forward pass by an implicit projector-splitting integrator that requires only a few hardware queries instead of full matrix reconstruction. When the approach succeeds, the hybrid models reach accuracy levels close to fully digital baselines on computer vision, audio classification, and language modeling tasks.

Core claim

The authors state that stochastic zeroth-order optimization handles updates to the internal parameters of the physical layer while a dynamic low-rank surrogate model, refreshed after each forward pass by the implicit projector-splitting integrator, supplies the gradients needed to train the digital layers. This combination permits reliable end-to-end training of hybrid architectures that incorporate various non-differentiable physical components and produces accuracy comparable to digital baselines across vision, audio, and language tasks.
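
As a rough illustration of the zeroth-order half of this claim, the sketch below applies a two-point random-direction gradient estimate to a toy quadratic loss. It is not the authors' implementation: the names `zo_gradient` and `black_box_loss`, the Gaussian probe direction, the smoothing scale `mu`, the step size, and the toy target are all assumptions chosen for the example.

```python
import numpy as np

def zo_gradient(loss, theta, mu, rng):
    """Two-point zeroth-order gradient estimate along one random direction."""
    u = rng.standard_normal(theta.shape)             # random probe direction
    slope = (loss(theta + mu * u) - loss(theta - mu * u)) / (2.0 * mu)
    return slope * u                                 # directional slope times direction

# Toy stand-in for a loss that can be evaluated but not differentiated.
target = np.array([0.5, -1.0, 2.0, 0.0, 1.5])

def black_box_loss(theta):
    return float(np.sum((theta - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(5)
initial_loss = black_box_loss(theta)
for _ in range(300):
    # Plain gradient-descent step using only loss evaluations.
    theta = theta - 0.05 * zo_gradient(black_box_loss, theta, mu=1e-3, rng=rng)
final_loss = black_box_loss(theta)
```

Each update costs two loss evaluations and no derivative of the loss, which is the property that makes this family of methods usable on parameters hidden inside a physical device.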

What carries the argument

The implicit projector-splitting integrator that updates the low-rank surrogate model after each forward pass to approximate the input-output behavior of the black-box physical layer.
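
For orientation, here is a minimal sketch of one step of the standard projector-splitting integrator of Lubich and Oseledets (2014) for a rank-r factorization Y = U S Vᵀ. It is an assumption that the paper's implicit variant has this shape; how the authors obtain the two matrix products from hardware queries is not described in this summary. The function name `projector_splitting_step` and the random test matrices are hypothetical.

```python
import numpy as np

def projector_splitting_step(U, S, V, delta):
    """One first-order projector-splitting (K-S-L) step for Y = U @ S @ V.T.

    `delta` is the increment of the matrix being tracked. It enters only
    through the two products delta @ V and delta.T @ U_new.
    """
    dV = delta @ V                                      # (m, r) product with old right basis
    U_new, S_hat = np.linalg.qr(U @ S + dV)             # K-step: update left basis
    S_tilde = S_hat - U_new.T @ dV                      # S-step: backward correction
    dU = delta.T @ U_new                                # (n, r) product with new left basis
    V_new, S_new_T = np.linalg.qr(V @ S_tilde.T + dU)   # L-step: update right basis
    return U_new, S_new_T.T, V_new

rng = np.random.default_rng(0)
m, n, r = 12, 9, 3
A_old = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
A_new = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Start from an exact rank-r factorization of the old matrix.
U_full, s_full, Vt_full = np.linalg.svd(A_old, full_matrices=False)
U, S, V = U_full[:, :r], np.diag(s_full[:r]), Vt_full[:r].T

U1, S1, V1 = projector_splitting_step(U, S, V, A_new - A_old)
error = np.linalg.norm(U1 @ S1 @ V1.T - A_new)
```

When both the old and the new matrix have rank r, this step reproduces the new matrix up to rounding error, which is what `error` measures. The step touches the tracked matrix only through two thin products with r columns each, so the full matrix never has to be reconstructed.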

If this is right

End-to-end training becomes possible for networks that contain spatial light modulators, microring resonators, or Mach-Zehnder interferometers.
Only a small number of hardware queries per iteration are needed to keep the surrogate current.
The same framework works for computer vision, audio classification, and language modeling while staying close to digital baseline accuracy.
Gradient-free optimization and hardware-aware training are combined into one practical pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same surrogate-plus-zero-order pattern could be tried on other non-differentiable black-box modules such as certain analog circuits or quantum simulators.
Controlling the rank of the surrogate offers a direct dial for trading approximation quality against extra compute during the update step.
One could measure whether the reduced number of hardware queries actually lowers total energy cost compared with methods that query the physical layer more heavily.
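
To make the rank dial concrete, the sketch below computes the Frobenius error of the best rank-r approximation of a matrix for several ranks. The random matrix `W` is only a stand-in, not a model of any device in the paper, and `truncation_error` is a hypothetical helper name.

```python
import numpy as np

def truncation_error(W, rank):
    """Frobenius error of the best rank-`rank` approximation of W.

    Returns the directly measured error and the value predicted from the
    discarded singular values; the two should agree.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]       # truncated SVD
    return np.linalg.norm(W - W_r), np.sqrt(np.sum(s[rank:] ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
errors = [truncation_error(W, r)[0] for r in (2, 4, 8, 16)]
```

The error falls as the rank grows and vanishes at full rank, while the cost of storing and updating the factors grows with the rank. Where a given physical layer sits on that curve is an empirical question this summary does not answer.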

Load-bearing premise

The low-rank surrogate must stay accurate enough to the black-box physical layer's true input-output mapping that the gradients it supplies remain useful for updating the rest of the network.
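
The premise can be illustrated with a small numerical check. The sketch assumes the black-box layer acts as a linear map y = W x that is close to rank r, and compares the input gradient computed from the surrogate factors with the true one. The matrix `W_true`, the noise level, and the helper `surrogate_backward` are hypothetical, and the surrogate here comes from a truncated SVD rather than from the paper's integrator.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 20, 4

# Hypothetical black-box layer: mostly rank-r, plus a small full-rank part.
W_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W_true += 0.01 * rng.standard_normal((m, n))

# Rank-r surrogate factors (taken from a truncated SVD for simplicity).
U_full, s_full, Vt_full = np.linalg.svd(W_true, full_matrices=False)
U, S, V = U_full[:, :r], np.diag(s_full[:r]), Vt_full[:r].T

def surrogate_backward(grad_out):
    """Gradient w.r.t. the layer input, using only the surrogate factors."""
    return V @ (S.T @ (U.T @ grad_out))              # (U S V^T)^T applied to dL/dy

grad_out = rng.standard_normal(m)                    # upstream gradient dL/dy
grad_true = W_true.T @ grad_out                      # unavailable on real hardware
grad_surr = surrogate_backward(grad_out)
cosine = grad_true @ grad_surr / (
    np.linalg.norm(grad_true) * np.linalg.norm(grad_surr)
)
```

A cosine close to 1 means the surrogate gradient points in nearly the same direction as the true one. As the surrogate drifts away from the device, the cosine drops, which is the failure mode this premise guards against.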

What would settle it

Training the same hybrid models on the reported tasks but replacing the dynamic surrogate with either a fixed random approximation or no approximation at all, then checking whether accuracy falls well below the digital baseline or training fails to converge.

Figures

Figures reproduced from arXiv: 2509.15113 by Andrei Chertkov, Artem Basharin, Evgeny Frolov, Ivan Oseledets, Mikhail Saygin, Stanislav Straupe.

**Figure 2.** Illustration of the physical layers simulated in this work: a) MRR weight …

**Figure 3.** Accuracy results averaged over five independent runs for CIFAR-10 image clas…

Original abstract

The growing demand for energy-efficient, high-performance AI systems has led to increased attention on alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including: computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical recipe for end-to-end training of hybrid networks that include non-differentiable physical layers by combining a dynamic low-rank surrogate with stochastic zero-order updates, though the evidence on surrogate fidelity is still thin.

The main thing to know is that this work shows how to train networks that mix standard digital layers with black-box physical components such as spatial light modulators or microring resonators. The authors use stochastic zero-order optimization to tune the physical parameters and maintain a low-rank surrogate of the layer's input-output map so that gradients can flow back into the digital part. The surrogate is refreshed after each forward pass by an implicit projector-splitting integrator that limits the number of hardware queries. They run the approach on computer vision, audio classification, and language modeling and report accuracies close to fully digital baselines across those tasks.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a framework for end-to-end training of hybrid neural networks that combine digital layers with non-differentiable black-box physical components (spatial light modulators, microring resonators, Mach-Zehnder interferometers). It integrates stochastic zeroth-order optimization to update physical-layer parameters together with a dynamic low-rank surrogate model that is refreshed after each forward pass by an implicit projector-splitting integrator; the surrogate is intended to furnish gradients through the physical layer while requiring only minimal hardware queries. Experiments are reported on computer vision, audio classification, and language modeling tasks, with the claim that near-digital baseline accuracy is attained across modalities.

Significance. If the surrogate approximation remains sufficiently accurate to support stable gradient estimates, the work would offer a practical route for incorporating energy-efficient physical hardware into trainable deep-learning pipelines. Credit is due for the multi-modal experimental scope and for the dynamic update mechanism that avoids repeated full-matrix reconstructions. The integration of zeroth-order optimization with an online low-rank surrogate constitutes a concrete contribution at the intersection of hardware-aware learning and gradient-free methods.

major comments (3)

Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.
Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.
Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.

minor comments (2)

Notation for the low-rank factors (U, S, V) and the precise update equations of the integrator should be collected in a single displayed block for clarity.
Figure captions could explicitly state the number of hardware queries used per surrogate refresh to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the significance of the work at the intersection of hardware-aware learning and gradient-free methods is recognized. We address each of the major comments below and will incorporate revisions as indicated.

Referee: Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.

Authors: We agree that the abstract would benefit from more quantitative details to support the performance claim. In the revised manuscript, we will update the abstract to include specific numerical accuracy values (e.g., top-1 accuracy percentages), standard deviations from repeated experiments, and explicit comparisons to the digital baselines for the vision, audio, and language modeling tasks. This will provide a clearer assessment of the surrogate's effectiveness in enabling usable gradients.
revision: yes

Referee: Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.

Authors: We acknowledge that quantifying the surrogate approximation error is crucial for validating the approach. The current manuscript emphasizes the practical implementation and end-to-end results, but we will add empirical measurements of the approximation error, such as the Frobenius norm difference between the surrogate and true mappings over the course of training, in a new subsection or figure in the revised version. For theoretical error bounds and convergence analysis, developing rigorous guarantees for the stochastic setting with black-box layers is non-trivial and beyond the scope of the current work; however, we will include a discussion on the observed stability and empirical convergence rates to address this concern.
revision: partial

Referee: Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.

Authors: We agree that ablations on key hyperparameters would strengthen the experimental section. In the revision, we will add results or analysis on the impact of surrogate rank (e.g., varying rank from 10 to 100) and query budget per update on the final accuracy. Regarding sensitivity to input dimensionality, we will highlight the language modeling experiments, which use high-dimensional token embeddings, and demonstrate that the method maintains performance with a fixed small query budget. If feasible within page limits, a dedicated ablation study will be included.
revision: yes
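
The Frobenius-norm measurement promised in the second response could, under one set of assumptions, be estimated from forward queries alone. The sketch below assumes the layer acts as a linear map and uses the identity E‖Mz‖² = ‖M‖_F² for standard Gaussian probes z. The helper `probe_frobenius_sq`, the matrix sizes, and the probe count are hypothetical and do not come from the paper or the rebuttal.

```python
import numpy as np

def probe_frobenius_sq(apply_true, apply_surrogate, dim_in, n_probes, rng):
    """Estimate ||W - W_hat||_F^2 from forward queries on Gaussian probes."""
    Z = rng.standard_normal((dim_in, n_probes))      # one probe per column
    residual = apply_true(Z) - apply_surrogate(Z)    # (W - W_hat) @ Z
    return float(np.mean(np.sum(residual ** 2, axis=0)))

rng = np.random.default_rng(0)
W = rng.standard_normal((30, 40))                    # stand-in for the black box
W_hat = W + 0.1 * rng.standard_normal((30, 40))      # stand-in for the surrogate

estimate = probe_frobenius_sq(lambda Z: W @ Z, lambda Z: W_hat @ Z, 40, 2000, rng)
exact = float(np.linalg.norm(W - W_hat) ** 2)        # needs W itself, shown for comparison
```

The estimate needs only input-output pairs from the device, at the cost of one query per probe. Whether such a probe budget is affordable on real hardware is a separate question.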

Circularity Check

0 steps flagged

No circularity: framework integrates external optimization and hardware measurements with empirical validation

The paper proposes a hybrid training framework that combines stochastic zeroth-order optimization for physical-layer parameters with a dynamic low-rank surrogate model updated via the implicit projector-splitting integrator after each forward pass. The central performance claim of near-digital baseline accuracy is supported by direct empirical results across computer vision, audio classification, and language modeling tasks, obtained by querying physical layers such as spatial light modulators and microring resonators, which the Figure 2 caption describes as simulated. No derivation step reduces a prediction or result to a quantity defined inside the paper by construction, and the approach relies on standard external primitives rather than self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that physical black-box layers admit useful low-rank approximations and that zero-order perturbations can be used to optimize their internal parameters without explicit gradients.

free parameters (1)

surrogate rank
The rank of the low-rank model is a modeling choice that controls approximation quality versus query cost.

axioms (1)

domain assumption: Physical device response can be approximated by a low-rank linear map for the purpose of gradient flow
Invoked to justify the surrogate model that enables backpropagation through the black-box layer.

invented entities (1)

dynamic low-rank surrogate model (no independent evidence)
purpose: Approximates the non-differentiable physical layer to permit gradient-based updates in the digital part of the network
New modeling construct introduced to bridge the differentiability gap.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "dynamic low-rank surrogate model ... implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "stochastic zeroth-order optimization for updating the physical layer's internal parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

I. H. Sarker, Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions, SN computer science 2 (6) (2021) 1–20

work page 2021

[2] [2]

Bente, S

I. Bente, S. Taheriniya, F. Lenzini, F. Brückerhoff-Plückelmann, M. Kues, H. Bhaskaran, C. D. Wright, W. Pernice, The potential of multidimensional photonic computing, Nature Reviews Physics (2025) 1–12

work page 2025

[3] [3]

Moralis-Pegios, G

M. Moralis-Pegios, G. Mourgias-Alexandris, A. Tsakyridis, G. Gi- amougiannis, A. Totovic, G. Dabos, N. Passalis, M. Kirtas, T. Ruti- rawut, F. Gardes, et al., Neuromorphic silicon photonics and hardware- aware deep learning for high-speed inference, Journal of Lightwave Tech- nology 40 (10) (2022) 3243–3254

work page 2022

[4] [4]

S.-Y. Ma, T. Wang, J. Laydevant, L. G. Wright, P. L. McMahon, Quantum-limited stochastic optical neural networks operating at a few quanta per activation, Nature Communications 16 (1) (2025) 359

work page 2025

[5] [5]

S. R. Ahmed, R. Baghdadi, M. Bernadskiy, N. Bowman, R. Braid, J. Carr, C. Chen, P. Ciccarella, M. Cole, J. Cooke, et al., Universal photonic artificial intelligence acceleration, Nature 640 (8058) (2025) 368–374. 26

work page 2025

[6] [6]

B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, P. R. Prucnal, Photonics for artificial in- telligence and neuromorphic computing, Nature Photonics 15 (2) (2021) 102–114

work page 2021

[7] [7]

K. Liao, T. Dai, Q. Yan, X. Hu, Q. Gong, Integrated photonic neural networks: Opportunities and challenges, ACS Photonics 10 (7) (2023) 2001–2010

work page 2023

[8] A. Montes McNeil, Y. Li, A. Zhang, M. Moebius, Y. Liu, Fundamentals and recent developments of free-space optical neural networks, Journal of Applied Physics 136 (3) (2024)

[9] M. Moralis-Pegios, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, N. Pleros, Perfect linear optics using silicon photonics, Nature Communications 15 (1) (2024) 5468

[10] A. Najjar Amiri, A. D. Vit, K. Gorgulu, E. S. Magden, Deep photonic network platform enabling arbitrary and broadband optical functionality, Nature Communications 15 (1) (2024) 1432

[11] M. Yildirim, N. U. Dinc, I. Oguz, D. Psaltis, C. Moser, Nonlinear processing with linear optics, Nature Photonics 18 (10) (2024) 1076–1082

[12] H. Wang, J. Hu, A. Morandi, A. Nardi, F. Xia, X. Li, R. Savo, Q. Liu, R. Grange, S. Gigan, Photonics breakthroughs 2024: Nonlinear photonic computing at scale, IEEE Photonics Journal (2025)

[13] C. Lubich, I. V. Oseledets, A projector-splitting integrator for dynamical low-rank approximation, BIT Numerical Mathematics 54 (1) (2014) 171–188

[14] A. Chen, Y. Zhang, J. Jia, J. Diffenderfer, J. Liu, K. Parasyris, Y. Zhang, Z. Zhang, B. Kailkhura, S. Liu, Deepzero: Scaling up zeroth-order optimization for deep model training, arXiv preprint arXiv:2310.02025 (2023)

[15] O. Olaleke, I. Oseledets, E. Frolov, Dynamic modeling of user preferences for stable recommendations, in: Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, 2021, pp. 262–266

[16] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, L. Amini, Zeroth-order stochastic variance reduction for nonconvex optimization, Advances in neural information processing systems 31 (2018)

[17] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, S. Arora, Fine-tuning language models with just forward passes, Advances in Neural Information Processing Systems 36 (2023) 53038–53075

[18] Y. Chen, Y. Zhang, L. Cao, K. Yuan, Z. Wen, Enhancing zeroth-order fine-tuning for language models with low-rank structures, arXiv preprint arXiv:2410.07698 (2024)

[19] F. Chaubard, M. Kochenderfer, Scaling recurrent neural networks to a billion parameters with zero-order optimization, arXiv preprint arXiv:2505.17852 (2025)

[20] S. Wang, L. Yu, J. Li, Lora-ga: Low-rank adaptation with gradient approximation, Advances in Neural Information Processing Systems 37 (2024) 54905–54931

[21] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, Y. Tian, Galore: Memory-efficient llm training by gradient low-rank projection, arXiv preprint arXiv:2403.03507 (2024)

[22] M. Gooneratne, K. C. Sim, P. Zadrazil, A. Kabel, F. Beaufays, G. Motta, Low-rank gradient approximation for memory-efficient on-device training of deep neural network, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 3017–3021

[23] L. Fournier, S. Rivaud, E. Belilovsky, M. Eickenberg, E. Oyallon, Can forward gradient match backpropagation?, in: International Conference on Machine Learning, PMLR, 2023, pp. 10249–10264

[24] Y. Refael, J. Svirsky, B. Shustin, W. Huleihel, O. Lindenbaum, Adarankgrad: Adaptive gradient-rank and moments for memory-efficient llms training and fine-tuning, arXiv preprint arXiv:2410.17881 (2024)

[25] Y. Refael, G. Smorodinsky, T. Tirer, O. Lindenbaum, SUMO: Subspace-aware moment-orthogonalization for accelerating memory-efficient LLM training, arXiv preprint arXiv:2505.24749 (2025)

[26] T. Fu, Y. Zang, Y. Huang, Z. Du, H. Huang, C. Hu, M. Chen, S. Yang, H. Chen, Photonic machine learning with on-chip diffractive optics, Nature Communications 14 (1) (2023) 70

[27] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. Chen, D. Pan, L2ight: Enabling on-chip learning for optical neural networks via efficient in-situ subspace optimization, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, Vol. 34, Curran Associates, Inc., 2021, pp. 8649–8661

[28] Y. Zhao, X. Yu, Z. Chen, Z. Liu, S. Liu, Z. Zhang, Tensor-compressed back-propagation-free training for (physics-informed) neural networks (2023)

[29] Z. Qu, Z. Zhou, Y. Tong, L. Thiele, p-meta: Towards on-device deep model adaptation, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, ACM, 2022, pp. 1441–1451

[30] S. Pai, Z. Sun, T. W. Hughes, T. Park, B. Bartlett, I. A. D. Williamson, M. Minkov, M. Milanizadeh, N. Abebe, F. Morichetti, A. Melloni, S. Fan, O. Solgaard, D. A. B. Miller, Experimentally realized in situ backpropagation for deep learning in photonic neural networks, Science 380 (6643) (2023) 398–404

[31] T. Zhou, L. Fang, T. Yan, J. Wu, Y. Li, J. Fan, H. Wu, X. Lin, Q. Dai, In situ optical backpropagation training of diffractive optical neural networks, Photon. Res. 8 (6) (2020) 940–953

[32] J. Spall, X. Guo, T. D. Barrett, A. I. Lvovsky, Fully reconfigurable coherent optical vector–matrix multiplication, Opt. Lett. 45 (20) (2020) 5752–5755

[33] S. Bandyopadhyay, A. Sludds, S. Krastanov, R. Hamerly, N. Harris, D. Bunandar, M. Streshinsky, M. Hochberg, D. Englund, Single-chip photonic deep neural network with forward-only training, Nature Photonics 18 (12) (2024) 1335–1343

[34] Z. Wang, K. Müller, M. Filipovich, J. Launay, R. Ohana, G. Pariente, S. Mokaadi, C. Brossollet, F. Moreau, A. Cappelli, I. Poli, I. Carron, L. Daudet, F. Krzakala, S. Gigan, Streamlined optical training of large-scale modern deep learning architectures with direct feedback alignment (2025)

[35] G. Hinton, The forward-forward algorithm: Some preliminary investigations, arXiv preprint arXiv:2212.13345 2 (3) (2022) 5

[36] I. Oguz, J. Ke, Q. Weng, F. Yang, M. Yildirim, N. U. Dinc, J.-L. Hsieh, C. Moser, D. Psaltis, Forward–forward training of an optical neural network, Opt. Lett. 48 (20) (2023) 5249–5252

[37] A. N. McCaughan, B. G. Oripov, N. Ganesh, S. W. Nam, A. Dienstfrey, S. M. Buckley, Multiplexed gradient descent: Fast online training of modern datasets on hardware neural networks without backpropagation, APL Machine Learning 1 (2) (2023) 026118

[38] S. Pai, I. A. Williamson, T. W. Hughes, M. Minkov, O. Solgaard, S. Fan, D. A. Miller, Parallel fault-tolerant programming of an arbitrary feedforward photonic network, arXiv preprint arXiv:1909.06179 (2019)

[39] F. Laporte, J. Dambre, P. Bienstman, Highly parallel simulation and optimization of photonic circuits in time and frequency domain based on the deep-learning framework pytorch, Scientific reports 9 (1) (2019) 5918

[40] J. Gu, H. Zhu, C. Feng, Z. Jiang, R. T. Chen, D. Z. Pan, L2ight: Enabling on-chip learning for optical neural networks via efficient in-situ subspace optimization, in: Conference on Neural Information Processing Systems (NeurIPS), 2021

[41] Z. Yin, M. Zhang, A. Begovic, R. Huang, J. Zhang, J. Gu, Simphony: A device-circuit-architecture cross-layer modeling and simulation framework for heterogeneous electronic-photonic ai system, arXiv preprint arXiv:2411.13715 (2024)

[42] Z. Zheng, Z. Duan, H. Chen, R. Yang, S. Gao, H. Zhang, H. Xiong, X. Lin, Dual adaptive training of photonic neural networks, Nature Machine Intelligence (2023) 1–11

[43] G. Giamougiannis, A. Tsakyridis, Y. Ma, A. Totović, M. Moralis-Pegios, D. Lazovsky, N. Pleros, A coherent photonic crossbar for scalable universal linear optics, Journal of Lightwave Technology 41 (8) (2023) 2425–2442

[44] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, P. R. Prucnal, Microring weight banks, IEEE Journal of Selected Topics in Quantum Electronics 22 (6) (2016) 312–325

[45] N. Tamura, J. C. Wyant, Two-dimensional matrix multiplication using coherent optical techniques, Opt. Eng. 18 (198) (1979)

[46] T. Dao, B. Chen, N. Sohoni, A. Desai, M. Poli, J. Grogan, A. Liu, A. Rao, A. Rudra, C. Ré, Monarch: Expressive structured matrices for efficient and accurate training (2022)

[47] S. Qiu, A. Potapczynski, M. Finzi, M. Goldblum, A. G. Wilson, Compute better spent: Replacing dense layers with structured matrices, arXiv preprint arXiv:2406.06248 (2024)

[48] W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, I. A. Walmsley, Optimal design for universal multiport interferometers, Optica 3 (12) (2016) 1460–1465

[49] R. Hamerly, S. Bandyopadhyay, D. Englund, Asymptotically fault-tolerant programmable photonics, Nature Communications 13 (1) (2022) 6831

[50] M. Dong, M. Zimmermann, D. Heim, H. Choi, G. Clark, A. J. Leenheer, K. J. Palm, A. Witte, D. Dominguez, G. Gilbert, M. Eichenfield, D. Englund, Programmable photonic integrated meshes for modular generation of optical entanglement links, npj Quantum Information 9 (1) (2023) 42

[51] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009)

[52] J. Salamon, C. Jacoby, J. P. Bello, A dataset and taxonomy for urban sound research, in: Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 1041–1044

[53] B. Desplanques, J. Thienpondt, K. Demuynck, ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification, arXiv preprint arXiv:2005.07143 (2020)

[54] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9

[55] G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, T. Wolf, The fineweb datasets: Decanting the web for the finest text data at scale, in: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024