Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers
Pith reviewed 2026-05-18 15:20 UTC · model grok-4.3
The pith
A dynamic low-rank surrogate and stochastic zero-order optimization enable end-to-end training of hybrid networks that include non-differentiable physical black-box layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors state that stochastic zeroth-order optimization handles updates to the internal parameters of the physical layer while a dynamic low-rank surrogate model, refreshed after each forward pass by the implicit projector-splitting integrator, supplies the gradients needed to train the digital layers. This combination permits reliable end-to-end training of hybrid architectures that incorporate various non-differentiable physical components and produces accuracy comparable to digital baselines across vision, audio, and language tasks.
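To make the zeroth-order half of this claim concrete, here is a minimal sketch of a two-point random-direction gradient estimator, a common form of stochastic zeroth-order optimization. It is not taken from the paper: `zo_gradient`, `toy_loss`, the smoothing radius `mu`, and the step size are all hypothetical, and the quadratic loss merely stands in for "run the hardware with parameters theta and read out the loss".

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, num_dirs=1, rng=None):
    """Two-point random-direction estimate of d loss / d theta.

    Uses only evaluations of loss_fn (2 * num_dirs of them), never its derivative.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.standard_normal(theta.shape)  # random probing direction
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_dirs


def toy_loss(theta):
    # Hypothetical stand-in for a loss measured through the physical layer.
    return float(np.sum(theta ** 2))


rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
initial_loss = toy_loss(theta)
for _ in range(300):
    # Plain gradient descent driven by the zeroth-order estimate.
    theta = theta - 0.05 * zo_gradient(toy_loss, theta, rng=rng)
final_loss = toy_loss(theta)
```

Each update costs two loss evaluations per probing direction, which is why this family of methods suits parameters that can only be reached through forward queries.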
What carries the argument
The implicit projector-splitting integrator that updates the low-rank surrogate model after each forward pass to approximate the input-output behavior of the black-box physical layer.
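The summary does not spell out the paper's implicit variant, so the following is only a sketch of the standard first-order projector-splitting (KSL) step for a low-rank matrix, assuming that this is the kind of update involved. The function name `projector_splitting_step` and the explicit increment `dA` are hypothetical; the sketch forms `dA` as a full matrix for clarity, which is exactly the reconstruction the paper says it avoids.

```python
import numpy as np

def projector_splitting_step(U0, S0, V0, dA):
    """One first-order KSL step: update Y = U S V^T by the increment dA, keeping rank r."""
    # K-step: move the column space.
    K = U0 @ S0 + dA @ V0
    U1, S_hat = np.linalg.qr(K)
    # S-step: subtract the increment projected onto the new column and old row space.
    S_tilde = S_hat - U1.T @ dA @ V0
    # L-step: move the row space.
    L = V0 @ S_tilde.T + dA.T @ U1
    V1, S1_T = np.linalg.qr(L)
    return U1, S1_T.T, V1


rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r "old" map
A1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r "new" map

# Start from an exact rank-r factorization of A0.
U, s, Vt = np.linalg.svd(A0, full_matrices=False)
U0, S0, V0 = U[:, :r], np.diag(s[:r]), Vt[:r].T

U1, S1, V1 = projector_splitting_step(U0, S0, V0, A1 - A0)
Y1 = U1 @ S1 @ V1.T  # updated surrogate; matches A1 when both maps have rank r
```

The step touches the increment only through the products `dA @ V0`, `U1.T @ dA @ V0`, and `dA.T @ U1`, each involving r vectors, which is what makes a low query count plausible in principle.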
If this is right
- End-to-end training becomes possible for networks that contain spatial light modulators, microring resonators, or Mach-Zehnder interferometers.
- Only a small number of hardware queries per iteration are needed to keep the surrogate current.
- The same framework works for computer vision, audio classification, and language modeling while staying close to digital baseline accuracy.
- Gradient-free optimization and hardware-aware training are combined into one practical pipeline.
Where Pith is reading between the lines
- The same surrogate-plus-zero-order pattern could be tried on other non-differentiable black-box modules such as certain analog circuits or quantum simulators.
- Controlling the rank of the surrogate offers a direct dial for trading approximation quality against extra compute during the update step.
- One could measure whether the reduced number of hardware queries actually lowers total energy cost compared with methods that query the physical layer more heavily.
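The rank dial mentioned in the second point can be illustrated with a small sketch. It assumes the layer's response is a linear map `W` (here a random matrix, purely hypothetical) and uses a truncated SVD, which gives the best approximation of each rank in the Frobenius norm; the paper's surrogate need not be built this way.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))  # hypothetical stand-in for the layer's linear response
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def truncate(rank):
    """Best rank-`rank` approximation of W in the Frobenius norm."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

ranks = [1, 4, 8, 16, 24]
# Relative approximation error for each rank.
errors = [np.linalg.norm(W - truncate(r)) / np.linalg.norm(W) for r in ranks]
# Storage for the factors grows linearly in the rank: r * (m + n + 1) numbers.
storage = [r * (W.shape[0] + W.shape[1] + 1) for r in ranks]
```

Error falls and storage rises as the rank increases, which is the trade-off the point above refers to.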
Load-bearing premise
The low-rank surrogate must stay accurate enough to the black-box physical layer's true input-output mapping that the gradients it supplies remain useful for updating the rest of the network.
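A minimal sketch of what this premise means in practice, assuming a linear black box simulated by a hidden matrix `W_true`: the forward pass queries the device, while the backward pass uses the transpose of the surrogate factors. The names `blackbox_forward` and `surrogate_backward` and the toy loss are hypothetical and not from the paper.

```python
import numpy as np

def blackbox_forward(x, W_true):
    # Stand-in for a hardware query: only outputs are observable, not W_true.
    return W_true @ x

def surrogate_backward(grad_y, U, S, V):
    """Pull grad_y back through the surrogate Jacobian U S V^T, i.e. return V S^T U^T grad_y."""
    return V @ (S.T @ (U.T @ grad_y))

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
W_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Surrogate factors; here exact, so the premise holds by construction.
Uf, s, Vt = np.linalg.svd(W_true, full_matrices=False)
U, S, V = Uf[:, :r], np.diag(s[:r]), Vt[:r].T

x = rng.standard_normal(n)
y = blackbox_forward(x, W_true)
grad_y = 2.0 * y                         # gradient of the toy loss ||y||^2 w.r.t. y
grad_x_surrogate = surrogate_backward(grad_y, U, S, V)
grad_x_true = W_true.T @ grad_y          # available only because this is a simulation

# A cruder surrogate (rank 1) yields a gradient that points in a different direction.
grad_x_crude = surrogate_backward(grad_y, U[:, :1], S[:1, :1], V[:, :1])
cosine = grad_x_crude @ grad_x_true / (
    np.linalg.norm(grad_x_crude) * np.linalg.norm(grad_x_true)
)
```

When the surrogate is exact the two gradients coincide; the cosine for the cruder surrogate gives one simple way to quantify how much gradient direction is lost as the approximation degrades.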
What would settle it
Training the same hybrid models on the reported tasks but replacing the dynamic surrogate with either a fixed random approximation or no approximation at all, then checking whether accuracy falls well below the digital baseline or training fails to converge.
Original abstract
The growing demand for energy-efficient, high-performance AI systems has led to increased attention on alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including: computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a framework for end-to-end training of hybrid neural networks that combine digital layers with non-differentiable black-box physical components (spatial light modulators, microring resonators, Mach-Zehnder interferometers). It integrates stochastic zeroth-order optimization to update physical-layer parameters together with a dynamic low-rank surrogate model that is refreshed after each forward pass by an implicit projector-splitting integrator; the surrogate is intended to furnish gradients through the physical layer while requiring only minimal hardware queries. Experiments are reported on computer vision, audio classification, and language modeling tasks, with the claim that near-digital baseline accuracy is attained across modalities.
Significance. If the surrogate approximation remains sufficiently accurate to support stable gradient estimates, the work would offer a practical route for incorporating energy-efficient physical hardware into trainable deep-learning pipelines. Credit is due for the multi-modal experimental scope and for the dynamic update mechanism that avoids repeated full-matrix reconstructions. The integration of zeroth-order optimization with an online low-rank surrogate constitutes a concrete contribution at the intersection of hardware-aware learning and gradient-free methods.
major comments (3)
- Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.
- Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.
- Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.
minor comments (2)
- Notation for the low-rank factors (U, S, V) and the precise update equations of the integrator should be collected in a single displayed block for clarity.
- Figure captions could explicitly state the number of hardware queries used per surrogate refresh to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the significance of the work at the intersection of hardware-aware learning and gradient-free methods is recognized. We address each of the major comments below and will incorporate revisions as indicated.
Point-by-point responses
- Referee: Abstract: the central performance claim that the method 'achieves near-digital baseline accuracy' is stated without any numerical values, standard deviations, or direct baseline comparisons, rendering it impossible to evaluate whether the low-rank surrogate actually delivers usable gradients.
  Authors: We agree that the abstract would benefit from more quantitative details to support the performance claim. In the revised manuscript, we will update the abstract to include specific numerical accuracy values (e.g., top-1 accuracy percentages), standard deviations from repeated experiments, and explicit comparisons to the digital baselines for the vision, audio, and language modeling tasks. This will provide a clearer assessment of the surrogate's effectiveness in enabling usable gradients.
  Revision: yes
- Referee: Method section (description of implicit projector-splitting integrator): no error bound, convergence analysis, or empirical measurement of the surrogate approximation error is supplied; because the entire end-to-end training claim rests on this approximation being faithful enough for back-propagation, the absence of such quantification is load-bearing.
  Authors: We acknowledge that quantifying the surrogate approximation error is crucial for validating the approach. The current manuscript emphasizes the practical implementation and end-to-end results, but we will add empirical measurements of the approximation error, such as the Frobenius norm difference between the surrogate and true mappings over the course of training, in a new subsection or figure in the revised version. For theoretical error bounds and convergence analysis, developing rigorous guarantees for the stochastic setting with black-box layers is non-trivial and beyond the scope of the current work; however, we will include a discussion on the observed stability and empirical convergence rates to address this concern.
  Revision: partial
- Referee: Experiments section: no ablation is presented on surrogate rank, query budget per update, or sensitivity to input dimensionality; without these controls the assertion that 'minimal hardware queries' suffice cannot be assessed for the high-dimensional inputs arising in language modeling.
  Authors: We agree that ablations on key hyperparameters would strengthen the experimental section. In the revision, we will add results or analysis on the impact of surrogate rank (e.g., varying rank from 10 to 100) and query budget per update on the final accuracy. Regarding sensitivity to input dimensionality, we will highlight the language modeling experiments, which use high-dimensional token embeddings, and demonstrate that the method maintains performance with a fixed small query budget. If feasible within page limits, a dedicated ablation study will be included.
  Revision: yes
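The approximation-error measurement and rank ablation discussed in these responses could be prototyped along the following lines. This is a sketch under stated assumptions, not the authors' protocol: the device is simulated by a hidden matrix `W` with a linear response, the surrogates are truncated SVDs, and `relative_surrogate_error` is a hypothetical helper. It uses the identity E||Mz||^2 = ||M||_F^2 for standard Gaussian z, so the Frobenius error is estimated from forward queries alone, without reconstructing the full matrix.

```python
import numpy as np

def relative_surrogate_error(query_fn, surrogate, n_inputs, num_probes, rng):
    """Estimate ||W - W_hat||_F / ||W||_F for a linear black box from random probes.

    Relies on E||M z||^2 = ||M||_F^2 for z ~ N(0, I); costs num_probes queries.
    """
    num, den = 0.0, 0.0
    for _ in range(num_probes):
        z = rng.standard_normal(n_inputs)
        y = query_fn(z)                        # one (simulated) hardware query
        num += np.sum((y - surrogate @ z) ** 2)
        den += np.sum(y ** 2)
    return np.sqrt(num / den)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))              # simulated device response
U, s, Vt = np.linalg.svd(W, full_matrices=False)

estimates, exact = {}, {}
for rank in (2, 6, 10):
    W_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-`rank` surrogate
    estimates[rank] = relative_surrogate_error(lambda z: W @ z, W_hat, 12, 2000, rng)
    exact[rank] = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Tracking such an estimate over training, for several ranks and probe budgets, would give the kind of curve the referee asks for; the exact values are computable here only because the device is simulated.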
Circularity Check
No circularity: framework integrates external optimization and hardware measurements with empirical validation
full rationale
The paper proposes a hybrid training framework that combines stochastic zeroth-order optimization for physical-layer parameters with a dynamic low-rank surrogate model updated via the implicit projector-splitting integrator after each forward pass. The central performance claim of near-digital baseline accuracy is supported by direct empirical results across computer vision, audio classification, and language modeling tasks using actual hardware queries on components such as spatial light modulators and microring resonators. No derivation step reduces a prediction or result to a quantity defined inside the paper by construction, and the approach relies on standard external primitives rather than self-referential definitions or load-bearing self-citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- surrogate rank
axioms (1)
- domain assumption: Physical device response can be approximated by a low-rank linear map for the purpose of gradient flow
invented entities (1)
- dynamic low-rank surrogate model: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "dynamic low-rank surrogate model ... implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "stochastic zeroth-order optimization for updating the physical layer's internal parameters"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.