Constrained Variable Projection for Structured Problems

Emanuele Zangrando; Francesco Rinaldi; Francesco Tudisco; Sara Venturini

arxiv: 2606.23939 · v1 · pith:4BYMFIA6new · submitted 2026-06-22 · 🧮 math.OC · cs.LG· cs.NA· math.NA

Constrained Variable Projection for Structured Problems

Emanuele Zangrando , Sara Venturini , Francesco Rinaldi , Francesco Tudisco This is my paper

Pith reviewed 2026-06-26 06:57 UTC · model grok-4.3

classification 🧮 math.OC cs.LGcs.NAmath.NA

keywords variable projectionbilevel optimizationconditional gradientnonlinear least squaresconstrained optimizationdictionary learningsparse autoencoding

0 comments

The pith

Interpreting variable projection as a collapsed bilevel problem produces exact reduced gradients and a convergent conditional-gradient algorithm for constrained nonlinear least-squares tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the elimination of linear variables in variable projection as the lower level of a bilevel program, leaving an upper-level problem whose variables obey convex constraints. This perspective supplies exact gradient expressions for the reduced objective that are compatible with automatic differentiation. A conditional-gradient algorithm is then applied to the reduced problem, together with convergence guarantees that rest on standard smoothness and compactness. Experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning indicate gains in wall-clock time and data efficiency compared with joint optimization of all variables.

Core claim

By interpreting variable projection as a collapsed bilevel optimization problem, exact reduced-gradient formulas compatible with automatic differentiation are derived and a conditional-gradient algorithm is proposed for the resulting constrained reduced problem, with convergence guarantees under standard smoothness and compactness assumptions and extensions discussed for structured lower-level variables.

What carries the argument

The collapsed bilevel formulation that eliminates the lower-level least-squares solve to produce an equivalent constrained reduced problem.

If this is right

The conditional-gradient algorithm converges to a stationary point of the reduced problem under the stated assumptions.
Exact reduced gradients can be obtained via automatic differentiation without additional derivation.
The framework extends to cases with structured lower-level variables.
Wall-clock and data efficiency improve relative to joint optimization on the four reported application domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bilevel collapse could be applied to other separable convex subproblems beyond least squares.
Implementation inside existing automatic-differentiation libraries becomes straightforward for models that contain linear layers subject to constraints.
The approach may reduce memory requirements during training when many variables can be eliminated exactly.

Load-bearing premise

The reduced objective is smooth and the feasible set is compact.

What would settle it

A concrete instance satisfying the smoothness and compactness conditions on which the conditional-gradient iterates fail to converge to a stationary point of the reduced problem.

Figures

Figures reproduced from arXiv: 2606.23939 by Emanuele Zangrando, Francesco Rinaldi, Francesco Tudisco, Sara Venturini.

**Figure 2.** Figure 2: Convergence of CR-VarPro against Projected gradient descent for the Dictionary Learning [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 3.** Figure 3: Convergence of CR-VarPro against Projected gradient descent for the blind deconvolution [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: Fine tuning results ResNet-18 on CIFAR10. Frozen Ridge fits only the linear head as a solu [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Variable projection is a classical technique for separable nonlinear least-squares problems, in which variables that enter linearly are eliminated exactly, yielding a reduced nonlinear problem. By expressing this framework as a particular instance of a broader class of bilevel optimization problems, we develop a constrained variable-projection framework for data-science models, where the remaining variables are subject to convex constraints and the eliminated variables arise from a lower-level least-squares problem. In particular, by interpreting variable projection as a collapsed bilevel optimization problem, we derive exact reduced-gradient formulas compatible with automatic differentiation and propose a conditional-gradient algorithm for the resulting constrained reduced problem. We establish convergence guarantees under standard smoothness and compactness assumptions, and discuss extensions to structured lower-level variables. Numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning suggest that the method can improve wall-clock efficiency and data efficiency relative to natural joint-optimization baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Frames variable projection as a collapsed bilevel problem to get exact reduced gradients under convex constraints on the nonlinear variables, then applies conditional gradient with standard convergence.

read the letter

The main takeaway is that the authors treat classical variable projection as a special case of bilevel optimization, collapse the lower level, and obtain exact reduced gradients that work with automatic differentiation even when the upper-level variables sit inside convex constraints. They then solve the reduced problem with a conditional-gradient (Frank-Wolfe) method and prove convergence under the usual smoothness-plus-compactness assumptions.

This is a clean, incremental extension of the variable-projection idea to the constrained setting that appears in dictionary learning, autoencoders, and deconvolution. The bilevel view is the part that feels new relative to the cited literature, and the claim that the gradients remain exact is the part worth checking.

The abstract gives no equations, so the actual derivation of the reduced gradients cannot be verified here. The numerical section is described only as “suggestive,” with no information on how the baselines were tuned or whether the reported speed-ups survive different random seeds. Those are the usual soft spots for an optimization paper at this stage.

The work is aimed at people who already use variable projection or bilevel methods inside structured machine-learning models. A reader who needs a practical way to keep the linear variables eliminated while respecting convex constraints on the rest will find the framework directly usable.

I would send it to peer review. The central reduction looks worth referee time even if the experiments need more detail.

Referee Report

0 major / 2 minor

Summary. The paper claims that variable projection can be recast as a collapsed bilevel optimization problem, yielding exact reduced-gradient formulas compatible with automatic differentiation. It proposes a conditional-gradient (Frank-Wolfe) algorithm for the resulting constrained reduced problem, establishes convergence under standard smoothness and compactness assumptions on the reduced objective and feasible set, discusses extensions to structured lower-level variables, and reports numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning that suggest gains in wall-clock and data efficiency over joint-optimization baselines.

Significance. If the derivations and convergence analysis hold, the work supplies a principled extension of classical variable projection to settings with convex constraints on the nonlinear variables. The bilevel reduction produces exact reduced gradients that integrate with AD, and the convergence result rests explicitly on the usual conditions for Frank-Wolfe methods rather than novel assumptions. The numerical comparisons provide concrete evidence of practical utility for structured data-science problems.

minor comments (2)

[§3] §3 (or wherever the reduced-gradient derivation appears): the manuscript should explicitly state the precise form of the reduced gradient (e.g., the expression involving the pseudoinverse or adjoint) so that readers can verify compatibility with AD without reconstructing the bilevel collapse.
[Numerical experiments] Numerical experiments section: the description of post-hoc parameter choices (step-size rules, stopping tolerances, constraint sets) should be expanded so that the reported efficiency gains can be reproduced from the stated protocol.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition of its contributions to extending variable projection via bilevel collapse, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper frames variable projection as a collapsed bilevel problem to derive exact reduced-gradient formulas, then applies a conditional-gradient method with convergence under explicitly stated standard smoothness and compactness assumptions on the reduced objective. These assumptions are the usual Frank-Wolfe conditions and are not derived from or equivalent to the target result. The abstract and reader's summary give no indication of self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. The derivation is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on interpreting variable projection as a collapsed bilevel problem and on standard smoothness/compactness conditions for convergence; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption standard smoothness and compactness assumptions suffice for convergence of the conditional-gradient algorithm on the reduced problem
Invoked to establish convergence guarantees (abstract, final sentence)

pith-pipeline@v0.9.1-grok · 5691 in / 1184 out tokens · 18974 ms · 2026-06-26T06:57:04.581758+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 28 canonical work pages

[1]

AGRAWAL, B

A. AGRAWAL, B. AMOS, S. BARRATT, S. BOYD, S. DIAMOND,ANDJ. Z. KOLTER,Differen- tiable convex optimization layers, in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 9558–9570

2019
[2]

ALAIN ANDY

G. ALAIN ANDY. BENGIO,Understanding intermediate layers using linear classifier probes, 2017

2017
[3]

AMOS ANDJ

B. AMOS ANDJ. Z. KOLTER,OptNet: Differentiable optimization as a layer in neural networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 136–145

2017
[4]

J. F. BARD,Practical Bilevel Optimization: Algorithms and Applications, vol. 30 of Nonconvex Optimization and Its Applications, Springer, Boston, MA, 1998,https://doi.org/10.1007/ 978-1-4757-2836-1

1998
[5]

BENGIO, L

Y. BENGIO, L. YAO, G. ALAIN,ANDP. VINCENT,Generalized denoising auto-encoders as gen- erative models, 2013,https://arxiv.org/abs/1305.6663

Pith/arXiv arXiv 2013
[6]

D. P. BERTSEKAS,Convex optimization algorithms, Athena Scientific, Belmont, 2015

2015
[7]

BLONDEL, Q

M. BLONDEL, Q. BERTHET, M. CUTURI, R. FROSTIG, S. HOYER, F. LLINARES-L ´OPEZ, F. PE- DREGOSA,ANDJ.-P. VERT,Efficient and modular implicit differentiation, in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 5230–5242

2022
[8]

I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Active set complexity of the away-step Frank–Wolfe algorithm, SIAM Journal on Optimization, 30 (2020), pp. 2470–2500,https://doi.org/10. 1137/19M1309419

2020
[9]

I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Frank–Wolfe and friends: a journey into projection- free first-order optimization methods, 4OR, 19 (2021), pp. 313–345,https://doi.org/10. 1007/s10288-021-00493-y

2021
[10]

BRAUN, A

G. BRAUN, A. CARDERERA, C. W. COMBETTES, H. HASSANI, A. KARBASI, A. MOKHTARI, ANDS. POKUTTA,Conditional Gradient Methods: From Core Principles to AI Applications, MOS- SIAM Series on Optimization, Society for Industrial and Applied Mathematics, 2025,https: //doi.org/10.1137/1.9781611978568

work page doi:10.1137/1.9781611978568 2025
[11]

J. J. BRUST,Nonlinear least-squares for large-scale machine learning using stochastic jacobian estimates, arXiv preprint arXiv:1412.6980, (2021)

Pith/arXiv arXiv 2021
[12]

B ¨ARLIGEA, P

A. B ¨ARLIGEA, P. HOCHSTAFFL,ANDF. SCHREIER,A generalized variable projection algorithm for least squares problems in atmospheric remote sensing, Mathematics, 11 (2023),https://doi. org/10.3390/math11132839

work page doi:10.3390/math11132839 2023
[13]

E. J. CAND `ES, X. LI, Y. MA,ANDJ. WRIGHT,Robust principal component analysis?, Journal of the ACM, 58 (2011), pp. 11:1–11:37,https://doi.org/10.1145/1970392.1970395. 17

work page doi:10.1145/1970392.1970395 2011
[14]

CARLINI ANDD

N. CARLINI ANDD. WAGNER,Towards evaluating the robustness of neural networks, in 2017 IEEE Symposium on Security and Privacy, IEEE, 2017, pp. 39–57,https://doi.org/10.1109/ SP.2017.49

2017
[15]

CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct

G. CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct. 2009,https://doi.org/10.1007/978-90-481-2785-6

work page doi:10.1007/978-90-481-2785-6 2010
[16]

C. CHEN, D. LI, J. YAN,ANDX. YANG,Modeling dynamic user preference via dictionary learn- ing for sequential recommendation, IEEE Transactions on Knowledge and Data Engineering, 34 (2022), pp. 5446–5458,https://doi.org/10.1109/TKDE.2021.3050407

work page doi:10.1109/tkde.2021.3050407 2022
[17]

G. CHEN, P. XUE, M. GAN, J. CHEN, W. GUO,ANDC. P. CHEN,Variable projection algo- rithms: Theoretical insights and a novel approach for problems with large residual, Automatica, 177 (2025), p. 112300,https://doi.org/10.1016/j.automatica.2025.112300

work page doi:10.1016/j.automatica.2025.112300 2025
[18]

COLSON, P

B. COLSON, P. MARCOTTE,ANDG. SAVARD,An overview of bilevel optimization, Annals of Operations Research, 153 (2007), pp. 235–256,https://doi.org/10.1007/ s10479-007-0176-2

2007
[19]

C. W. COMBETTES ANDS. POKUTTA,Complexity of linear minimization and projection on some sets, Operations Research Letters, 49 (2021), pp. 565–571,https://doi.org/10.1016/j.orl. 2021.06.019

work page doi:10.1016/j.orl 2021
[20]

DEMPE ANDA

S. DEMPE ANDA. B. ZEMKOHO, eds.,Bilevel Optimization: Advances and Next Challenges, vol. 161 of Springer Optimization and Its Applications, Springer, Cham, 2020,https://doi. org/10.1007/978-3-030-52119-6

work page doi:10.1007/978-3-030-52119-6 2020
[21]

DONG ANDJ

S. DONG ANDJ. YANG,Numerical approximation of partial differential equations by a variable projection method with artificial neural networks, Computer Methods in Applied Mechanics and Engineering, 398 (2022), p. 115284,https://doi.org/10.1016/j.cma.2022.115284

work page doi:10.1016/j.cma.2022.115284 2022
[22]

D. L. DONOHO,Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306,https://doi.org/10.1109/TIT.2006.871582

work page doi:10.1109/tit.2006.871582 2006
[23]

DUCHI, S

J. DUCHI, S. SHALEV-SHWARTZ, Y. SINGER,ANDT. CHANDRA,Efficient projections onto the ℓ1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 272–279,https://doi.org/10.1145/1390156.1390191

work page doi:10.1145/1390156.1390191 2008
[24]

DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

M. DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

arXiv 2026
[25]

M. I. ESPANOL ANDG. JERONIMO,Local convergence analysis of a variable projection method for regularized separable nonlinear inverse problems, SIAM Journal on Matrix Analysis and Ap- plications, 46 (2025), pp. 858–878,https://doi.org/10.1137/24M1639087

work page doi:10.1137/24m1639087 2025
[26]

FRANCESCHI, P

L. FRANCESCHI, P. FRASCONI, S. SALZO, R. GRAZZI,ANDM. PONTIL,Bilevel programming for hyperparameter optimization and meta-learning, in Proceedings of the 35th International Con- ference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1568–1577

2018
[27]

FRANK ANDP

M. FRANK ANDP. WOLFE,An algorithm for quadratic programming, Nav. Res. Logist. Q., 3 (1956), pp. 95–110,https://doi.org/10.1002/nav.3800030109

work page doi:10.1002/nav.3800030109 1956
[28]

GILLIS,Nonnegative Matrix Factorization, vol

N. GILLIS,Nonnegative Matrix Factorization, vol. 2 of Data Science, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2020,https://doi.org/10.1137/1.9781611976410

work page doi:10.1137/1.9781611976410 2020
[29]

GOLUB ANDV

G. GOLUB ANDV. PEREYRA,Separable nonlinear least squares: the variable projection method and its applications, Inverse problems, 19 (2003), p. R1,https://doi.org/10.1088/ 0266-5611/19/2/201

2003
[30]

G. H. GOLUB ANDR. LEVEQUE,Extensions and uses of the variable projection algorithm for solving nonlinear least squares problems., in Proceedings of the 1979 Army Numerical Analysis and Computers Conference, 1979

1979
[31]

G. H. GOLUB ANDV. PEREYRA,The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM Journal on Numerical Analysis, 10 (1973), pp. 413–432,https://doi.org/10.1137/0710036. 18

work page doi:10.1137/0710036 1973
[32]

GOULD, B

S. GOULD, B. FERNANDO, A. CHERIAN, P. ANDERSON, R. SANTACRUZ,ANDE. GUO,On dif- ferentiating parameterized argmin and argmax problems with application to bi-level optimization, arXiv preprint arXiv:1607.05447, (2016),https://arxiv.org/abs/1607.05447

Pith/arXiv arXiv 2016
[33]

Statistical

T. HASTIE, R. TIBSHIRANI,ANDM. WAINWRIGHT,Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC, Boca Raton, FL, 2015,https://doi.org/ 10.1201/b18401

work page doi:10.1201/b18401 2015
[34]

K. HE, X. ZHANG, S. REN,ANDJ. SUN,Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, https://doi.org/10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[35]

G. E. HINTON ANDR. R. SALAKHUTDINOV,Reducing the dimensionality of data with neural networks, Science, 313 (2006), pp. 504–507,https://doi.org/10.1126/science.1127647

work page doi:10.1126/science.1127647 2006
[36]

HUBEN, H

R. HUBEN, H. CUNNINGHAM, L. R. SMITH, A. EWART,ANDL. SHARKEY,Sparse autoen- coders find highly interpretable features in language models, in The Twelfth International Confer- ence on Learning Representations, 2024

2024
[37]

JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S

M. JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 427–435

2013
[38]

JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J

M. JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J. A. K. Suykens, ed., Taylor & Francis, 2014, pp. 1–26,https://doi.org/10.1201/b17558-4

work page doi:10.1201/b17558-4 2014
[39]

KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp

L. KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp. 49–57,https://doi.org/10.1007/ BF01932995

1975
[40]

KRIZHEVSKY ANDG

A. KRIZHEVSKY ANDG. HINTON,Learning multiple layers of features from tiny images, Tech. Report 0, University of Toronto, Toronto, Ontario, 2009

2009
[41]

W. H. LAWTON ANDE. A. SYLVESTRE,Elimination of linear parameters in nonlinear regression, Technometrics, 13 (1971), pp. 461–467,https://doi.org/10.2307/1267160

work page doi:10.2307/1267160 1971
[42]

LECUN, C

Y. LECUN, C. CORTES,ANDC. J. BURGES,The mnist database of handwritten digits, (1998)

1998
[43]

C. H. LIM ANDS. J. WRIGHT,A box-constrained approach for hard permutation problems, in Proceedings of the 33rd International Conference on Machine Learning, vol. 48 of Proceedings of Machine Learning Research, PMLR, 2016, pp. 2454–2463

2016
[44]

Selective review of offline change point detection methods,

I. MARKOVSKY,Recent progress on variable projection methods for structured low-rank ap- proximation, Signal Processing, 96 (2014), p. 406–419,https://doi.org/10.1016/j.sigpro. 2013.09.021

work page doi:10.1016/j.sigpro 2014
[45]

NEWMAN, J

E. NEWMAN, J. CHUNG, M. CHUNG,ANDL. RUTHOTTO,slimtrain—a stochastic approximation method for training separable deep neural networks, SIAM Journal on Scientific Computing, 44 (2022), pp. A2322–A2348,https://doi.org/10.1137/21M1452512

work page doi:10.1137/21m1452512 2022
[46]

NEWMAN, L

E. NEWMAN, L. RUTHOTTO, J. HART,ANDB.VANBLOEMENWAANDERS,Train like a (var)pro: Efficient training of neural networks with variable projection, SIAM Journal on Mathe- matics of Data Science, 3 (2021), pp. 1041–1066,https://doi.org/10.1137/20M1359511

work page doi:10.1137/20m1359511 2021
[47]

PETHICK, W

T. PETHICK, W. XIE, K. ANTONAKOPOULOS, Z. ZHU, A. SILVETI-FALLS,ANDV. CEVHER, Training deep learning models with norm-constrained LMOs, in Proceedings of the 42nd Interna- tional Conference on Machine Learning, vol. 267 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 49069–49104

2025
[48]

RECHT, M

B. RECHT, M. FAZEL,ANDP. A. PARRILO,Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501,https://doi. org/10.1137/070697835

work page doi:10.1137/070697835 2010
[49]

RUBINSTEIN, A

R. RUBINSTEIN, A. M. BRUCKSTEIN,ANDM. ELAD,Dictionaries for sparse representation modeling, Proceedings of the IEEE, 98 (2010), pp. 1045–1057,https://doi.org/10.1109/ JPROC.2010.2040551. 19

arXiv 2010
[50]

D. B. C. SALZER, M. I. ESPA ˜NOL,ANDG. JERONIMO,Variable projection methods for solving regularized separable inverse problems with applications to semi-blind image deblurring, 2026, https://arxiv.org/abs/2601.05224

arXiv 2026
[51]

Sequential Monte Carlo samplers

R. TIBSHIRANI,Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58 (1996), pp. 267–288,https://doi.org/10.1111/j. 2517-6161.1996.tb02080.x

work page doi:10.1111/j 1996
[52]

TSCHANNEN, O

M. TSCHANNEN, O. BACHEM,ANDM. LUCIC,Recent advances in autoencoder-based represen- tation learning, 2018,https://arxiv.org/abs/1812.05069

Pith/arXiv arXiv 2018
[53]

USEVICH ANDI

K. USEVICH ANDI. MARKOVSKY,Variable projection for affinely structured low-rank approx- imation in weighted 2-norms, Journal of Computational and Applied Mathematics, 272 (2014), pp. 430–448,https://doi.org/10.1016/j.cam.2013.04.034

work page doi:10.1016/j.cam.2013.04.034 2014
[54]

USEVICH ANDI

K. USEVICH ANDI. MARKOVSKY,Variable projection methods for approximate (greatest) com- mon divisor computations, Theoretical Computer Science, 681 (2017), pp. 176–198,https: //doi.org/10.1016/j.tcs.2017.03.028. Symbolic Numeric Computation

work page doi:10.1016/j.tcs.2017.03.028 2017
[55]

T.VANLEEUWEN ANDA. Y. ARAVKIN,Variable projection for nonsmooth problems, SIAM Journal on Scientific Computing, 43 (2021), pp. S249–S268,https://doi.org/10.1137/ 20M1348650

2021
[56]

XU, G.-Y

H.-L. XU, G.-Y. CHEN, S.-Q. CHENG, M. GAN,ANDJ. CHEN,Variable projection algorithms with sparse constraint for separable nonlinear models, Control Theory and Technology, 22 (2024), pp. 135–146,https://doi.org/10.1007/s11768-023-00194-3

work page doi:10.1007/s11768-023-00194-3 2024
[57]

ZANGRANDO, S

E. ZANGRANDO, S. VENTURINI, F. RINALDI,ANDF. TUDISCO,dEBORA: Efficient bilevel optimization-based low-rank adaptation, in International Conference on Learning Representations, 2025. 20

2025

[1] [1]

AGRAWAL, B

A. AGRAWAL, B. AMOS, S. BARRATT, S. BOYD, S. DIAMOND,ANDJ. Z. KOLTER,Differen- tiable convex optimization layers, in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 9558–9570

2019

[2] [2]

ALAIN ANDY

G. ALAIN ANDY. BENGIO,Understanding intermediate layers using linear classifier probes, 2017

2017

[3] [3]

AMOS ANDJ

B. AMOS ANDJ. Z. KOLTER,OptNet: Differentiable optimization as a layer in neural networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 136–145

2017

[4] [4]

J. F. BARD,Practical Bilevel Optimization: Algorithms and Applications, vol. 30 of Nonconvex Optimization and Its Applications, Springer, Boston, MA, 1998,https://doi.org/10.1007/ 978-1-4757-2836-1

1998

[5] [5]

BENGIO, L

Y. BENGIO, L. YAO, G. ALAIN,ANDP. VINCENT,Generalized denoising auto-encoders as gen- erative models, 2013,https://arxiv.org/abs/1305.6663

Pith/arXiv arXiv 2013

[6] [6]

D. P. BERTSEKAS,Convex optimization algorithms, Athena Scientific, Belmont, 2015

2015

[7] [7]

BLONDEL, Q

M. BLONDEL, Q. BERTHET, M. CUTURI, R. FROSTIG, S. HOYER, F. LLINARES-L ´OPEZ, F. PE- DREGOSA,ANDJ.-P. VERT,Efficient and modular implicit differentiation, in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 5230–5242

2022

[8] [8]

I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Active set complexity of the away-step Frank–Wolfe algorithm, SIAM Journal on Optimization, 30 (2020), pp. 2470–2500,https://doi.org/10. 1137/19M1309419

2020

[9] [9]

I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Frank–Wolfe and friends: a journey into projection- free first-order optimization methods, 4OR, 19 (2021), pp. 313–345,https://doi.org/10. 1007/s10288-021-00493-y

2021

[10] [10]

BRAUN, A

G. BRAUN, A. CARDERERA, C. W. COMBETTES, H. HASSANI, A. KARBASI, A. MOKHTARI, ANDS. POKUTTA,Conditional Gradient Methods: From Core Principles to AI Applications, MOS- SIAM Series on Optimization, Society for Industrial and Applied Mathematics, 2025,https: //doi.org/10.1137/1.9781611978568

work page doi:10.1137/1.9781611978568 2025

[11] [11]

J. J. BRUST,Nonlinear least-squares for large-scale machine learning using stochastic jacobian estimates, arXiv preprint arXiv:1412.6980, (2021)

Pith/arXiv arXiv 2021

[12] [12]

B ¨ARLIGEA, P

A. B ¨ARLIGEA, P. HOCHSTAFFL,ANDF. SCHREIER,A generalized variable projection algorithm for least squares problems in atmospheric remote sensing, Mathematics, 11 (2023),https://doi. org/10.3390/math11132839

work page doi:10.3390/math11132839 2023

[13] [13]

E. J. CAND `ES, X. LI, Y. MA,ANDJ. WRIGHT,Robust principal component analysis?, Journal of the ACM, 58 (2011), pp. 11:1–11:37,https://doi.org/10.1145/1970392.1970395. 17

work page doi:10.1145/1970392.1970395 2011

[14] [14]

CARLINI ANDD

N. CARLINI ANDD. WAGNER,Towards evaluating the robustness of neural networks, in 2017 IEEE Symposium on Security and Privacy, IEEE, 2017, pp. 39–57,https://doi.org/10.1109/ SP.2017.49

2017

[15] [15]

CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct

G. CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct. 2009,https://doi.org/10.1007/978-90-481-2785-6

work page doi:10.1007/978-90-481-2785-6 2010

[16] [16]

C. CHEN, D. LI, J. YAN,ANDX. YANG,Modeling dynamic user preference via dictionary learn- ing for sequential recommendation, IEEE Transactions on Knowledge and Data Engineering, 34 (2022), pp. 5446–5458,https://doi.org/10.1109/TKDE.2021.3050407

work page doi:10.1109/tkde.2021.3050407 2022

[17] [17]

G. CHEN, P. XUE, M. GAN, J. CHEN, W. GUO,ANDC. P. CHEN,Variable projection algo- rithms: Theoretical insights and a novel approach for problems with large residual, Automatica, 177 (2025), p. 112300,https://doi.org/10.1016/j.automatica.2025.112300

work page doi:10.1016/j.automatica.2025.112300 2025

[18] [18]

COLSON, P

B. COLSON, P. MARCOTTE,ANDG. SAVARD,An overview of bilevel optimization, Annals of Operations Research, 153 (2007), pp. 235–256,https://doi.org/10.1007/ s10479-007-0176-2

2007

[19] [19]

C. W. COMBETTES ANDS. POKUTTA,Complexity of linear minimization and projection on some sets, Operations Research Letters, 49 (2021), pp. 565–571,https://doi.org/10.1016/j.orl. 2021.06.019

work page doi:10.1016/j.orl 2021

[20] [20]

DEMPE ANDA

S. DEMPE ANDA. B. ZEMKOHO, eds.,Bilevel Optimization: Advances and Next Challenges, vol. 161 of Springer Optimization and Its Applications, Springer, Cham, 2020,https://doi. org/10.1007/978-3-030-52119-6

work page doi:10.1007/978-3-030-52119-6 2020

[21] [21]

DONG ANDJ

S. DONG ANDJ. YANG,Numerical approximation of partial differential equations by a variable projection method with artificial neural networks, Computer Methods in Applied Mechanics and Engineering, 398 (2022), p. 115284,https://doi.org/10.1016/j.cma.2022.115284

work page doi:10.1016/j.cma.2022.115284 2022

[22] [22]

D. L. DONOHO,Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306,https://doi.org/10.1109/TIT.2006.871582

work page doi:10.1109/tit.2006.871582 2006

[23] [23]

DUCHI, S

J. DUCHI, S. SHALEV-SHWARTZ, Y. SINGER,ANDT. CHANDRA,Efficient projections onto the ℓ1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 272–279,https://doi.org/10.1145/1390156.1390191

work page doi:10.1145/1390156.1390191 2008

[24] [24]

DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

M. DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

arXiv 2026

[25] [25]

M. I. ESPANOL ANDG. JERONIMO,Local convergence analysis of a variable projection method for regularized separable nonlinear inverse problems, SIAM Journal on Matrix Analysis and Ap- plications, 46 (2025), pp. 858–878,https://doi.org/10.1137/24M1639087

work page doi:10.1137/24m1639087 2025

[26] [26]

FRANCESCHI, P

L. FRANCESCHI, P. FRASCONI, S. SALZO, R. GRAZZI,ANDM. PONTIL,Bilevel programming for hyperparameter optimization and meta-learning, in Proceedings of the 35th International Con- ference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1568–1577

2018

[27] [27]

FRANK ANDP

M. FRANK ANDP. WOLFE,An algorithm for quadratic programming, Nav. Res. Logist. Q., 3 (1956), pp. 95–110,https://doi.org/10.1002/nav.3800030109

work page doi:10.1002/nav.3800030109 1956

[28] [28]

GILLIS,Nonnegative Matrix Factorization, vol

N. GILLIS,Nonnegative Matrix Factorization, vol. 2 of Data Science, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2020,https://doi.org/10.1137/1.9781611976410

work page doi:10.1137/1.9781611976410 2020

[29] [29]

GOLUB ANDV

G. GOLUB ANDV. PEREYRA,Separable nonlinear least squares: the variable projection method and its applications, Inverse problems, 19 (2003), p. R1,https://doi.org/10.1088/ 0266-5611/19/2/201

2003

[30] [30]

G. H. GOLUB ANDR. LEVEQUE,Extensions and uses of the variable projection algorithm for solving nonlinear least squares problems., in Proceedings of the 1979 Army Numerical Analysis and Computers Conference, 1979

1979

[31] [31]

G. H. GOLUB ANDV. PEREYRA,The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM Journal on Numerical Analysis, 10 (1973), pp. 413–432,https://doi.org/10.1137/0710036. 18

work page doi:10.1137/0710036 1973

[32] [32]

GOULD, B

S. GOULD, B. FERNANDO, A. CHERIAN, P. ANDERSON, R. SANTACRUZ,ANDE. GUO,On dif- ferentiating parameterized argmin and argmax problems with application to bi-level optimization, arXiv preprint arXiv:1607.05447, (2016),https://arxiv.org/abs/1607.05447

Pith/arXiv arXiv 2016

[33] [33]

Statistical

T. HASTIE, R. TIBSHIRANI,ANDM. WAINWRIGHT,Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC, Boca Raton, FL, 2015,https://doi.org/ 10.1201/b18401

work page doi:10.1201/b18401 2015

[34] [34]

K. HE, X. ZHANG, S. REN,ANDJ. SUN,Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, https://doi.org/10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016

[35] [35]

G. E. HINTON ANDR. R. SALAKHUTDINOV,Reducing the dimensionality of data with neural networks, Science, 313 (2006), pp. 504–507,https://doi.org/10.1126/science.1127647

work page doi:10.1126/science.1127647 2006

[36] [36]

HUBEN, H

R. HUBEN, H. CUNNINGHAM, L. R. SMITH, A. EWART,ANDL. SHARKEY,Sparse autoen- coders find highly interpretable features in language models, in The Twelfth International Confer- ence on Learning Representations, 2024

2024

[37] [37]

JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S

M. JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 427–435

2013

[38] [38]

JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J

M. JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J. A. K. Suykens, ed., Taylor & Francis, 2014, pp. 1–26,https://doi.org/10.1201/b17558-4

work page doi:10.1201/b17558-4 2014

[39] [39]

KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp

L. KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp. 49–57,https://doi.org/10.1007/ BF01932995

1975

[40] [40]

KRIZHEVSKY ANDG

A. KRIZHEVSKY ANDG. HINTON,Learning multiple layers of features from tiny images, Tech. Report 0, University of Toronto, Toronto, Ontario, 2009

2009

[41] [41]

W. H. LAWTON ANDE. A. SYLVESTRE,Elimination of linear parameters in nonlinear regression, Technometrics, 13 (1971), pp. 461–467,https://doi.org/10.2307/1267160

work page doi:10.2307/1267160 1971

[42] [42]

LECUN, C

Y. LECUN, C. CORTES,ANDC. J. BURGES,The mnist database of handwritten digits, (1998)

1998

[43] [43]

C. H. LIM ANDS. J. WRIGHT,A box-constrained approach for hard permutation problems, in Proceedings of the 33rd International Conference on Machine Learning, vol. 48 of Proceedings of Machine Learning Research, PMLR, 2016, pp. 2454–2463

2016

[44] [44]

Selective review of offline change point detection methods,

I. MARKOVSKY,Recent progress on variable projection methods for structured low-rank ap- proximation, Signal Processing, 96 (2014), p. 406–419,https://doi.org/10.1016/j.sigpro. 2013.09.021

work page doi:10.1016/j.sigpro 2014

[45] [45]

NEWMAN, J

E. NEWMAN, J. CHUNG, M. CHUNG,ANDL. RUTHOTTO,slimtrain—a stochastic approximation method for training separable deep neural networks, SIAM Journal on Scientific Computing, 44 (2022), pp. A2322–A2348,https://doi.org/10.1137/21M1452512

work page doi:10.1137/21m1452512 2022

[46] [46]

NEWMAN, L

E. NEWMAN, L. RUTHOTTO, J. HART,ANDB.VANBLOEMENWAANDERS,Train like a (var)pro: Efficient training of neural networks with variable projection, SIAM Journal on Mathe- matics of Data Science, 3 (2021), pp. 1041–1066,https://doi.org/10.1137/20M1359511

work page doi:10.1137/20m1359511 2021

[47] [47]

PETHICK, W

T. PETHICK, W. XIE, K. ANTONAKOPOULOS, Z. ZHU, A. SILVETI-FALLS,ANDV. CEVHER, Training deep learning models with norm-constrained LMOs, in Proceedings of the 42nd Interna- tional Conference on Machine Learning, vol. 267 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 49069–49104

2025

[48] [48]

RECHT, M

B. RECHT, M. FAZEL,ANDP. A. PARRILO,Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501,https://doi. org/10.1137/070697835

work page doi:10.1137/070697835 2010

[49] [49]

RUBINSTEIN, A

R. RUBINSTEIN, A. M. BRUCKSTEIN,ANDM. ELAD,Dictionaries for sparse representation modeling, Proceedings of the IEEE, 98 (2010), pp. 1045–1057,https://doi.org/10.1109/ JPROC.2010.2040551. 19

arXiv 2010

[50] [50]

D. B. C. SALZER, M. I. ESPA ˜NOL,ANDG. JERONIMO,Variable projection methods for solving regularized separable inverse problems with applications to semi-blind image deblurring, 2026, https://arxiv.org/abs/2601.05224

arXiv 2026

[51] [51]

Sequential Monte Carlo samplers

R. TIBSHIRANI,Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58 (1996), pp. 267–288,https://doi.org/10.1111/j. 2517-6161.1996.tb02080.x

work page doi:10.1111/j 1996

[52] [52]

TSCHANNEN, O

M. TSCHANNEN, O. BACHEM,ANDM. LUCIC,Recent advances in autoencoder-based represen- tation learning, 2018,https://arxiv.org/abs/1812.05069

Pith/arXiv arXiv 2018

[53] [53]

USEVICH ANDI

K. USEVICH ANDI. MARKOVSKY,Variable projection for affinely structured low-rank approx- imation in weighted 2-norms, Journal of Computational and Applied Mathematics, 272 (2014), pp. 430–448,https://doi.org/10.1016/j.cam.2013.04.034

work page doi:10.1016/j.cam.2013.04.034 2014

[54] [54]

USEVICH ANDI

K. USEVICH ANDI. MARKOVSKY,Variable projection methods for approximate (greatest) com- mon divisor computations, Theoretical Computer Science, 681 (2017), pp. 176–198,https: //doi.org/10.1016/j.tcs.2017.03.028. Symbolic Numeric Computation

work page doi:10.1016/j.tcs.2017.03.028 2017

[55] [55]

T.VANLEEUWEN ANDA. Y. ARAVKIN,Variable projection for nonsmooth problems, SIAM Journal on Scientific Computing, 43 (2021), pp. S249–S268,https://doi.org/10.1137/ 20M1348650

2021

[56] [56]

XU, G.-Y

H.-L. XU, G.-Y. CHEN, S.-Q. CHENG, M. GAN,ANDJ. CHEN,Variable projection algorithms with sparse constraint for separable nonlinear models, Control Theory and Technology, 22 (2024), pp. 135–146,https://doi.org/10.1007/s11768-023-00194-3

work page doi:10.1007/s11768-023-00194-3 2024

[57] [57]

ZANGRANDO, S

E. ZANGRANDO, S. VENTURINI, F. RINALDI,ANDF. TUDISCO,dEBORA: Efficient bilevel optimization-based low-rank adaptation, in International Conference on Learning Representations, 2025. 20

2025