pith. sign in

arxiv: 2606.23939 · v1 · pith:4BYMFIA6new · submitted 2026-06-22 · 🧮 math.OC · cs.LG· cs.NA· math.NA

Constrained Variable Projection for Structured Problems

Pith reviewed 2026-06-26 06:57 UTC · model grok-4.3

classification 🧮 math.OC cs.LGcs.NAmath.NA
keywords variable projectionbilevel optimizationconditional gradientnonlinear least squaresconstrained optimizationdictionary learningsparse autoencoding
0
0 comments X

The pith

Interpreting variable projection as a collapsed bilevel problem produces exact reduced gradients and a convergent conditional-gradient algorithm for constrained nonlinear least-squares tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the elimination of linear variables in variable projection as the lower level of a bilevel program, leaving an upper-level problem whose variables obey convex constraints. This perspective supplies exact gradient expressions for the reduced objective that are compatible with automatic differentiation. A conditional-gradient algorithm is then applied to the reduced problem, together with convergence guarantees that rest on standard smoothness and compactness. Experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning indicate gains in wall-clock time and data efficiency compared with joint optimization of all variables.

Core claim

By interpreting variable projection as a collapsed bilevel optimization problem, exact reduced-gradient formulas compatible with automatic differentiation are derived and a conditional-gradient algorithm is proposed for the resulting constrained reduced problem, with convergence guarantees under standard smoothness and compactness assumptions and extensions discussed for structured lower-level variables.

What carries the argument

The collapsed bilevel formulation that eliminates the lower-level least-squares solve to produce an equivalent constrained reduced problem.

If this is right

  • The conditional-gradient algorithm converges to a stationary point of the reduced problem under the stated assumptions.
  • Exact reduced gradients can be obtained via automatic differentiation without additional derivation.
  • The framework extends to cases with structured lower-level variables.
  • Wall-clock and data efficiency improve relative to joint optimization on the four reported application domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bilevel collapse could be applied to other separable convex subproblems beyond least squares.
  • Implementation inside existing automatic-differentiation libraries becomes straightforward for models that contain linear layers subject to constraints.
  • The approach may reduce memory requirements during training when many variables can be eliminated exactly.

Load-bearing premise

The reduced objective is smooth and the feasible set is compact.

What would settle it

A concrete instance satisfying the smoothness and compactness conditions on which the conditional-gradient iterates fail to converge to a stationary point of the reduced problem.

Figures

Figures reproduced from arXiv: 2606.23939 by Emanuele Zangrando, Francesco Rinaldi, Francesco Tudisco, Sara Venturini.

Figure 1
Figure 1. Figure 1: Convergence of CR-VarPro against Projected gradient descent for the Autoencoding problem [PITH_FULL_IMAGE:figures/full_fig_p014_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Convergence of CR-VarPro against Projected gradient descent for the Dictionary Learning [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Convergence of CR-VarPro against Projected gradient descent for the blind deconvolution [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Fine tuning results ResNet-18 on CIFAR10. Frozen Ridge fits only the linear head as a solu [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
read the original abstract

Variable projection is a classical technique for separable nonlinear least-squares problems, in which variables that enter linearly are eliminated exactly, yielding a reduced nonlinear problem. By expressing this framework as a particular instance of a broader class of bilevel optimization problems, we develop a constrained variable-projection framework for data-science models, where the remaining variables are subject to convex constraints and the eliminated variables arise from a lower-level least-squares problem. In particular, by interpreting variable projection as a collapsed bilevel optimization problem, we derive exact reduced-gradient formulas compatible with automatic differentiation and propose a conditional-gradient algorithm for the resulting constrained reduced problem. We establish convergence guarantees under standard smoothness and compactness assumptions, and discuss extensions to structured lower-level variables. Numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning suggest that the method can improve wall-clock efficiency and data efficiency relative to natural joint-optimization baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that variable projection can be recast as a collapsed bilevel optimization problem, yielding exact reduced-gradient formulas compatible with automatic differentiation. It proposes a conditional-gradient (Frank-Wolfe) algorithm for the resulting constrained reduced problem, establishes convergence under standard smoothness and compactness assumptions on the reduced objective and feasible set, discusses extensions to structured lower-level variables, and reports numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning that suggest gains in wall-clock and data efficiency over joint-optimization baselines.

Significance. If the derivations and convergence analysis hold, the work supplies a principled extension of classical variable projection to settings with convex constraints on the nonlinear variables. The bilevel reduction produces exact reduced gradients that integrate with AD, and the convergence result rests explicitly on the usual conditions for Frank-Wolfe methods rather than novel assumptions. The numerical comparisons provide concrete evidence of practical utility for structured data-science problems.

minor comments (2)
  1. [§3] §3 (or wherever the reduced-gradient derivation appears): the manuscript should explicitly state the precise form of the reduced gradient (e.g., the expression involving the pseudoinverse or adjoint) so that readers can verify compatibility with AD without reconstructing the bilevel collapse.
  2. [Numerical experiments] Numerical experiments section: the description of post-hoc parameter choices (step-size rules, stopping tolerances, constraint sets) should be expanded so that the reported efficiency gains can be reproduced from the stated protocol.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition of its contributions to extending variable projection via bilevel collapse, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper frames variable projection as a collapsed bilevel problem to derive exact reduced-gradient formulas, then applies a conditional-gradient method with convergence under explicitly stated standard smoothness and compactness assumptions on the reduced objective. These assumptions are the usual Frank-Wolfe conditions and are not derived from or equivalent to the target result. The abstract and reader's summary give no indication of self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. The derivation is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on interpreting variable projection as a collapsed bilevel problem and on standard smoothness/compactness conditions for convergence; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption standard smoothness and compactness assumptions suffice for convergence of the conditional-gradient algorithm on the reduced problem
    Invoked to establish convergence guarantees (abstract, final sentence)

pith-pipeline@v0.9.1-grok · 5691 in / 1184 out tokens · 18974 ms · 2026-06-26T06:57:04.581758+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 28 canonical work pages

  1. [1]

    AGRAWAL, B

    A. AGRAWAL, B. AMOS, S. BARRATT, S. BOYD, S. DIAMOND,ANDJ. Z. KOLTER,Differen- tiable convex optimization layers, in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 9558–9570

  2. [2]

    ALAIN ANDY

    G. ALAIN ANDY. BENGIO,Understanding intermediate layers using linear classifier probes, 2017

  3. [3]

    AMOS ANDJ

    B. AMOS ANDJ. Z. KOLTER,OptNet: Differentiable optimization as a layer in neural networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 136–145

  4. [4]

    J. F. BARD,Practical Bilevel Optimization: Algorithms and Applications, vol. 30 of Nonconvex Optimization and Its Applications, Springer, Boston, MA, 1998,https://doi.org/10.1007/ 978-1-4757-2836-1

  5. [5]

    BENGIO, L

    Y. BENGIO, L. YAO, G. ALAIN,ANDP. VINCENT,Generalized denoising auto-encoders as gen- erative models, 2013,https://arxiv.org/abs/1305.6663

  6. [6]

    D. P. BERTSEKAS,Convex optimization algorithms, Athena Scientific, Belmont, 2015

  7. [7]

    BLONDEL, Q

    M. BLONDEL, Q. BERTHET, M. CUTURI, R. FROSTIG, S. HOYER, F. LLINARES-L ´OPEZ, F. PE- DREGOSA,ANDJ.-P. VERT,Efficient and modular implicit differentiation, in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 5230–5242

  8. [8]

    I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Active set complexity of the away-step Frank–Wolfe algorithm, SIAM Journal on Optimization, 30 (2020), pp. 2470–2500,https://doi.org/10. 1137/19M1309419

  9. [9]

    I. M. BOMZE, F. RINALDI,ANDD. ZEFFIRO,Frank–Wolfe and friends: a journey into projection- free first-order optimization methods, 4OR, 19 (2021), pp. 313–345,https://doi.org/10. 1007/s10288-021-00493-y

  10. [10]

    BRAUN, A

    G. BRAUN, A. CARDERERA, C. W. COMBETTES, H. HASSANI, A. KARBASI, A. MOKHTARI, ANDS. POKUTTA,Conditional Gradient Methods: From Core Principles to AI Applications, MOS- SIAM Series on Optimization, Society for Industrial and Applied Mathematics, 2025,https: //doi.org/10.1137/1.9781611978568

  11. [11]

    J. J. BRUST,Nonlinear least-squares for large-scale machine learning using stochastic jacobian estimates, arXiv preprint arXiv:1412.6980, (2021)

  12. [12]

    B ¨ARLIGEA, P

    A. B ¨ARLIGEA, P. HOCHSTAFFL,ANDF. SCHREIER,A generalized variable projection algorithm for least squares problems in atmospheric remote sensing, Mathematics, 11 (2023),https://doi. org/10.3390/math11132839

  13. [13]

    E. J. CAND `ES, X. LI, Y. MA,ANDJ. WRIGHT,Robust principal component analysis?, Journal of the ACM, 58 (2011), pp. 11:1–11:37,https://doi.org/10.1145/1970392.1970395. 17

  14. [14]

    CARLINI ANDD

    N. CARLINI ANDD. WAGNER,Towards evaluating the robustness of neural networks, in 2017 IEEE Symposium on Security and Privacy, IEEE, 2017, pp. 39–57,https://doi.org/10.1109/ SP.2017.49

  15. [15]

    CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct

    G. CHAVENT,Nonlinear least squares for inverse problems, Scientific Computation, Springer, Dor- drecht, Netherlands, 2010 ed., Oct. 2009,https://doi.org/10.1007/978-90-481-2785-6

  16. [16]

    C. CHEN, D. LI, J. YAN,ANDX. YANG,Modeling dynamic user preference via dictionary learn- ing for sequential recommendation, IEEE Transactions on Knowledge and Data Engineering, 34 (2022), pp. 5446–5458,https://doi.org/10.1109/TKDE.2021.3050407

  17. [17]

    G. CHEN, P. XUE, M. GAN, J. CHEN, W. GUO,ANDC. P. CHEN,Variable projection algo- rithms: Theoretical insights and a novel approach for problems with large residual, Automatica, 177 (2025), p. 112300,https://doi.org/10.1016/j.automatica.2025.112300

  18. [18]

    COLSON, P

    B. COLSON, P. MARCOTTE,ANDG. SAVARD,An overview of bilevel optimization, Annals of Operations Research, 153 (2007), pp. 235–256,https://doi.org/10.1007/ s10479-007-0176-2

  19. [19]

    C. W. COMBETTES ANDS. POKUTTA,Complexity of linear minimization and projection on some sets, Operations Research Letters, 49 (2021), pp. 565–571,https://doi.org/10.1016/j.orl. 2021.06.019

  20. [20]

    DEMPE ANDA

    S. DEMPE ANDA. B. ZEMKOHO, eds.,Bilevel Optimization: Advances and Next Challenges, vol. 161 of Springer Optimization and Its Applications, Springer, Cham, 2020,https://doi. org/10.1007/978-3-030-52119-6

  21. [21]

    DONG ANDJ

    S. DONG ANDJ. YANG,Numerical approximation of partial differential equations by a variable projection method with artificial neural networks, Computer Methods in Applied Mechanics and Engineering, 398 (2022), p. 115284,https://doi.org/10.1016/j.cma.2022.115284

  22. [22]

    D. L. DONOHO,Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306,https://doi.org/10.1109/TIT.2006.871582

  23. [23]

    DUCHI, S

    J. DUCHI, S. SHALEV-SHWARTZ, Y. SINGER,ANDT. CHANDRA,Efficient projections onto the ℓ1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 272–279,https://doi.org/10.1145/1390156.1390191

  24. [24]

    DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

    M. DUS,Grassmannian geometry and global convergence of variable projection for neural net- works, 2026,https://arxiv.org/abs/2601.22897

  25. [25]

    M. I. ESPANOL ANDG. JERONIMO,Local convergence analysis of a variable projection method for regularized separable nonlinear inverse problems, SIAM Journal on Matrix Analysis and Ap- plications, 46 (2025), pp. 858–878,https://doi.org/10.1137/24M1639087

  26. [26]

    FRANCESCHI, P

    L. FRANCESCHI, P. FRASCONI, S. SALZO, R. GRAZZI,ANDM. PONTIL,Bilevel programming for hyperparameter optimization and meta-learning, in Proceedings of the 35th International Con- ference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1568–1577

  27. [27]

    FRANK ANDP

    M. FRANK ANDP. WOLFE,An algorithm for quadratic programming, Nav. Res. Logist. Q., 3 (1956), pp. 95–110,https://doi.org/10.1002/nav.3800030109

  28. [28]

    GILLIS,Nonnegative Matrix Factorization, vol

    N. GILLIS,Nonnegative Matrix Factorization, vol. 2 of Data Science, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2020,https://doi.org/10.1137/1.9781611976410

  29. [29]

    GOLUB ANDV

    G. GOLUB ANDV. PEREYRA,Separable nonlinear least squares: the variable projection method and its applications, Inverse problems, 19 (2003), p. R1,https://doi.org/10.1088/ 0266-5611/19/2/201

  30. [30]

    G. H. GOLUB ANDR. LEVEQUE,Extensions and uses of the variable projection algorithm for solving nonlinear least squares problems., in Proceedings of the 1979 Army Numerical Analysis and Computers Conference, 1979

  31. [31]

    G. H. GOLUB ANDV. PEREYRA,The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM Journal on Numerical Analysis, 10 (1973), pp. 413–432,https://doi.org/10.1137/0710036. 18

  32. [32]

    GOULD, B

    S. GOULD, B. FERNANDO, A. CHERIAN, P. ANDERSON, R. SANTACRUZ,ANDE. GUO,On dif- ferentiating parameterized argmin and argmax problems with application to bi-level optimization, arXiv preprint arXiv:1607.05447, (2016),https://arxiv.org/abs/1607.05447

  33. [33]

    Statistical

    T. HASTIE, R. TIBSHIRANI,ANDM. WAINWRIGHT,Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman and Hall/CRC, Boca Raton, FL, 2015,https://doi.org/ 10.1201/b18401

  34. [34]

    K. HE, X. ZHANG, S. REN,ANDJ. SUN,Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, https://doi.org/10.1109/CVPR.2016.90

  35. [35]

    G. E. HINTON ANDR. R. SALAKHUTDINOV,Reducing the dimensionality of data with neural networks, Science, 313 (2006), pp. 504–507,https://doi.org/10.1126/science.1127647

  36. [36]

    HUBEN, H

    R. HUBEN, H. CUNNINGHAM, L. R. SMITH, A. EWART,ANDL. SHARKEY,Sparse autoen- coders find highly interpretable features in language models, in The Twelfth International Confer- ence on Learning Representations, 2024

  37. [37]

    JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S

    M. JAGGI,Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 427–435

  38. [38]

    JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J

    M. JAGGI,An equivalence between the lasso and support vector machines, in Regularization, Op- timization, Kernels, and Support Vector Machines, J. A. K. Suykens, ed., Taylor & Francis, 2014, pp. 1–26,https://doi.org/10.1201/b17558-4

  39. [39]

    KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp

    L. KAUFMAN,A variable projection method for solving separable nonlinear least squares problems, BIT Numerical Mathematics, 15 (1975), pp. 49–57,https://doi.org/10.1007/ BF01932995

  40. [40]

    KRIZHEVSKY ANDG

    A. KRIZHEVSKY ANDG. HINTON,Learning multiple layers of features from tiny images, Tech. Report 0, University of Toronto, Toronto, Ontario, 2009

  41. [41]

    W. H. LAWTON ANDE. A. SYLVESTRE,Elimination of linear parameters in nonlinear regression, Technometrics, 13 (1971), pp. 461–467,https://doi.org/10.2307/1267160

  42. [42]

    LECUN, C

    Y. LECUN, C. CORTES,ANDC. J. BURGES,The mnist database of handwritten digits, (1998)

  43. [43]

    C. H. LIM ANDS. J. WRIGHT,A box-constrained approach for hard permutation problems, in Proceedings of the 33rd International Conference on Machine Learning, vol. 48 of Proceedings of Machine Learning Research, PMLR, 2016, pp. 2454–2463

  44. [44]

    Selective review of offline change point detection methods,

    I. MARKOVSKY,Recent progress on variable projection methods for structured low-rank ap- proximation, Signal Processing, 96 (2014), p. 406–419,https://doi.org/10.1016/j.sigpro. 2013.09.021

  45. [45]

    NEWMAN, J

    E. NEWMAN, J. CHUNG, M. CHUNG,ANDL. RUTHOTTO,slimtrain—a stochastic approximation method for training separable deep neural networks, SIAM Journal on Scientific Computing, 44 (2022), pp. A2322–A2348,https://doi.org/10.1137/21M1452512

  46. [46]

    NEWMAN, L

    E. NEWMAN, L. RUTHOTTO, J. HART,ANDB.VANBLOEMENWAANDERS,Train like a (var)pro: Efficient training of neural networks with variable projection, SIAM Journal on Mathe- matics of Data Science, 3 (2021), pp. 1041–1066,https://doi.org/10.1137/20M1359511

  47. [47]

    PETHICK, W

    T. PETHICK, W. XIE, K. ANTONAKOPOULOS, Z. ZHU, A. SILVETI-FALLS,ANDV. CEVHER, Training deep learning models with norm-constrained LMOs, in Proceedings of the 42nd Interna- tional Conference on Machine Learning, vol. 267 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 49069–49104

  48. [48]

    RECHT, M

    B. RECHT, M. FAZEL,ANDP. A. PARRILO,Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501,https://doi. org/10.1137/070697835

  49. [49]

    RUBINSTEIN, A

    R. RUBINSTEIN, A. M. BRUCKSTEIN,ANDM. ELAD,Dictionaries for sparse representation modeling, Proceedings of the IEEE, 98 (2010), pp. 1045–1057,https://doi.org/10.1109/ JPROC.2010.2040551. 19

  50. [50]

    D. B. C. SALZER, M. I. ESPA ˜NOL,ANDG. JERONIMO,Variable projection methods for solving regularized separable inverse problems with applications to semi-blind image deblurring, 2026, https://arxiv.org/abs/2601.05224

  51. [51]

    Sequential Monte Carlo samplers

    R. TIBSHIRANI,Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58 (1996), pp. 267–288,https://doi.org/10.1111/j. 2517-6161.1996.tb02080.x

  52. [52]

    TSCHANNEN, O

    M. TSCHANNEN, O. BACHEM,ANDM. LUCIC,Recent advances in autoencoder-based represen- tation learning, 2018,https://arxiv.org/abs/1812.05069

  53. [53]

    USEVICH ANDI

    K. USEVICH ANDI. MARKOVSKY,Variable projection for affinely structured low-rank approx- imation in weighted 2-norms, Journal of Computational and Applied Mathematics, 272 (2014), pp. 430–448,https://doi.org/10.1016/j.cam.2013.04.034

  54. [54]

    USEVICH ANDI

    K. USEVICH ANDI. MARKOVSKY,Variable projection methods for approximate (greatest) com- mon divisor computations, Theoretical Computer Science, 681 (2017), pp. 176–198,https: //doi.org/10.1016/j.tcs.2017.03.028. Symbolic Numeric Computation

  55. [55]

    T.VANLEEUWEN ANDA. Y. ARAVKIN,Variable projection for nonsmooth problems, SIAM Journal on Scientific Computing, 43 (2021), pp. S249–S268,https://doi.org/10.1137/ 20M1348650

  56. [56]

    XU, G.-Y

    H.-L. XU, G.-Y. CHEN, S.-Q. CHENG, M. GAN,ANDJ. CHEN,Variable projection algorithms with sparse constraint for separable nonlinear models, Control Theory and Technology, 22 (2024), pp. 135–146,https://doi.org/10.1007/s11768-023-00194-3

  57. [57]

    ZANGRANDO, S

    E. ZANGRANDO, S. VENTURINI, F. RINALDI,ANDF. TUDISCO,dEBORA: Efficient bilevel optimization-based low-rank adaptation, in International Conference on Learning Representations, 2025. 20