A Scale-Shape Dual Newton Method for Entropic Least Squares
Pith reviewed 2026-05-07 08:36 UTC · model grok-4.3
The pith
A scale-shape decomposition yields a dual Newton method for entropy-regularized nonnegative least squares that converges globally linearly with O(log ε^{-1}) iterations and locally quadratically while avoiding overflow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We give a damped inexact Newton method for entropy-regularized least-squares on the nonnegative orthant that converges globally at a linear rate with O(log ε^{-1}) iteration complexity, locally at a superlinear-to-quadratic rate, and is immune to the finite-precision overflow that limits classical dual solvers. A scale-shape decomposition of the primal -- separating its scale from its direction -- produces a dual with a nonsingular Jacobian. Objectives and Jacobians are evaluated through stable log-sum-exp and softmax primitives. Lambert W bounds on the scale uniformly control the Jacobian's spectrum, from which both rates follow. The solution map is jointly Lipschitz in the data, regular 0
What carries the argument
The scale-shape decomposition of the primal variable into a scalar scale and a shape vector on the unit simplex, which produces a dual problem whose Jacobian is nonsingular and has its spectrum uniformly bounded via Lambert W function estimates on the scale.
Load-bearing premise
The decomposition into scale and shape produces a dual Jacobian whose eigenvalues stay bounded away from zero and infinity by constants derived from the Lambert W function applied to the scale.
What would settle it
An instance of the entropic least-squares problem in which the Newton steps fail to achieve the predicted linear global convergence rate, or in which the Jacobian spectrum violates the Lambert W bounds, or where overflow occurs despite the use of stable log-sum-exp and softmax primitives.
Figures
read the original abstract
We give a damped inexact Newton method for entropy-regularized least-squares on the nonnegative orthant that converges globally at a linear rate with $O(\log\epsilon^{-1})$ iteration complexity, locally at a superlinear-to-quadratic rate, and is immune to the finite-precision overflow that limits classical dual solvers. A scale-shape decomposition of the primal -- separating its scale from its direction -- produces a dual with a nonsingular Jacobian. Objectives and Jacobians are evaluated through stable log-sum-exp and softmax primitives. Lambert W bounds on the scale uniformly control the Jacobian's spectrum, from which both rates follow. The solution map is jointly Lipschitz in the data, regularization parameter, and reference measure, and extends continuously to the vanishing-regularization limit. Experiments on a problem from analytic continuation of quantum Monte Carlo data confirm the predicted overflow resilience and convergence behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a damped inexact Newton method for entropy-regularized least-squares problems on the nonnegative orthant. A scale-shape decomposition of the primal variable yields a dual formulation whose Jacobian is nonsingular, with its spectrum controlled uniformly by Lambert W bounds applied to the scale component. This structure is used to prove global linear convergence with O(log ε^{-1}) iteration complexity and local superlinear-to-quadratic convergence. The algorithm employs stable log-sum-exp and softmax primitives to achieve overflow immunity in finite precision. The solution map is shown to be jointly Lipschitz continuous in the data, regularization parameter, and reference measure, with continuous extension to the vanishing-regularization limit. Experiments on analytic continuation of quantum Monte Carlo data are presented to confirm the predicted stability and convergence behavior.
Significance. If the claimed rates and stability properties hold, the work would provide a theoretically grounded and numerically robust solver for entropic least-squares problems that arise in machine learning, statistics, and physics applications. The scale-shape decomposition combined with Lambert W spectral control is a novel analytical device that could extend to other entropy-regularized convex programs. The Lipschitz continuity result for the solution map is a strong property that supports sensitivity analysis and continuation methods. The experimental validation on a concrete scientific computing task adds practical weight to the theoretical claims.
major comments (2)
- [§4.2, Theorem 4.1] §4.2, Theorem 4.1: The global linear convergence with O(log ε^{-1}) complexity rests on the claim that Lambert W bounds on the scale alone uniformly control the spectrum of the dual Jacobian for arbitrary reference measures μ. The derivation does not explicitly bound the contribution of the shape component and the softmax-weighted terms involving the data matrix A and nonuniform μ; if these terms can produce eigenvalues outside the Lambert W envelope, both the global rate and the local superlinear rate would fail to hold in general.
- [§3.1, Eq. (8)] §3.1, Eq. (8): The nonsingularity of the dual Jacobian is asserted after the scale-shape split, but the argument that the resulting matrix remains well-conditioned independently of μ's heterogeneity is not accompanied by an explicit eigenvalue estimate or condition-number bound that excludes dependence on the shape variables. This is load-bearing for the choice of damping parameter and the inexact Newton step acceptance criterion.
minor comments (3)
- [Abstract] Abstract: The local convergence statement 'superlinear-to-quadratic' is imprecise; the manuscript should state the precise conditions under which quadratic convergence is recovered (e.g., exact Newton steps near the solution).
- [§5, Figure 1] §5, Figure 1: The convergence plots do not overlay the theoretical O(log ε^{-1}) reference line, making it impossible to visually confirm that the observed iteration counts match the predicted complexity.
- [References] References: Several relevant works on dual Newton methods for entropic problems and on the use of Lambert W functions in optimization are missing; adding them would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying points where the convergence analysis would benefit from greater explicitness. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.
read point-by-point responses
-
Referee: [§4.2, Theorem 4.1] §4.2, Theorem 4.1: The global linear convergence with O(log ε^{-1}) complexity rests on the claim that Lambert W bounds on the scale alone uniformly control the spectrum of the dual Jacobian for arbitrary reference measures μ. The derivation does not explicitly bound the contribution of the shape component and the softmax-weighted terms involving the data matrix A and nonuniform μ; if these terms can produce eigenvalues outside the Lambert W envelope, both the global rate and the local superlinear rate would fail to hold in general.
Authors: The scale-shape decomposition writes the primal variable as the product of a scalar scale s and a normalized shape vector u lying in the probability simplex. This normalization implies that all softmax-weighted inner products with A and μ are convex combinations whose spectral radii are bounded by constants independent of the particular u and of the heterogeneity of μ. Consequently the eigenvalues of the dual Jacobian remain sandwiched between two functions of the Lambert W value at s alone. The argument in the proof of Theorem 4.1 relies on this fact, but the bounding of the shape-induced perturbation is indeed only implicit. We will insert a short auxiliary lemma in §4.2 that explicitly derives the eigenvalue enclosure, confirming that no eigenvalue can escape the Lambert W envelope. The same lemma will also secure the local superlinear rate. revision: yes
-
Referee: [§3.1, Eq. (8)] §3.1, Eq. (8): The nonsingularity of the dual Jacobian is asserted after the scale-shape split, but the argument that the resulting matrix remains well-conditioned independently of μ's heterogeneity is not accompanied by an explicit eigenvalue estimate or condition-number bound that excludes dependence on the shape variables. This is load-bearing for the choice of damping parameter and the inexact Newton step acceptance criterion.
Authors: We agree that an explicit condition-number bound would make the justification of the damping parameter and the inexact-step tolerance fully transparent. After the scale-shape factorization the Jacobian is positive definite; its smallest and largest eigenvalues are controlled solely by the scale variable through the same Lambert W estimates, because the shape normalization prevents the heterogeneity of μ from inflating the condition number. We will add a brief remark immediately after Eq. (8) that states the resulting condition-number bound in terms of s only. This will directly support the algorithmic parameter choices used in the damped inexact Newton iteration. revision: yes
Circularity Check
No circularity; derivation proceeds from scale-shape split and Lambert W spectral bounds
full rationale
The paper derives the Newton method, global linear rate, and local superlinear rate directly from the scale-shape decomposition of the primal (producing a nonsingular dual Jacobian) together with explicit Lambert W bounds on the scale variable that control the spectrum. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional renaming; the Jacobian nonsingularity and rate claims follow from the stated bounds and stable log-sum-exp/softmax primitives without circular reduction. The solution-map Lipschitz property is likewise obtained from the same decomposition. This is the normal case of an independent algorithmic derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction.Recovering a nonnegative density, spectrum, or intensity from noisy samples of a linear smoothing map is a recurring problem across applied math- ematics. Instances include analytic continuation of quantum Monte Carlo data for warm dense matter [ 11, 12, 17], maximum-entropy image reconstruction in astron- omy [20, 35], dual methods for entro...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.19889724 2026
-
[2]
The present method instead treats theunknownscaleτas an independent dual variable
controls exponential growth by solving a log-domain dual with L-BFGS [ 26]. The present method instead treats theunknownscaleτas an independent dual variable. For unknown scale, Skilling and Bryan[35], and Bryan [9] propose Newton methods that require a well-conditioned surrogate for the forward operator. The Jacobian analysis of section 3 removes this re...
-
[3]
Dual under scale-shape separation.We now formalize the scale-shape decomposition outlined in subsection 1.2. Writing x = τ p with τ = ⟨1, x⟩ and p∈ ∆n, the reformulated problem becomes (2.1) min x∈Rn + ψp(x) = min τ≥0 min x∈τ∆ n ϕp(x, τ), where the primal objective is (2.2)ϕ p(x, τ) := 1 2λ ∥Ax−b∥ 2 +⟨c, x⟩+ (g q +δ τ∆ n)(x). Since gq is strictly convex a...
-
[4]
This section develops a damped inexact Newton iteration for that equation
Damped inexact Newton on the scale-shape dual.Section 2 reduced the primal problem to solving F (z) = 0, where z = (y, τ) ∈R m ×R ++ bundles the dual and scale variables. This section develops a damped inexact Newton iteration for that equation. The analysis rests on three lemmas. First, the merit function ρ = ∥F∥ has compact sublevel sets whose τ-project...
-
[5]
These are developed individually in Lemmas 4.1 to 4.5
Bounds.The iteration complexity and the asymptotic rates of convergence of Algorithm 3.1 depend on a collection of bounds for F , DF , the search directions dk, and the step-sizeα k. These are developed individually in Lemmas 4.1 to 4.5. Standing setup.Throughout this section, fix β > 0, ¯η∈[0, 1), and η∈ [0,¯η]; let τβmax, τβmin, and yβmax be as in Lemma...
-
[6]
Hence (4.12) and (4.13) follow
-
[7]
Convergence.This section establishes the main convergence result for Algo- rithm 3.1. Theorem 5.1 describes (1) the global linear convergence of the merit ρ at an explicit rate ˆν <1; (2) the iteration complexity needed to reduce ρ(zk) or ∥zk+1 −z ∗∥ below ε; and (3) superlinear (resp. (1 + p)-superlinear) local rates under standard forcing-sequence condi...
-
[8]
Perturbation analysis.With the prior parameterized by r := logq , the solu- tion map (b, λ, r) 7→ (y∗, τ ∗, x∗)(b, λ, r) is jointly C∞ on Rm ×R ++ ×R n (Lemma 6.1), locally Lipschitz with computable constants (Theorem 6.2), and continuous as λ↓ 0 (Theorem 6.3). Notation.Write z∗ = ( y∗, τ ∗), p∗ := p(y∗), S∗ := S(y∗), and eH ∗ := eH(z∗) = λI + τ ∗ADiag (p...
-
[9]
Numerical experiments.Our test problem is a uniform-electron-gas (UEG) analytic-continuation instance from Chuna et al. [11]. The task is to recover a nonnegative spectral density x∗ from noisy imaginary-time observations b≈Ax ∗ by solving (1.1). The forward operator is a discretized periodic Laplace transform, Aij =e −tiωj +e −(β−ti)ωj , with inverse tem...
-
[10]
The merit tolerance and iteration cap are (ε, K max) = (10−8,300) throughout
The level-set parameter is β = 1.5 ρ(z0), and the safeguard ˆτβmin uses the closed-form bound λW (ζ)/A2 max from Lemma 3.1ii, floored at 10 −16. The merit tolerance and iteration cap are (ε, K max) = (10−8,300) throughout. 7.1. Overflow resilience.Any first- or second-order method applied to the unnormalized dual (1.3) must evaluate exp(A⊺y−c ), which can...
work page 2025
-
[11]
H. H. Bauschke and P. L. Combettes,Convex Analysis and Monotone Oper- ator Theory in Hilbert Spaces, CMS Books in Mathematics, Springer International Publishing, Cham, 2017
work page 2017
-
[12]
A. Beck,First-Order Methods in Optimization, SIAM-Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2017
work page 2017
-
[13]
A. Ben-Tal, M. Teboulle, and A. Charnes,The role of duality in opti- mization problems involving entropy functionals with applications to information theory, J. Optim. Theory Appl., 58 (1988), pp. 209–223
work page 1988
-
[14]
P. Blanchard, D. J. Higham, and N. J. Higham,Accurately computing the log-sum-exp and softmax functions, IMA J. Numer. Anal., 41 (2021), pp. 2311– 2330
work page 2021
-
[15]
J. Borwein and A. S. Lewis,Convergence of Best Entropy Estimates, SIAM J. Optim., 1 (1991), pp. 191–205
work page 1991
-
[16]
J. M. Borwein and A. S. Lewis,Partially finite convex programming, part i: Quasi relative interiors and duality theory, Math. Program., 57 (1992), pp. 15–48
work page 1992
-
[17]
J. M. Borwein and A. S. Lewis,Convex Analysis and Nonlinear Optimization: Theory and Examples, vol. 3, Springer Verlag, New York, 2nd ed., 2006
work page 2006
-
[18]
S. Boyd and L. Vandenberghe,Convex optimization, Cambridge University Press, 2004
work page 2004
-
[19]
R. K. Bryan,Maximum entropy analysis of oversampled data problems, Eur. Biophys. J., 18 (1990), pp. 165–174
work page 1990
-
[20]
J. V. Burke,Second order necessary and sufficient conditions for convex com- posite NDO, Math. Program., 38 (1987), pp. 287–302
work page 1987
- [21]
- [22]
-
[23]
R. Corless, G. Gonnet, D. Hare, D. Jeffrey, and D. Knuth,On the Lambert w function, Adv. Comput. Math., 5 (1996), pp. 329–359
work page 1996
-
[24]
T. M. Cover and J. A. Thomas,Elements of Information Theory, Wiley- Interscience, 2nd ed., 2006
work page 2006
-
[25]
M. Cuturi and G. Peyr ´e,Semi-dual Regularized Optimal Transport, SIAM Rev., 60 (2018), pp. 941–965
work page 2018
-
[26]
A. Decarreau, D. Hilhorst, C. Lemar´echal, and J. Navaza,Dual methods in entropy maximization. Application to some problems in crystallography, SIAM J. Optim., 2 (1992), pp. 173–197
work page 1992
-
[27]
T. Dornheim, M. B ¨ohme, D. Chapman, D. Kraus, T. Preston, Z. Mold- abekov, N. Schl ¨unzen, A. Cangi, T. D ¨oppner, and J. Vorberger, Imaginary-time correlation function thermometry: A new, high-accuracy and model-free temperature analysis technique for x-ray Thomson scattering data, Phys. Plasmas, 30 (2023), p. 042707
work page 2023
-
[28]
H. W. Engl, M. Hanke, and A. Neubauer,Regularization of Inverse Problems, vol. 375 of Mathematics and Its Applications, Kluwer Academic Publishers, Dordrecht, 1996
work page 1996
-
[29]
J. Eriksson,A note on solution of large sparse maximum entropy problems with linear equality constraints, Math. Program., 18 (1980), pp. 146–154. 28N. BARNFIELD, J. V. BURKE, M. P. FRIEDLANDER, AND T. HOHEISEL
work page 1980
-
[30]
Event Horizon Telescope Collaboration,First M87 event horizon telescope results. IV. Imaging the central supermassive black hole, The Astrophysical Journal Letters, 875 (2019), p. L4
work page 2019
-
[31]
G. Giuliani and G. Vignale,Quantum theory of the electron liquid, Cambridge university press, 2008
work page 2008
-
[32]
E. V. Haynsworth,Determination of the inertia of a partitioned hermitian matrix, Linear Algebra Appl., 1 (1968), pp. 73–81
work page 1968
-
[33]
R. A. Horn and C. R. Johnson,Matrix Analysis, Cambridge University Press, New York, NY, second edition, corrected reprint ed., 2017
work page 2017
-
[34]
E. T. Jaynes,Information Theory and Statistical Mechanics, Phys. Rev., 106 (1957), pp. 620–630
work page 1957
-
[35]
C. T. Kelley,Iterative methods for linear and nonlinear equations, SIAM, 1995
work page 1995
-
[36]
D. C. Liu and J. Nocedal,On the limited memory BFGS method for large scale optimization, Math. Program., 45 (1989), pp. 503–528
work page 1989
- [37]
-
[38]
Y. Malitsky and K. Mishchenko,Adaptive Gradient Descent without Descent, Aug. 2020
work page 2020
-
[39]
Nesterov,Lectures on Convex Optimization, vol
Y. Nesterov,Lectures on Convex Optimization, vol. 137 of Springer Optimiza- tion and Its Applications, Springer International Publishing, Cham, 2018
work page 2018
-
[40]
Ortega and Rheinbolt,Iterative Solutions of Nonlinear Equations in Several Variables, Academic Press, Series on Computer Science and Applied Mathematics, 1970
work page 1970
-
[41]
R. Rockafellar and R. J.-B. Wets,Variational Analysis, Springer Verlag, Heidelberg, Berlin, New York, 1998
work page 1998
-
[42]
R. T. Rockafellar,Convex analysis, Princeton Mathematical Series, Princeton University Press, Princeton, N. J., 1970
work page 1970
-
[43]
R. T. Rockafellar and L. A. Dontchev,Implicit Functions and Solution Mappings: A view from variational analysis, Springer Monographs in Mathematics, Springer, 2014
work page 2014
-
[44]
J. Shore and R. Johnson,Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy, IEEE Trans. Inform. Theory, 26 (1980), pp. 26–37
work page 1980
-
[45]
J. Skilling and R. K. Bryan,Maximum entropy image reconstruction: General algorithm, Mon. Not. R. Astron. Soc., 211 (1984), pp. 111–124
work page 1984
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.