pith. sign in

arxiv: 2606.21683 · v1 · pith:ECMTFQ2Tnew · submitted 2026-06-19 · 📊 stat.ML · cs.IT· cs.LG· math.IT

Finite-Sample Performance of Gradient Descent in Logistic Regression with Gaussian Design

Pith reviewed 2026-06-26 12:27 UTC · model grok-4.3

classification 📊 stat.ML cs.ITcs.LGmath.IT
keywords logistic regressiongradient descentfinite-sample analysisGaussian designconvergence ratesparameter estimation
0
0 comments X

The pith

Gradient descent on logistic regression reaches an l2 error of order sqrt of theta-star norm to the fifth times d over n with linear convergence under small stepsize.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies gradient descent applied to the logistic loss for estimating a fixed unknown parameter theta-star from n i.i.d. Gaussian design samples in logistic regression. It proves that with a small constant stepsize and zero initialization the iterates converge linearly to a neighborhood of theta-star whose radius scales as the square root of (theta-star norm to the fifth power times dimension over samples). This supplies explicit non-asymptotic finite-sample rates that improve on prior results lacking such error bounds or showing slower convergence. The proof rests on verifying an approximate invertibility condition for the gradient via uniform deviation bounds and eigenvalue analysis of the population Hessian.

Core claim

Under small O(1) stepsize and zero initialization, gradient descent linearly converges to a small neighborhood of theta* achieving an l2 error of order O(sqrt(||theta*||_2^5 d/n)). A faster local linear convergence to the same statistical error holds under a large Theta(||theta*||) stepsize. The gradient of the logistic loss satisfies an approximate invertibility condition that enables the contraction mapping for both empirical and population iterates.

What carries the argument

Approximate invertibility condition (AIC) on the gradient of the logistic loss, established by uniform control of empirical-population deviations via covering and peeling, followed by contraction of population GD iterates via eigenvalues of population Hessian matrices.

If this is right

  • Gradient descent attains the stated non-asymptotic l2 estimation rate under the given initialization and stepsize.
  • Local linear convergence accelerates when the stepsize is scaled up to Theta of the norm of theta-star.
  • A novel efficient estimator constructed from recent work reaches the sharper rate Theta of sqrt of (norm of theta-star times d over n), showing prior non-asymptotic bounds have suboptimal dependence on the signal strength.
  • In many regimes the tight statistical error rate is therefore Theta of sqrt of (norm of theta-star times d over n).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The improved rate achieved by the novel estimator suggests that a refined stepsize schedule or analysis might allow plain gradient descent to reach the same dependence on theta-star norm.
  • The approximate invertibility technique may extend to other generalized linear models whose link functions produce comparable Hessian eigenvalue behavior.
  • The guarantees are most relevant in regimes where dimension d grows with sample size n but the signal strength remains moderate.

Load-bearing premise

The gradient of the logistic loss satisfies an approximate invertibility condition that supports contraction mapping for the iterates.

What would settle it

An explicit counter-example or simulation in which the empirical gradient deviates from its population counterpart enough to violate the approximate invertibility condition, causing GD iterates to fail to contract linearly into the claimed neighborhood, would falsify the result.

Figures

Figures reproduced from arXiv: 2606.21683 by Arya Mazumdar, Junren Chen.

Figure 1
Figure 1. Figure 1: Estimation and convergence of GD under Gaussian designs. [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparing Algorithm 2 with small stepsize GD. 46 [PITH_FULL_IMAGE:figures/full_fig_p046_2.png] view at source ↗
read the original abstract

We consider the parameter estimation problem in logistic regression with Gaussian design: the estimation of a fixed unknown parameter $\theta^*\in \mathbb{R}^d$ ($\|\theta^*\|_2\ge 1$) from $n$ i.i.d. samples $\{(x_i,y_i)\}_{i=1}^n$, where $x_i\sim N(0,I_d)$ and $y_i|x_i \sim {\rm Bernoulli}(1/(1+\exp(-x_i^\top \theta^*)))$. Our main aim is to characterize the finite-sample estimation performance and convergence behavior of gradient descent (GD) on the maximum likelihood objective (i.e., the logistic loss). Under small $O(1)$ stepsize and $0$ initialization, we show that GD linearly converges to a small neighborhood of $\theta^*$ achieving an $\ell_2$ error of order $O(\sqrt{\|\theta^*\|_2^5d/n})$. This substantially goes beyond existing theoretical results that lack non-asymptotic estimation error rate and exhibit much slower parameter convergence. We also establish a faster local linear convergence to the same statistical error under a large $\Theta(\|\theta^*\|_2)$ stepsize. The main technical component is to show that the gradient of the logistic loss satisfies a certain approximate invertibility condition (AIC). To that end, we uniformly control the deviation of the gradient from its population counterpart by covering and peeling arguments, and then show that the population GD is a contraction by a delicate analysis based on the eigenvalues of population Hessian matrices. Finally, we build upon the recent work Matsumoto and Mazumdar (2025) and devise a novel efficient estimator that attains a sharper rate in high dimensions. This indicates that the existing non-asymptotic guarantees exhibit sub-optimal dependence on $\|\theta^*\|_2$, and that in many regimes $\Theta(\sqrt{\|\theta^*\|_2d/n})$ is the tight estimation error rate. Numerical examples are provided to corroborate our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper analyzes finite-sample behavior of gradient descent (GD) on the logistic loss for parameter estimation in logistic regression with Gaussian design x_i ~ N(0,I_d). With small O(1) stepsize and zero initialization, GD is claimed to converge linearly to a neighborhood of θ* with ℓ2 error O(√(||θ*||_2^5 d/n)). A faster local linear convergence is shown under Θ(||θ*||_2) stepsize. The key technical step is establishing an approximate invertibility condition (AIC) on the gradient via covering/peeling for uniform deviation bounds and eigenvalue analysis of the population Hessian; a sharper estimator attaining O(√(||θ*||_2 d/n)) is constructed by building on Matsumoto and Mazumdar (2025).

Significance. If the non-asymptotic rates and AIC hold, the work supplies explicit finite-sample guarantees for GD in logistic regression that track the dependence on ||θ*||_2 (previously missing in the literature) and indicates near-optimality of the statistical radius in certain regimes. The combination of uniform deviation arguments with contraction mapping via population Hessian eigenvalues is a standard and appropriate technique for this setting.

major comments (3)
  1. [AIC proof (covering/peeling + population contraction)] The central contraction-mapping argument rests on the approximate invertibility condition (AIC). The claimed statistical radius contains an ||θ*||_2^5 factor inside the square root; this indicates that either the uniform deviation bound or the contraction modulus in the AIC proof degrades polynomially with ||θ*||_2. It is therefore necessary to verify explicitly that an O(1) stepsize still yields a contraction factor strictly less than 1 uniformly in the regime ||θ*||_2 ≫ 1 (see the population-Hessian eigenvalue analysis).
  2. [Population Hessian analysis] The eigenvalue analysis of the population Hessian that establishes contractivity of the population GD map must be checked for its dependence on ||θ*||_2. The logistic Hessian involves factors of the form σ'(x^T θ) x x^T whose smallest eigenvalues can deteriorate when ||θ*||_2 grows, potentially violating the uniform contraction needed for the claimed linear convergence rate.
  3. [Local convergence statement] The faster local linear convergence result under the larger Θ(||θ*||_2) stepsize is stated to reach the same statistical radius; the local neighborhood size in which this larger stepsize is valid must be shown to contain the statistical ball of radius O(√(||θ*||_2^5 d/n)) without circularity.
minor comments (2)
  1. [Notation section] Notation for the logistic loss and its gradient should be introduced once and used consistently; the transition between empirical and population quantities is occasionally ambiguous.
  2. [Numerical examples] The numerical experiments section would benefit from explicit reporting of the observed dependence of the error on ||θ*||_2 to corroborate the ||θ*||_2^5 factor.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications from the existing analysis and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [AIC proof (covering/peeling + population contraction)] The central contraction-mapping argument rests on the approximate invertibility condition (AIC). The claimed statistical radius contains an ||θ*||_2^5 factor inside the square root; this indicates that either the uniform deviation bound or the contraction modulus in the AIC proof degrades polynomially with ||θ*||_2. It is therefore necessary to verify explicitly that an O(1) stepsize still yields a contraction factor strictly less than 1 uniformly in the regime ||θ*||_2 ≫ 1 (see the population-Hessian eigenvalue analysis).

    Authors: The polynomial dependence on ||θ*||_2 in the statistical radius arises exclusively from the uniform deviation bounds derived via the covering and peeling arguments (which must control the tails of the logistic gradient under large signal strength). The contraction modulus itself remains strictly less than 1 with an O(1) stepsize, uniformly over ||θ*||_2 ≥ 1. This follows from the population Hessian eigenvalue analysis, which yields a contraction factor of at most 1 − c for an absolute constant c > 0 independent of ||θ*||_2. We will insert an explicit lemma stating this uniform lower bound on the contraction factor to make the separation between deviation and contraction transparent. revision: yes

  2. Referee: [Population Hessian analysis] The eigenvalue analysis of the population Hessian that establishes contractivity of the population GD map must be checked for its dependence on ||θ*||_2. The logistic Hessian involves factors of the form σ'(x^T θ) x x^T whose smallest eigenvalues can deteriorate when ||θ*||_2 grows, potentially violating the uniform contraction needed for the claimed linear convergence rate.

    Authors: Although individual realizations of σ'(x^T θ)xx^T can be small for large ||θ*||_2, the population expectation E[σ'(x^T θ*)xx^T] admits a uniform positive lower bound on its minimal eigenvalue (at least 1/8) for all ||θ*||_2 ≥ 1 under Gaussian design. This is established by reducing to a one-dimensional expectation over the marginal projection and using the fact that the logistic variance function remains sufficiently positive in expectation. The O(1) stepsize is chosen smaller than the reciprocal of the uniform upper bound on the Hessian operator norm, preserving the contraction. We will expand the eigenvalue section with the full uniform bounds and the explicit stepsize restriction. revision: yes

  3. Referee: [Local convergence statement] The faster local linear convergence result under the larger Θ(||θ*||_2) stepsize is stated to reach the same statistical radius; the local neighborhood size in which this larger stepsize is valid must be shown to contain the statistical ball of radius O(√(||θ*||_2^5 d/n)) without circularity.

    Authors: The region of validity for the Θ(||θ*||_2) stepsize is a ball of radius Ω(1/||θ*||_2) around θ*, whose size is independent of n and d. Because the statistical radius O(√(||θ*||_2^5 d/n)) tends to zero as n → ∞ (for fixed d and ||θ*||_2), there exists N0 such that for all n ≥ N0 the statistical ball lies strictly inside the local region. The local contraction is proved on the population map; the finite-sample deviation is controlled separately via the AIC and does not enter the definition of the local neighborhood. We will add an explicit comparison of the two radii in the local-convergence section to eliminate any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

Minor self-citation for secondary sharper estimator; central GD convergence derived independently

full rationale

The paper's core derivation establishes linear convergence of GD to a neighborhood of size O(√(||θ*||₂⁵ d/n)) by proving the approximate invertibility condition (AIC) via uniform deviation bounds (covering/peeling) on the empirical gradient and eigenvalue analysis of the population Hessian to show contraction. This analysis is self-contained and does not reduce to any fitted input, self-definition, or prior self-citation. The only self-citation appears in the final paragraph, where Matsumoto and Mazumdar (2025) is invoked solely to construct an auxiliary efficient estimator with improved rate; this does not support or justify the main finite-sample GD claim. No other patterns (self-definitional, fitted predictions, ansatz smuggling, or renaming) are present. The result is therefore a standard first-principles analysis with one non-load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on the standard Gaussian design assumption and introduces the AIC as a paper-specific technical condition; no data-fitted free parameters are used in the stated rates.

axioms (2)
  • domain assumption Covariates x_i are i.i.d. N(0, I_d)
    Standard modeling assumption for the design matrix in high-dimensional logistic regression.
  • ad hoc to paper Gradient of logistic loss satisfies approximate invertibility condition (AIC)
    Central technical device introduced to prove contraction of both empirical and population GD iterates.

pith-pipeline@v0.9.1-grok · 5902 in / 1399 out tokens · 26811 ms · 2026-06-26T12:27:11.892479+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 1 linked inside Pith

  1. [1]

    Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

    Pedro Abdalla, Radu Balan, and Junren Chen. Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

  2. [2]

    Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

    Felix Abramovich and Vadim Grinshtein. Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

  3. [3]

    Gradient descent converges linearly for logistic regres- sion on separable data

    Kyriakos Axiotis and Maxim Sviridenko. Gradient descent converges linearly for logistic regres- sion on separable data. InInternational Conference on Machine Learning, pages 1302–1319. PMLR, 2023

  4. [4]

    Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

    Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

  5. [5]

    Oxford University Press, February 2013

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, February 2013

  6. [6]

    The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020

    Emmanuel J Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020. 11

  7. [7]

    Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

    Hugo Chardon, Matthieu Lerasle, and Jaouad Mourtada. Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

  8. [8]

    A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

    Junren Chen, Lijun Ding, Dong Xia, and Ming Yuan. A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

  9. [9]

    One-bit phase retrieval: Optimal rates and efficient algorithms

    Junren Chen and Ming Yuan. One-bit phase retrieval: Optimal rates and efficient algorithms. IEEE Transactions on Information Theory, 2026

  10. [10]

    Gradient descent on neural networks typically occurs at the edge of stability

    Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. InInternational Conference on Learning Representations, 2021

  11. [11]

    Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

    Michael P Friedlander, Halyun Jeong, Yaniv Plan, and Özgür Yılmaz. Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

  12. [12]

    Academic press, 2013

    Morris W Hirsch, Stephen Smale, and Robert L Devaney.Differential equations, dynamical systems, and an introduction to chaos. Academic press, 2013

  13. [13]

    On the sample complexity of parameter estimation in logistic regression with normal design

    Daniel Hsu and Arya Mazumdar. On the sample complexity of parameter estimation in logistic regression with normal design. InThe Thirty Seventh Annual Conference on Learning Theory, pages 2418–2437. PMLR, 2024

  14. [14]

    Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

    Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

  15. [15]

    The implicit bias of gradient descent on nonseparable data

    Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on learning theory, pages 1772–1798. PMLR, 2019

  16. [16]

    Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

    Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

  17. [17]

    The isotron algorithm: High-dimensional isotonic regres- sion

    Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regres- sion. InCOLT, volume 1, page 9, 2009

  18. [18]

    Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

    Felix Kuchelmeister and Sara van de Geer. Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

  19. [19]

    Springer

    Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer

  20. [20]

    On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

    Philip M Long. On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

  21. [21]

    Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

    Gábor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021. 12

  22. [22]

    Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

    Namiko Matsumoto and Arya Mazumdar. Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

  23. [23]

    Learning sparse generalized linear models with binary outcomes via iterative hard thresholding

    Namiko Matsumoto and Arya Mazumdar. Learning sparse generalized linear models with binary outcomes via iterative hard thresholding. InProceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 3933–4032. PMLR, 30 Jun–04 Jul 2025

  24. [24]

    Nelder.Generalized linear models

    Peter McCullagh and John A. Nelder.Generalized linear models. Routledge, 2019

  25. [25]

    Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

    Si Yi Meng, Baptiste Goujaud, Antonio Orvieto, and Christopher De Sa. Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

  26. [26]

    Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

    Si Yi Meng, Antonio Orvieto, Daniel Yiming Cao, and Christopher De Sa. Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

  27. [27]

    Convergence of gradient descent on separable data

    Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019

  28. [28]

    Springer, 2018

    Yurii Nesterov et al.Lectures on convex optimization, volume 137. Springer, 2018

  29. [29]

    Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

    Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

  30. [30]

    High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

    YanivPlan, RomanVershynin, andElenaYudovina. High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

  31. [31]

    Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm

    RoccoAServedio. Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm. In Proceedings of the Twelfth Annual Conference on Computational learning theory, pages 296–307, 1999

  32. [32]

    The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

  33. [33]

    A bound for the error in the normal approximation to the distribution of a sum of dependent random variables

    Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. InProceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, volume 6, pages 583–603. University of California Press, 1972

  34. [34]

    Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

    Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

  35. [35]

    From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes

    Alexander Tyurin. From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20938–20946, 2025. 13

  36. [36]

    Cambridge University Press, 2018

    Roman Vershynin.High-dimensional probability: An introduction with applications in data sci- ence, volume 47. Cambridge University Press, 2018

  37. [37]

    Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency

    Jingfeng Wu, Peter L Bartlett, Matus Telgarsky, and Bin Yu. Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5019–5073. PMLR, 2024

  38. [38]

    Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

    Jingfeng Wu, Vladimir Braverman, and Jason D Lee. Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

  39. [39]

    Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

    Jingfeng Wu, Pierre Marion, and Peter Bartlett. Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

  40. [40]

    Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes

    Ruiqi Zhang, Jingfeng Wu, and Peter Bartlett. Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes. InForty-second International Conference on Machine Learning, 2025

  41. [41]

    localization

    Ruiqi Zhang, Jingfeng Wu, Licong Lin, and Peter L Bartlett. Minimax optimal convergence of gradient descent in logistic regression via large and adaptive stepsizes.arXiv preprint arXiv:2504.04105, 2025. A Proof of Theorem 1 (GD with Small Stepsize) The main ingredient is an approximate invertibility condition (AIC) that controls ∥u−θ ∗ −η·h θ∗(u)∥2,∀u∈B d...

  42. [42]

    C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1(ˆv⊤ˆr)− ∥θ∗∥2 ,we shall start with bounding ˆv⊤ˆr−q(∥θ∗∥2)

    In fact, by assuming the bound in Lemma 1 holds (this is within the probability promised in Theorem 3), then we have ∥θ∗∥2 ˆr− θ∗ ∥θ∗∥2 2 ≲polylog(n) r ∥θ∗∥2d n + ∥θ∗∥2d n .(70) In the remainder of this appendix, we focus on bounding the norm estimation error,|ˆλ− ∥θ ∗∥2|. C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1...