and Telgarsky, M

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
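The directional convergence described in the abstract can be observed on a toy problem. The sketch below is a minimal illustration, not the paper's experiment: the three data points, the step size, and the iteration count are all hypothetical choices. The data are fully linearly separable, so only the convergence in direction is shown; the offset from the non-separable remainder does not arise here. Each row of `Z` is a label-weighted example y_i * x_i, and for this data the maximum margin direction is (1, 0), with the first two points as support vectors.

```python
import math

# Hypothetical toy data: each row is y_i * x_i (label folded into the features).
# All three points are separable; the max-margin unit direction is (1, 0).
Z = [(1.0, 1.0), (1.0, -1.0), (3.0, 2.0)]
MAX_MARGIN_DIR = (1.0, 0.0)


def risk_gradient(w):
    """Gradient of the mean logistic loss ln(1 + exp(-<w, z>)) over Z."""
    g = [0.0, 0.0]
    for z in Z:
        margin = w[0] * z[0] + w[1] * z[1]
        # Derivative of ln(1 + exp(-m)) with respect to m is -1 / (1 + exp(m)).
        coeff = -1.0 / (1.0 + math.exp(margin))
        g[0] += coeff * z[0] / len(Z)
        g[1] += coeff * z[1] / len(Z)
    return g


def run_gd(steps, lr=0.1):
    """Run plain gradient descent from the origin; record norm and alignment."""
    w = [0.0, 0.0]
    history = []
    for t in range(1, steps + 1):
        g = risk_gradient(w)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
        norm = math.hypot(w[0], w[1])
        cosine = (w[0] * MAX_MARGIN_DIR[0] + w[1] * MAX_MARGIN_DIR[1]) / norm
        history.append((t, norm, cosine))
    return w, history


w_final, history = run_gd(20000)
for t, norm, cosine in history:
    if t in (1, 10, 100, 1000, 10000, 20000):
        print(f"t={t:6d}  |w|={norm:8.4f}  cos(w, max-margin)={cosine:.6f}")
```

On this data the norm of the iterate keeps growing while its cosine with (1, 0) approaches 1, which is the qualitative behavior the abstract describes for the separable part of the data. The symmetric support vectors make this example converge in direction faster than the general rate quoted above, so it should not be read as a check of that rate.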

fields

cs.LG 2

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Efficient Logistic Regression with Mixture of Sigmoids

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

EW with Gaussian prior matches the optimal O(d log(Bn)) regret for online logistic regression at O(B^3 n^5) cost and converges geometrically to a truncated Gaussian vote in the large-B separable regime.

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

Mini-batch noise reverses how Adam's β2 controls anti-regularization, making default momentum values suitable for small batches but requiring β1 closer to β2 for large batches to favor flatter minima.

citing papers explorer

  • Efficient Logistic Regression with Mixture of Sigmoids cs.LG · 2026-04-03 · unverdicted · none · ref 27

  • The Effect of Mini-Batch Noise on the Implicit Bias of Adam cs.LG · 2026-02-02 · unverdicted · none · ref 27 · internal anchor