stat.ML

Machine Learning

Covers machine learning papers (supervised, unsupervised, and semi-supervised learning, graphical models, reinforcement learning, bandits, high-dimensional inference, etc.) with a statistical or theoretical grounding.

stat.ML 2026-05-20 2 theorems

Contradiction graph decides VC dimension threshold for any m

by Jesse Campbell, Daniel Ibaibarriaga +1 more

Contradiction Graphs Determine VC Dimension

Vertices are realizable labeled sequences of length m; edges mark label disagreements on shared points, which fixes whether the VC dimension is at least m.

We study the contradiction graphs associated with binary concept classes. For a class $H \subseteq \{0,1\}^X$, the order-$m$ contradiction graph $G_m(H)$ has as vertices the $H$-realizable labeled sequences of length $m$, with two vertices adjacent when the two sequences assign opposite labels to some common domain point. Our main result is that the single graph $G_m(H)$ determines the threshold predicate $\mathrm{VCdim}(H)\ge m$. Consequently, the full sequence $(G_m(H))_{m \ge 1}$ determines the exact VC dimension and, in particular, detects finite versus infinite VC dimension, answering a question posed by Alon et al. (2024).
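The definition of $G_m(H)$ above can be made concrete for a small finite class. The following is a minimal sketch, assuming a finite domain and hypotheses stored as Python dicts; the function names and the threshold-class example are illustrative assumptions, not taken from the paper. It builds the graph directly from the definition and computes the VC dimension by brute force for comparison. It does not implement the paper's procedure for reading the threshold off the graph.

```python
import itertools

def realizable_sequences(domain, hypotheses, m):
    """All length-m labeled sequences ((x1, y1), ..., (xm, ym)) consistent with some h."""
    seqs = set()
    for xs in itertools.product(domain, repeat=m):
        for h in hypotheses:
            seqs.add(tuple((x, h[x]) for x in xs))
    return sorted(seqs)

def contradiction_graph(domain, hypotheses, m):
    """Vertices: realizable sequences. Edge: opposite labels on a shared domain point."""
    vertices = realizable_sequences(domain, hypotheses, m)
    edges = set()
    for u, v in itertools.combinations(vertices, 2):
        labels_u = dict(u)  # a realizable sequence labels each point consistently
        if any(x in labels_u and labels_u[x] != y for x, y in v):
            edges.add((u, v))
    return vertices, edges

def vc_dimension(domain, hypotheses):
    """Brute-force VC dimension of a finite class over a finite domain."""
    dim = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h[x] for x in subset) for h in hypotheses}) == 2 ** k
            for subset in itertools.combinations(domain, k)
        )
        if not shattered:
            break  # if no k-set is shattered, no larger set is either
        dim = k
    return dim

# Illustrative class (an assumption): thresholds h_t(x) = 1 if x >= t on {0, 1, 2}.
domain = [0, 1, 2]
thresholds = [{x: int(x >= t) for x in domain} for t in range(4)]

vertices_1, edges_1 = contradiction_graph(domain, thresholds, 1)
vertices_2, edges_2 = contradiction_graph(domain, thresholds, 2)
```

For this threshold class every point can receive either label, so the order-1 graph has six vertices and three edges, one edge per domain point.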

stat.ML 2026-05-19 2 theorems

Learned multipliers achieve optimal $\Theta(s/\sqrt{N})$ rate

by Tung Quoc Le, Anh Tuan Nguyen +1 more

Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming

Averaged stochastic gradient ascent matches the minimax bound for data-driven Lagrangian relaxation in MILPs.

Lagrangian Relaxation (LR) is a powerful technique for solving large-scale Mixed Integer Linear Programming (MILP) problems, particularly those with decomposable structure, such as vehicle routing or unit commitment problems. By relaxing the coupling constraints, LR enables parallel subproblem solving and often yields tighter dual bounds than standard linear programming relaxations, which is crucial for efficient branch-and-bound pruning. While recent empirical work has shown promising results using machine learning to predict the Lagrangian multipliers, a theoretical understanding of such methods is still lacking. In this work, we bridge this gap by analyzing the problem of learning LR through the lens of Data-driven Algorithm Design, i.e., as a statistical learning problem over a distribution of problem instances. Our contributions are as follows. First, we derive a generalization bound of $\mathcal{O}(s^{1.5}/\sqrt{N})$ for the learned multipliers, where $s$ is the number of coupling constraints and $N$ is the sample size. Second, we provide a minimax lower bound of $\Omega(s/\sqrt{N})$, proving that a linear dependence on $s$ is unavoidable. Third, we constructively close this theoretical gap by proving that Stochastic Gradient Ascent (SGA) with averaging achieves the minimax-optimal rate $\Theta(s/\sqrt{N})$. Finally, we extend our framework to the learning-to-warm-start setting, proving that it achieves a fast, minimax-optimal rate of $\Theta(s/N)$ and establishing a theoretical advantage over direct multiplier prediction.
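To illustrate the setting, here is a minimal sketch of learning one shared multiplier vector by averaged stochastic supergradient ascent over sampled instances. The toy instance family (binary variables with only coupling constraints $Ax \le b$), the $1/\sqrt{t}$ step size, and all function names are assumptions made for this example. The abstract does not specify the paper's exact algorithm or problem class.

```python
import itertools
import random

def dual_value(c, A, b, lam):
    """L(lam) = min over x in {0,1}^n of c.x + lam.(A x - b), plus a supergradient A x* - b."""
    n, s = len(c), len(b)
    # Reduced cost of each binary variable under the current multipliers.
    reduced = [c[j] + sum(lam[i] * A[i][j] for i in range(s)) for j in range(n)]
    x = [1 if r < 0 else 0 for r in reduced]  # the relaxed problem separates per variable
    violation = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(s)]
    value = sum(min(r, 0.0) for r in reduced) - sum(lam[i] * b[i] for i in range(s))
    return value, violation

def sample_instance(rng, n=6, s=2):
    """Toy distribution (assumed): negative costs, positive coupling coefficients."""
    c = [-rng.uniform(1.0, 2.0) for _ in range(n)]
    A = [[rng.uniform(0.5, 1.5) for _ in range(n)] for _ in range(s)]
    b = [0.4 * n] * s
    return c, A, b

def averaged_sga(instances, s, step=0.5):
    """Projected stochastic supergradient ascent on lam >= 0, returning the iterate average."""
    lam = [0.0] * s
    avg = [0.0] * s
    for t, (c, A, b) in enumerate(instances, start=1):
        _, g = dual_value(c, A, b, lam)
        eta = step / t ** 0.5
        lam = [max(0.0, lam[i] + eta * g[i]) for i in range(s)]
        avg = [avg[i] + (lam[i] - avg[i]) / t for i in range(s)]  # running mean
    return avg

def brute_force_optimum(c, A, b):
    """Exact MILP optimum by enumeration (only viable for tiny n)."""
    n, best = len(c), float("inf")
    for x in itertools.product((0, 1), repeat=n):
        if all(sum(A[i][j] * x[j] for j in range(n)) <= b[i] for i in range(len(b))):
            best = min(best, sum(cj * xj for cj, xj in zip(c, x)))
    return best

rng = random.Random(0)
s = 2
train = [sample_instance(rng, n=6, s=s) for _ in range(200)]
lam_hat = averaged_sga(train, s)

test_instance = sample_instance(rng, n=6, s=s)
bound, _ = dual_value(*test_instance, lam_hat)
optimum = brute_force_optimum(*test_instance)
```

By weak duality, `bound` never exceeds `optimum` for any nonnegative multipliers. How tight the learned bound is depends on the instance distribution.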

stat.ML 2026-05-19 2 theorems

Ridge regularization distorts feature-learning networks at vanishing strength

by George Whittle, Pranav Vaidhyanathan +3 more

Canonical Regularisation of Wide Feature-Learning Neural Networks

Gradient flow no longer selects the vanishing ridge solution outside the kernel regime, so a function-space energy is used to define geodesic ridge, which generalises ridge to the feature-learning regime.

Wide neural networks in the feature-learning regime drive modern deep learning, yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well studied in kernel-regime networks -- of the infinitely many global minima, gradient flow selects exactly the vanishing ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove that ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with particular damage done to pretrained networks, where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime and, crucially, generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove that the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate for geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.
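For orientation, the kernel-regime statement has a familiar linear-model analogue. This is a standard fact about least squares, given here as an illustration and not as the paper's theorem; the symbols $\Phi$ (feature matrix), $y$ (targets), and $\theta_0$ (initial parameters) are assumptions introduced for this example. Gradient flow on $\tfrac{1}{2}\|\Phi\theta - y\|_2^2$ started at $\theta_0$ converges to the minimiser closest to $\theta_0$, which coincides with the limit of ridge solutions centred at $\theta_0$ as the penalty vanishes:

$$
\theta_\infty \;=\; \theta_0 + \Phi^{+}\,(y - \Phi\theta_0) \;=\; \lim_{\lambda \to 0^{+}} \arg\min_{\theta} \Big( \|\Phi\theta - y\|_2^2 + \lambda\,\|\theta - \theta_0\|_2^2 \Big),
$$

where $\Phi^{+}$ denotes the Moore-Penrose pseudoinverse. The abstract's claim is that this coincidence fails once the features themselves move during training.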

stat.ML 2026-05-13 3 theorems

Gap counting sets critical scale for attention softmax

by Tomohiro Hayase, Ryo Karakida

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

Upper-tail accumulation scale unifies conflicting laws for rescaling inverse temperature with context length n.

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws in the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as instances of different gap-counting functions $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
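The two quantities the abstract relates, a gap count for an attention row and the entropy of its softmax at inverse temperature $\beta$, can be computed directly. The sketch below is a minimal illustration; the function names, the Gaussian score model, and the choice of $\beta$ values are assumptions, and the code does not reproduce the paper's upper-tail accumulation scale, whose precise definition the abstract does not give.

```python
import math
import random

def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores), computed stably."""
    top = max(scores)
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gap_count(scores, gap):
    """Number of competitors within `gap` of the maximum, excluding one maximiser."""
    top = max(scores)
    return sum(1 for s in scores if top - s <= gap) - 1

# Small deterministic row for checking the gap count by hand.
row = [0.0, -1.0, -1.0, -3.0]
near = gap_count(row, 1.0)   # the two scores at -1.0 lie within gap 1.0
none = gap_count(row, 0.5)   # nothing lies within gap 0.5 of the maximum

# Assumed score model: i.i.d. standard Gaussian scores for a context of length n.
rng = random.Random(0)
n = 1024
scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
betas = [0.0, 0.5 * math.sqrt(math.log(n)), math.log(n), math.log(n) ** 2]
entropies = [softmax_entropy(scores, beta) for beta in betas]
```

At $\beta = 0$ the entropy equals $\log n$, and it can only decrease as $\beta$ grows. Which scaling of $\beta$ with $n$ separates the two regimes depends on how many scores crowd near the maximum, which is what the gap count measures.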

arXiv map

group Statistics
archive stat · Statistics
parent stat

how this category works

The first three slots go to the most-Pithed papers in stat.ML.
The remaining feed is chronological so new papers are easy to find.
Each page shows 30 papers; the archive carries the deep tail.
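
The ordering rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the field names `title`, `date`, and `pith_count` are hypothetical, and "chronological" is taken to mean newest first. The site's actual implementation is not described here.

```python
from datetime import date, timedelta

PAGE_SIZE = 30     # each page shows 30 papers
PINNED_SLOTS = 3   # the first three slots go to the most-Pithed papers

def order_feed(papers):
    """Pin the top-scored papers, then list the rest newest first (assumed direction)."""
    by_score = sorted(papers, key=lambda p: p["pith_count"], reverse=True)
    pinned = by_score[:PINNED_SLOTS]
    pinned_ids = {id(p) for p in pinned}
    rest = sorted(
        (p for p in papers if id(p) not in pinned_ids),
        key=lambda p: p["date"],
        reverse=True,
    )
    return pinned + rest

def page(feed, number):
    """Return the 1-indexed page `number` of the ordered feed."""
    start = (number - 1) * PAGE_SIZE
    return feed[start:start + PAGE_SIZE]

# Hypothetical data: 35 papers on consecutive days, three with nonzero scores.
scores = {2: 50, 5: 40, 9: 30}
papers = [
    {"title": f"p{i}", "date": date(2026, 1, 1) + timedelta(days=i),
     "pith_count": scores.get(i, 0)}
    for i in range(35)
]
feed = order_feed(papers)
first_page = page(feed, 1)
second_page = page(feed, 2)
```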