AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

Fanghui Liu; Jason D. Lee; Yuanhe Zhang

arxiv: 2602.02285 · v2 · pith:GZT4H2SNnew · submitted 2026-02-02 · 💻 cs.LG · cs.CL· math.ST· stat.TH

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

Yuanhe Zhang , Jason D. Lee , Fanghui Liu This is my paper

classification 💻 cs.LG cs.CLmath.STstat.TH

keywords theoryleanformallearningempiricalformalizationmissingprocess

0 comments

read the original abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our en-to-end formal infrastructure implement the missing contents in latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is provided in https://github.com/YuanheZ/lean-stat-learning-theory.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity
cs.LG 2026-04 accept novelty 8.0 full

Proves an impossibility theorem that no feature attribution ranking can be faithful, stable, and complete under collinearity, characterizes the design space as two families, introduces the DASH ensemble method, and fo...