Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

David Duvenaud; Jon Lorraine; Matthew MacKay; Paul Vicol; Roger Grosse

arxiv 1903.03088 v1 pith:DY7MWLHE submitted 2019-03-07 cs.LG stat.ML

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse

classification cs.LG stat.ML

keywords hyperparameters, best-response, hyperparameter, networks, optimization, training, approach, approximations

abstract

Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).
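The alternating scheme described in the abstract can be illustrated on a scalar toy problem. The sketch below is not the paper's architecture or code: the losses, the linear best-response approximation `w_hat(lam) = phi0 + phi1 * lam`, the fixed set of perturbations around the current hyperparameter, the warm-up phase, and all names and step sizes are assumptions chosen so the example stays small and deterministic. The training loss is `(w - a)^2 + exp(lam) * w^2`, whose exact best response is `a / (1 + exp(lam))`, and the validation loss is `(w - b)^2`. With `a = 2` and `b = 1`, the validation-optimal hyperparameter is `lam = 0`.

```python
import math


def inner_step(phi0, phi1, lam, a, offsets, lr):
    """One gradient step fitting w_hat(x) = phi0 + phi1 * x to the training
    loss at hyperparameters perturbed around the current lam."""
    g0 = g1 = 0.0
    for eps in offsets:
        x = lam + eps                       # perturbed hyperparameter
        w = phi0 + phi1 * x                 # approximate best response at x
        # d/dw of (w - a)^2 + exp(x) * w^2; no derivative w.r.t. x is taken
        dw = 2.0 * (w - a) + 2.0 * math.exp(x) * w
        g0 += dw / len(offsets)
        g1 += dw * x / len(offsets)
    return phi0 - lr * g0, phi1 - lr * g1


def fit_self_tuning_toy(a=2.0, b=1.0, lam=-1.0, sigma=0.5,
                        inner_lr=0.05, outer_lr=0.1,
                        warmup_steps=500, outer_steps=300, inner_steps=20):
    """Alternate between fitting the local best-response approximation and
    updating lam on the validation loss (toy values, illustrative only)."""
    phi0, phi1 = 0.0, 0.0
    # Assumption: a fixed symmetric set of perturbations instead of sampling.
    offsets = (-sigma, 0.0, sigma)

    # Warm-up: fit the approximation before touching the hyperparameter.
    for _ in range(warmup_steps):
        phi0, phi1 = inner_step(phi0, phi1, lam, a, offsets, inner_lr)

    for _ in range(outer_steps):
        # (1) Approximate the best response around the current lam.
        for _ in range(inner_steps):
            phi0, phi1 = inner_step(phi0, phi1, lam, a, offsets, inner_lr)
        # (2) Update lam by descending the validation loss through w_hat:
        #     d/dlam (w_hat(lam) - b)^2 = 2 * (w_hat(lam) - b) * phi1
        w_hat = phi0 + phi1 * lam
        lam -= outer_lr * 2.0 * (w_hat - b) * phi1

    return lam, phi0 + phi1 * lam


if __name__ == "__main__":
    lam, w_hat = fit_self_tuning_toy()
    print(f"lam = {lam:.4f}, w_hat(lam) = {w_hat:.4f}")
```

Under these assumptions the hyperparameter update only ever differentiates the validation loss through the approximate best response, which mirrors the abstract's point that the training loss need not be differentiated with respect to the hyperparameters. The returned `lam` should approach the analytic optimum of 0, with `w_hat(lam)` approaching `b`.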

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stealthy World Model Manipulation via Data Poisoning
cs.LG 2026-06 unverdicted novelty 7.0

SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning...

On Constraint Qualifications for MPECs with Applications to Bilevel Hyperparameter Optimization for Machine Learning
math.OC 2025-08 unverdicted novelty 5.0

Clarifies relationships among MPEC constraint qualifications and fully characterizes MPEC-LICQ for the MPEC from bilevel hyperparameter optimization in L1-loss SVM classification.

Bilevel Optimization for Neural Architecture Search
cs.LG 2026-06 unverdicted novelty 3.0

Reviews NAS methods through a bilevel optimization lens, categorizing them as sampling-based or theory-based, and proposes an auxiliary mathematical programming framework for more principled architecture and weight updates.