PyLO: Towards Accessible Learned Optimizers in PyTorch

Abhinav Moudgil; Benjamin Therien; Eugene Belilovsky; Paul Janson; Quentin Anthony; Xiaolong Huang

arXiv: 2506.10315 · v3 · submitted 2025-06-12 · cs.LG

Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky

Pith reviewed 2026-05-19 09:26 UTC · model grok-4.3

classification: cs.LG

keywords: learned optimizers · PyTorch · CUDA acceleration · VeLO · pre-training · optimization library · ViT training

The pith

PyLO makes learned optimizers like VeLO available in PyTorch, with CUDA kernels that raise ViT-B/16 training throughput to over 190 samples per second.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PyLO as a PyTorch library that exposes learned optimizers through the standard torch.optim.Optimizer interface to reach users who work outside JAX. It prioritizes real-world large-scale pre-training applications over small academic benchmarks. CUDA-accelerated versions of small fc lopt and VeLO deliver training throughput increases from roughly 40-50 to 190-205 samples per second on a ViT-B/16 model at batch size 32. The implementations remain compatible with common additions such as learning-rate schedules and weight decay, and the paper reports that these additions further improve the learned optimizers.

Core claim

PyLO supplies CUDA-accelerated implementations of small fc lopt and VeLO that integrate directly into PyTorch training loops, raising throughput on ViT-B/16 (batch size 32) from 39.36 and 49.73 to 205.59 and 191.18 samples per second while supporting combination with learning-rate schedules and weight decay.
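
The throughput figures above can be turned into speedup factors and per-step times with a few lines of arithmetic. The sketch below uses only the four numbers and the batch size quoted in the core claim; the variable and function names are illustrative, and the per-step time covers the whole training step (forward, backward, and optimizer update), not the optimizer alone.

```python
# Throughput figures (samples per second) quoted in the core claim.
BATCH_SIZE = 32
throughput = {
    "small fc lopt": {"before": 39.36, "after": 205.59},
    "VeLO": {"before": 49.73, "after": 191.18},
}


def speedup(before, after):
    """Ratio of accelerated throughput to the original throughput."""
    return after / before


def seconds_per_step(samples_per_second, batch_size=BATCH_SIZE):
    """Wall-clock time of one full training step at the given throughput."""
    return batch_size / samples_per_second


summary = {}
for name, t in throughput.items():
    summary[name] = {
        "speedup": round(speedup(t["before"], t["after"]), 2),
        "step_s_before": round(seconds_per_step(t["before"]), 3),
        "step_s_after": round(seconds_per_step(t["after"]), 3),
    }
    print(name, summary[name])
```

This works out to roughly a 5.2x gain for small fc lopt and a 3.8x gain for VeLO, with the time per step at batch size 32 falling from about 0.81 s to 0.16 s and from about 0.64 s to 0.17 s respectively.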

What carries the argument

CUDA-accelerated learned optimizer modules exposed via the torch.optim.Optimizer interface.

If this is right

Learned optimizers become practical drop-in options for PyTorch users working on large-scale pre-training.
Combining learned optimizers with learning-rate schedules and weight decay produces better results than using the learned optimizer alone.
The same interface allows researchers to test learned optimizers on their own models without rewriting training code.
Accessibility removes the JAX barrier that previously limited VeLO to a small subset of the community.
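
To make the drop-in and compatibility points above concrete, here is a minimal sketch of what "exposed through torch.optim.Optimizer" implies in practice. It does not use PyLO's API, which this page does not document: `ToyLearnedOptimizer` is a hypothetical stand-in whose tiny per-parameter MLP has random, untrained weights, so it illustrates the interface only and will not train well. The point is that anything subclassing `torch.optim.Optimizer` can be paired with a standard learning-rate scheduler and decoupled weight decay without changes to the training loop.

```python
import torch
from torch import nn


class ToyLearnedOptimizer(torch.optim.Optimizer):
    """Hypothetical stand-in for a learned optimizer (NOT PyLO's implementation)."""

    def __init__(self, params, lr=1e-2, weight_decay=0.0, beta=0.9, hidden=8):
        defaults = dict(lr=lr, weight_decay=weight_decay, beta=beta)
        super().__init__(params, defaults)
        # A real learned optimizer would load meta-trained weights here.
        # These are random and exist purely to show the data flow.
        gen = torch.Generator().manual_seed(0)
        self.w1 = torch.randn(2, hidden, generator=gen) * 0.1
        self.w2 = torch.randn(hidden, 1, generator=gen) * 0.1

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                m.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                # Per-parameter features: one row per scalar weight.
                feats = torch.stack([p.grad.flatten(), m.flatten()], dim=1)
                direction = (torch.tanh(feats @ self.w1) @ self.w2).view_as(p)
                # Decoupled weight decay, then the MLP-predicted update.
                p.mul_(1 - group["lr"] * group["weight_decay"])
                p.add_(direction, alpha=-group["lr"])
        return loss


torch.manual_seed(0)
model = nn.Linear(4, 1)
before = [p.detach().clone() for p in model.parameters()]

opt = ToyLearnedOptimizer(model.parameters(), lr=1e-2, weight_decay=1e-2)
# Any stock scheduler accepts the optimizer because it subclasses Optimizer.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)

x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    sched.step()

print("final lr:", opt.param_groups[0]["lr"])
```

The training loop is identical to one written for Adam or SGD; only the constructor line changes. That is the sense in which an optimizer behind this interface is a drop-in replacement.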

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Wider availability in the dominant deep-learning framework could shift more groups toward experimenting with meta-trained optimizers on production-scale workloads.
The observed benefit from standard schedules suggests hybrid training recipes that mix learned and classical components may become common.
Porting similar accelerations to other frameworks could further lower the barrier for adopting these optimizers.

Load-bearing premise

The reported speedups and compatibility with schedules and weight decay extend beyond the single ViT-B/16 setup and batch size tested.

What would settle it

Measuring throughput and compatibility on a different model scale or task, such as a language-model pre-training run at larger batch size, and finding either no meaningful speedup or broken integration with standard schedules.

Original abstract

Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances such as VeLO, which was meta-trained for 4000 TPU-months, remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for independently using the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the remaining ~70% of machine learning community via the familiar torch.optim.Optimizer interface. Unlike prior work focused on limited-scale academic tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our systems contribution includes CUDA-accelerated implementations of the small fc lopt (Metz et al., 2022a) and VeLO (Metz et al., 2022b) learned optimizers, achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 and 49.73 to 205.59 and 191.18 samples per second, respectively. PyLO has the versatility that allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we discover that learned optimizers can substantially benefit from it. Our code is available at https://github.com/Belilovsky-Lab/pylo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PyLO is a practical PyTorch port of two existing learned optimizers with real CUDA speedups on the tested run, but the large-scale pre-training claims rest on a single narrow configuration.

PyLO is mainly a useful port that brings two learned optimizers into PyTorch with CUDA kernels, plus the observation that they work better when paired with standard schedules and weight decay. The speedups on the ViT-B/16 run are real and substantial.

The work does a good job on the engineering side. They expose the optimizers through the standard torch.optim interface, ship CUDA implementations that cut the overhead, and release the code. The throughput numbers are tied to specific hardware and look reproducible. The compatibility experiments show that adding LR schedules and weight decay helps rather than hurts, which is a practical finding worth noting.

The soft spot is the narrow experimental base. All the quantitative results come from ViT-B/16 at batch size 32. The paper talks about real-world large-scale pre-training, but there are no runs on bigger models, larger batches, or other tasks to back that up. At different scales the relative cost of the learned optimizer state could change, so the practical benefit might not carry over as cleanly. The compatibility claim is also only shown in that one setting.

This paper is for PyTorch practitioners who want to experiment with learned optimizers without moving to JAX. It lowers the barrier for that group. The core math is not new, but the implementation and the usage note are.

I would send it to peer review. The contribution is solid enough on the systems side to be worth referee time, even if the scope of the claims needs tightening.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PyLO, a PyTorch library that exposes learned optimizers (small fc lopt and VeLO) through the standard torch.optim.Optimizer interface. It contributes CUDA-accelerated implementations and reports concrete throughput gains on ViT-B/16 (batch size 32) from 39.36/49.73 to 205.59/191.18 samples per second, together with compatibility results when the learned optimizers are combined with external learning-rate schedules and weight decay.

Significance. If the reported measurements hold, PyLO removes a practical barrier for the majority of the ML community that uses PyTorch rather than JAX. The provision of open-source code, concrete wall-clock throughput numbers tied to a stated hardware setup, and the demonstration that learned optimizers can be grafted onto standard schedules are genuine strengths that could encourage wider experimentation.

major comments (1)

Abstract: the positioning around 'real-world large-scale pre-training tasks' is not supported by the experimental evidence, which is restricted to ViT-B/16 at batch size 32. The relative cost of per-parameter state and custom kernels, as well as the value of grafting external schedules, can change materially at the model scales and batch sizes typical of pre-training; without additional configurations the practical-utility claim rests on extrapolation.

minor comments (2)

The manuscript would be strengthened by stating the exact hardware, PyTorch version, and commit hash used to obtain the throughput figures, so that the numbers can be independently reproduced.
A short limitations paragraph discussing memory overhead or scaling behavior of the learned-optimizer state would help readers assess applicability beyond the reported setting.
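
As an illustration of the reproducibility point above, the sketch below shows one common way a samples-per-second figure can be measured. It is not the paper's benchmarking harness, which this page does not describe: `measure_throughput`, its parameters, and the dummy step are all hypothetical. The `sync` hook is where a GPU run would typically wait for queued kernels to finish (for example by calling `torch.cuda.synchronize`), since otherwise asynchronous launches make the timing look faster than it is.

```python
import time


def measure_throughput(step_fn, batch_size, n_warmup=5, n_iters=20,
                       sync=lambda: None, clock=time.perf_counter):
    """Return samples per second for repeated calls to step_fn.

    step_fn : callable running one full training step on one batch
    sync    : callable that blocks until device work is done (no-op on CPU)
    clock   : monotonic timer, injectable so the function can be tested
    """
    # Warm-up iterations are excluded from timing (caches, lazy init, autotuning).
    for _ in range(n_warmup):
        step_fn()
    sync()
    start = clock()
    for _ in range(n_iters):
        step_fn()
    sync()  # make sure all timed work has actually finished
    elapsed = clock() - start
    return batch_size * n_iters / elapsed


def dummy_step():
    # Placeholder workload standing in for forward + backward + optimizer step.
    return sum(i * i for i in range(10_000))


rate = measure_throughput(dummy_step, batch_size=32)
print(f"{rate:.1f} samples/sec (dummy workload, not a real model)")
```

Reporting the hardware, library versions, warm-up count, and number of timed iterations alongside such a figure is what lets a reader rerun the same measurement.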

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The single major comment concerns the abstract's framing of the work around real-world large-scale pre-training. We address it directly below and will revise the manuscript accordingly.

Referee: Abstract: the positioning around 'real-world large-scale pre-training tasks' is not supported by the experimental evidence, which is restricted to ViT-B/16 at batch size 32. The relative cost of per-parameter state and custom kernels, as well as the value of grafting external schedules, can change materially at the model scales and batch sizes typical of pre-training; without additional configurations the practical-utility claim rests on extrapolation.

Authors: We agree that the reported experiments use ViT-B/16 at batch size 32 and that the relative overhead of learned-optimizer state and kernels, as well as the benefit of external schedules, could differ at the larger model sizes and batch sizes common in pre-training. The abstract's phrasing was intended to convey the library's design goal and intended use case rather than to claim that the current measurements already demonstrate performance at those scales. We will revise the abstract to state the experimental scope more precisely while preserving the motivation that PyLO is intended to support larger-scale work.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical throughput results rest on direct measurement

The paper presents a PyTorch library with CUDA kernels for learned optimizers and reports wall-clock throughput numbers on one concrete configuration (ViT-B/16, batch size 32). No equations, fitted parameters, or derivations are invoked to support the speed-up or compatibility claims; the numbers are obtained by running the implemented code. Citations to Metz et al. (2022) are external references to the original learned-optimizer definitions and do not form a self-citation chain that justifies the present results. Because the central claims are benchmark measurements rather than a derivation that reduces to its own inputs, the work is self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented entities. All claims rest on standard CUDA programming, the existing learned-optimizer weights from Metz et al., and direct runtime measurement.

pith-pipeline@v0.9.0 · 5826 in / 1244 out tokens · 24428 ms · 2026-05-19T09:26:15.713374+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
Paper passage: "CUDA-accelerated implementations of the small fc lopt ... achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 to 205.59 samples/sec"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat ≃ Nat recovery · relation to the paper passage: unclear
Paper passage: "PyLO ... seamlessly integrates with torch.optim.Optimizer and the Huggingface ecosystem"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization
cs.CV · 2026-05 · unverdicted · novelty 7.0

A distilled diffusion model generates clinically feasible fluence maps for VMAT and an LSTM-based optimizer refines them to meet dose objectives, improving efficiency and deliverability on prostate cancer data.