PyLO: Towards Accessible Learned Optimizers in PyTorch
Pith reviewed 2026-05-19 09:26 UTC · model grok-4.3
The pith
PyLO makes learned optimizers like VeLO available in PyTorch, with CUDA-accelerated implementations that exceed 190 samples per second when training ViT-B/16 at batch size 32.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PyLO supplies CUDA-accelerated implementations of small fc lopt and VeLO that integrate directly into PyTorch training loops. On ViT-B/16 (batch size 32), throughput rises from 39.36 to 205.59 samples per second for small fc lopt and from 49.73 to 191.18 samples per second for VeLO. The library also supports combining these optimizers with learning-rate schedules and weight decay.
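The throughput figures above can be turned into speedup factors with simple division. The sketch below only restates numbers quoted in the abstract; the dictionary keys and the helper name `speedup` are illustrative labels, not identifiers from PyLO.

```python
# Throughput in samples per second on ViT-B/16 at batch size 32,
# as quoted in the abstract: (before, after CUDA acceleration).
throughput = {
    "small_fc_lopt": (39.36, 205.59),
    "VeLO": (49.73, 191.18),
}


def speedup(before: float, after: float) -> float:
    """Ratio of accelerated throughput to the original throughput."""
    return after / before


speedups = {name: speedup(b, a) for name, (b, a) in throughput.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Under these figures the small fc lopt gain is roughly 5.2x and the VeLO gain roughly 3.8x, both for the single reported configuration.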
What carries the argument
CUDA-accelerated learned optimizer modules exposed via the torch.optim.Optimizer interface.
If this is right
- Learned optimizers become practical drop-in options for PyTorch users working on large-scale pre-training.
- Combining learned optimizers with learning-rate schedules and weight decay produces better results than using the learned optimizer alone.
- The same interface allows researchers to test learned optimizers on their own models without rewriting training code.
- Accessibility removes the JAX barrier that previously limited VeLO to a small subset of the community.
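The drop-in claim in the list above rests on the torch.optim.Optimizer interface, which is what PyTorch learning-rate schedulers attach to. The sketch below shows that wiring with a stand-in optimizer. It assumes PyTorch is installed, uses `torch.optim.SGD` purely as a placeholder because the PyLO import path and constructor arguments are not given in this text, and applies AdamW-style decoupled weight decay by hand. Whether a learned optimizer consumes the scheduled learning rate the same way is an assumption, not something confirmed here.

```python
import torch
from torch import nn


def make_optimizer(params, lr):
    # Placeholder: any torch.optim.Optimizer subclass can be returned here.
    # A learned optimizer exposed through the same interface would slot in.
    return torch.optim.SGD(params, lr=lr)


def decoupled_weight_decay_(params, lr, weight_decay):
    """In-place decoupled decay: p <- p * (1 - lr * weight_decay)."""
    with torch.no_grad():
        for p in params:
            p.mul_(1.0 - lr * weight_decay)


torch.manual_seed(0)
model = nn.Linear(8, 2)
optimizer = make_optimizer(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Synthetic regression data, for illustration only.
x = torch.randn(32, 8)
y = torch.randn(32, 2)

lrs = []
for step in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    current_lr = scheduler.get_last_lr()[0]
    decoupled_weight_decay_(model.parameters(), current_lr, weight_decay=0.01)
    lrs.append(current_lr)
    scheduler.step()  # advance the schedule after the optimizer step
```

The training loop never inspects which optimizer it holds, which is the property that would let a learned optimizer replace the placeholder without other code changes.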
Where Pith is reading between the lines
- Wider availability in the dominant deep-learning framework could shift more groups toward experimenting with meta-trained optimizers on production-scale workloads.
- The observed benefit from standard schedules suggests hybrid training recipes that mix learned and classical components may become common.
- Porting similar accelerations to other frameworks could further lower the barrier for adopting these optimizers.
Load-bearing premise
The reported speedups and compatibility with schedules and weight decay extend beyond the single ViT-B/16 setup and batch size tested.
What would settle it
Measuring throughput and compatibility on a different model scale or task, such as a language-model pre-training run at larger batch size, and finding no meaningful speedup or broken integration with standard schedules.
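A throughput check of this kind can be sketched as a small timing harness. The code below is a generic illustration, not the paper's benchmark: `measure_throughput` and `dummy_step` are hypothetical names, and the dummy workload stands in for a real forward pass, backward pass, and optimizer step.

```python
import time
from typing import Callable


def measure_throughput(
    step_fn: Callable[[], None],
    batch_size: int,
    warmup_steps: int = 3,
    timed_steps: int = 10,
) -> float:
    """Return samples per second over `timed_steps` calls to `step_fn`.

    For GPU training, `step_fn` should block until device work has finished
    (for example by calling torch.cuda.synchronize() at its end); otherwise
    the timer only measures kernel launch overhead.
    """
    for _ in range(warmup_steps):  # exclude one-time setup costs
        step_fn()
    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * timed_steps / elapsed


def dummy_step() -> None:
    # Placeholder workload standing in for one training step.
    sum(i * i for i in range(10_000))


rate = measure_throughput(dummy_step, batch_size=32)
print(f"{rate:.1f} samples/s")
```

Running the same harness with two optimizers on an identical model, batch size, and device gives directly comparable samples-per-second figures.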
Original abstract
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances such as VeLO, which was meta-trained for 4000 TPU-months, remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for independently using the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the remaining ~70% of machine learning community via the familiar torch.optim.Optimizer interface. Unlike prior work focused on limited-scale academic tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our systems contribution includes CUDA-accelerated implementations of the small fc lopt (Metz et al., 2022a) and VeLO (Metz et al., 2022b) learned optimizers, achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 and 49.73 to 205.59 and 191.18 samples per second, respectively. PyLO has the versatility that allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we discover that learned optimizers can substantially benefit from it. Our code is available at https://github.com/Belilovsky-Lab/pylo
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PyLO, a PyTorch library that exposes learned optimizers (small fc lopt and VeLO) through the standard torch.optim.Optimizer interface. It contributes CUDA-accelerated implementations and reports concrete throughput gains on ViT-B/16 (batch size 32) from 39.36/49.73 to 205.59/191.18 samples per second, together with compatibility results when the learned optimizers are combined with external learning-rate schedules and weight decay.
Significance. If the reported measurements hold, PyLO removes a practical barrier for the majority of the ML community that uses PyTorch rather than JAX. The provision of open-source code, concrete wall-clock throughput numbers tied to a stated model and batch size, and the demonstration that learned optimizers can be grafted onto standard schedules are genuine strengths that could encourage wider experimentation.
major comments (1)
- Abstract: the positioning around 'real-world large-scale pre-training tasks' is not supported by the experimental evidence, which is restricted to ViT-B/16 at batch size 32. The relative cost of per-parameter state and custom kernels, as well as the value of grafting external schedules, can change materially at the model scales and batch sizes typical of pre-training; without additional configurations the practical-utility claim rests on extrapolation.
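The concern about per-parameter state can be made concrete with a back-of-the-envelope estimate. The sketch below assumes float32 state and an approximate ViT-B/16 parameter count of 86 million. The number of per-parameter buffers a learned optimizer keeps is not stated in this text, so it is swept as a hypothetical value; only the Adam row (two buffers, the first and second moment estimates) reflects a known optimizer.

```python
def optimizer_state_bytes(
    num_params: int, state_slots: int, bytes_per_value: int = 4
) -> int:
    """Memory for optimizer state: one value per parameter per slot."""
    return num_params * state_slots * bytes_per_value


# Assumption: approximate parameter count commonly quoted for ViT-B/16.
VIT_B16_PARAMS = 86_000_000

# Adam keeps two float32 buffers per parameter.
adam_bytes = optimizer_state_bytes(VIT_B16_PARAMS, state_slots=2)
print(f"Adam: {adam_bytes / 2**20:.0f} MiB")

# Hypothetical slot counts for a learned optimizer (not taken from the paper).
for slots in (4, 8, 16):
    size = optimizer_state_bytes(VIT_B16_PARAMS, state_slots=slots)
    print(f"{slots} slots: {size / 2**20:.0f} MiB")
```

Because the estimate is linear in both parameter count and slot count, any extra buffers scale with model size, which is why measurements at a single model scale leave the question open.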
minor comments (2)
- The manuscript would be strengthened by stating the exact hardware, PyTorch version, and commit hash used to obtain the throughput figures, so that the numbers can be independently reproduced.
- A short limitations paragraph discussing memory overhead or scaling behavior of the learned-optimizer state would help readers assess applicability beyond the reported setting.
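The reproducibility details requested in the first minor comment can be captured automatically alongside each benchmark run. The sketch below is one possible way to do so, with hypothetical helper names. It falls back gracefully when git or PyTorch is unavailable, and it records only what the running machine reports.

```python
import json
import platform
import subprocess
import sys


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def torch_info() -> dict:
    """Report PyTorch, CUDA, and GPU details when PyTorch is importable."""
    try:
        import torch
    except ImportError:
        return {"torch": "not installed"}
    info = {"torch": torch.__version__, "cuda": torch.version.cuda}
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


def environment_record() -> dict:
    record = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "commit": git_commit(),
    }
    record.update(torch_info())
    return record


record = environment_record()
print(json.dumps(record, indent=2))
```

Storing this record next to each throughput figure would let a reader match a reported number to a specific commit, library version, and device.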
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The single major comment concerns the abstract's framing of the work around real-world large-scale pre-training. We address it directly below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract: the positioning around 'real-world large-scale pre-training tasks' is not supported by the experimental evidence, which is restricted to ViT-B/16 at batch size 32. The relative cost of per-parameter state and custom kernels, as well as the value of grafting external schedules, can change materially at the model scales and batch sizes typical of pre-training; without additional configurations the practical-utility claim rests on extrapolation.
  Authors: We agree that the reported experiments use ViT-B/16 at batch size 32 and that the relative overhead of learned-optimizer state and kernels, as well as the benefit of external schedules, could differ at the larger model sizes and batch sizes common in pre-training. The abstract's phrasing was intended to convey the library's design goal and intended use case rather than to claim that the current measurements already demonstrate performance at those scales. We will revise the abstract to state the experimental scope more precisely while preserving the motivation that PyLO is intended to support larger-scale work.
  Revision: yes
Circularity Check
No circularity: empirical throughput results rest on direct measurement
Full rationale
The paper presents a PyTorch library with CUDA kernels for learned optimizers and reports wall-clock throughput numbers on one concrete configuration (ViT-B/16, batch size 32). No equations, fitted parameters, or derivations are invoked to support the speedup or compatibility claims; the numbers are obtained by running the implemented code. Citations to Metz et al. (2022a, 2022b) are external references to the original learned-optimizer definitions and do not form a self-citation chain that justifies the present results. Because the central claims are benchmark measurements rather than a derivation that reduces to its own inputs, the work is self-contained with no circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "CUDA-accelerated implementations of the small fc lopt ... achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 to 205.59 samples/sec"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat ≃ Nat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "PyLO ... seamlessly integrates with torch.optim.Optimizer and the Huggingface ecosystem"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization
  A distilled diffusion model generates clinically feasible fluence maps for VMAT and an LSTM-based optimizer refines them to meet dose objectives, improving efficiency and deliverability on prostate cancer data.