On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
Pith reviewed 2026-05-17 22:01 UTC · model grok-4.3
The pith
MeZO lets larger models fine-tune on edge devices by estimating gradients from forward passes alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memory-efficient zeroth-order optimization (MeZO) alleviates the memory bottleneck in on-device fine-tuning by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. The paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training, then numerically validates that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
What carries the argument
Memory-efficient zeroth-order optimization (MeZO), a gradient estimation technique that relies solely on forward model evaluations to avoid storing activations and optimizer states during fine-tuning.
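A minimal sketch of the forward-only estimator described above, assuming the two-point (SPSA-style) estimate with a seed-regenerated Gaussian perturbation, as in the MeZO reference by Malladi et al. The names `mezo_step` and `_perturb`, the flat NumPy parameter vector, and the default hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def _perturb(params, seed, scale):
    """Add scale * z to params in place; z is regenerated from the seed."""
    z = np.random.default_rng(seed).standard_normal(params.shape)
    params += scale * z


def mezo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One two-point zeroth-order update on a flat float array (in place).

    Hypothetical helper for illustration. Only forward evaluations of
    loss_fn are used; no activations or optimizer states are kept.
    """
    _perturb(params, seed, +eps)          # theta + eps * z
    loss_plus = loss_fn(params)
    _perturb(params, seed, -2.0 * eps)    # theta - eps * z
    loss_minus = loss_fn(params)
    _perturb(params, seed, +eps)          # back to theta (up to rounding)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD-style update along the same regenerated direction z.
    _perturb(params, seed, -lr * projected_grad)
    return projected_grad
```

This sketch draws the whole direction `z` at once for brevity, which temporarily allocates one extra vector the size of the parameters. An implementation that aims to stay at inference-level memory would likely regenerate and apply the perturbation one tensor at a time.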
If this is right
- Larger models become feasible for on-device adaptation because intermediate storage is no longer required.
- Under tight memory limits, MeZO can reach higher final accuracy than backpropagation, which must fall back to a smaller model or aggressive checkpointing to fit the same budget.
- Fine-tuning requires more wall-clock time due to the higher number of forward passes needed for gradient estimates.
- A closed-form estimate exists for the factor by which model size can increase when switching from backpropagation to MeZO.
Where Pith is reading between the lines
- Edge deployments could prioritize MeZO when model size is the binding constraint and latency requirements allow slower adaptation.
- Hardware accelerators optimized for forward passes might reduce the time penalty of MeZO enough to make it competitive in more settings.
- The same forward-only gradient approach could extend to continual learning scenarios where memory must be shared across multiple tasks.
Load-bearing premise
Sufficient wall-clock time is available for the longer fine-tuning process required by MeZO.
What would settle it
An experiment where, under the same strict on-device memory budget that prevents standard backpropagation from fitting the larger model at all, MeZO produces lower task accuracy than a checkpointed backpropagation baseline on a smaller model that does fit, even after extended training time.
Original abstract
On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies memory-efficient zeroth-order optimization (MeZO) to on-device fine-tuning. It derives theoretical estimates showing that MeZO permits substantially larger models to fit in on-device memory than backpropagation-based training because it avoids storing activations and optimizer states. Numerical experiments are presented to validate that MeZO achieves accuracy advantages over memory-constrained BP baselines when sufficient wall-clock time is available for the longer fine-tuning process.
Significance. If the central claims hold, the work would provide a practical route to fine-tuning larger models directly on edge hardware under strict memory limits, with direct relevance to privacy-sensitive and latency-critical applications. The theoretical model-size comparison supplies a concrete, reusable framework for memory budgeting, and the numerical validation offers initial evidence of accuracy gains under the stated time proviso.
major comments (1)
- [Abstract and §4] Abstract and §4 (Numerical Validation): the central accuracy-advantage claim is explicitly conditioned on 'sufficient wall-clock time' being available, yet the manuscript provides neither quantitative bounds on the iteration overhead of MeZO relative to BP nor time-versus-accuracy curves or total wall-clock measurements; this assumption is load-bearing because zeroth-order methods incur higher per-step cost and slower convergence, and without such data the practical scope of the claimed advantage cannot be assessed.
minor comments (2)
- [§3] §3 (Theoretical Analysis): the derivation of the relative model-size ratio would benefit from an explicit statement of the memory model assumptions (e.g., whether activation checkpointing is included for the BP baseline).
- [Figures and §4] Figure captions and experimental details: error bars or multiple random seeds should be reported for the accuracy comparisons to allow readers to judge statistical reliability.
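The seed-level reporting requested in the last comment can be sketched as follows. This is an illustrative helper only: the name `summarize_seeds` and the example accuracies are hypothetical and do not come from the manuscript.

```python
from statistics import mean, stdev


def summarize_seeds(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if not accuracies:
        raise ValueError("need at least one run")
    # A single run has no spread to report, so fall back to 0.0.
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread


# Hypothetical final accuracies from three random seeds.
avg, spread = summarize_seeds([0.70, 0.72, 0.74])
print(f"accuracy = {avg:.3f} +/- {spread:.3f}")
```

The sample standard deviation is one reasonable choice for an error bar; a confidence interval or min/max range would serve the same purpose.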
Simulated Author's Rebuttal
Thank you for reviewing our manuscript and for the positive assessment of its significance. We address the major comment below and have made revisions to strengthen the presentation of the time-accuracy tradeoff.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Numerical Validation): the central accuracy-advantage claim is explicitly conditioned on 'sufficient wall-clock time' being available, yet the manuscript provides neither quantitative bounds on the iteration overhead of MeZO relative to BP nor time-versus-accuracy curves or total wall-clock measurements; this assumption is load-bearing because zeroth-order methods incur higher per-step cost and slower convergence, and without such data the practical scope of the claimed advantage cannot be assessed.
Authors: We agree that explicit quantification of the wall-clock overhead is necessary to delineate the practical conditions under which MeZO's accuracy advantage holds. The original experiments in §4 compare final accuracies under identical iteration budgets while respecting memory limits, with the time proviso stated in the abstract and introduction. To address this directly, the revised manuscript adds a new subsection in §4 that (i) derives a theoretical bound on per-update cost (MeZO requires two forward passes per gradient estimate versus one forward-backward pair for BP, yielding an approximate 2–4× increase in FLOPs per step depending on batch size and perturbation strategy) and (ii) includes time-versus-accuracy curves obtained by re-plotting the existing experimental logs after scaling iteration counts by the estimated per-step cost ratio. These additions make the scope of the claimed advantage concrete without requiring new hardware runs. revision: yes
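The rescaling step mentioned in the response, turning per-iteration logs into time-versus-accuracy curves, could look roughly like the sketch below. The function names, the per-step costs, and the log values are hypothetical placeholders; measured wall-clock times would be preferable to an estimated cost ratio where they are available.

```python
def to_wall_clock(iterations, seconds_per_step):
    """Convert logged iteration counts to estimated elapsed seconds."""
    return [it * seconds_per_step for it in iterations]


def time_to_accuracy(times, accuracies, target):
    """First estimated time at which accuracy reaches target, else None."""
    for t, acc in zip(times, accuracies):
        if acc >= target:
            return t
    return None


# Hypothetical logs: (iteration, accuracy) checkpoints for one run.
iters = [0, 100, 200]
accs = [0.50, 0.60, 0.70]

# Hypothetical per-step cost in seconds for this method on this device.
times = to_wall_clock(iters, seconds_per_step=0.5)
print(time_to_accuracy(times, accs, target=0.60))
```

Applying the same conversion to each method with its own per-step cost puts the curves on a shared time axis, which is what the comparison under a wall-clock budget needs.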
Circularity Check
No significant circularity; model-size estimates follow from standard memory accounting
full rationale
The paper's theoretical estimate of relative model sizes under BP versus MeZO follows directly from conventional memory accounting (weights plus activations and optimizer states for BP; weights only for forward-only MeZO). This accounting is independent of the accuracy results and does not reduce to a fitted parameter or self-referential definition. Numerical validation is presented as a demonstration under the explicit caveat of sufficient wall-clock time, with no evidence of self-citation load-bearing steps, ansatz smuggling, or renaming of known results. The derivation chain remains self-contained against external benchmarks of memory-constrained optimization.
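The accounting described in this rationale can be made concrete with a small calculator. This is a simplified sketch and not the paper's closed-form estimate: the function names, the per-parameter byte counts, and the treatment of activation memory as a fixed number of bytes independent of model size are all assumptions made for illustration.

```python
def max_params_mezo(budget_bytes, weight_bytes=2.0):
    """Largest parameter count when only the weights must be resident."""
    return budget_bytes / weight_bytes


def max_params_bp(budget_bytes, weight_bytes=2.0, state_bytes=6.0,
                  activation_bytes=0.0):
    """Largest parameter count for BP under the simplified accounting.

    state_bytes: assumed per-parameter bytes for gradients plus optimizer
    states. activation_bytes: assumed total bytes of stored activations,
    treated as independent of model size (a simplification).
    """
    usable = max(budget_bytes - activation_bytes, 0.0)
    return usable / (weight_bytes + state_bytes)


def size_ratio(budget_bytes, weight_bytes=2.0, state_bytes=6.0,
               activation_bytes=0.0):
    """How many times larger a model MeZO fits than BP, same budget."""
    bp = max_params_bp(budget_bytes, weight_bytes, state_bytes,
                       activation_bytes)
    mezo = max_params_mezo(budget_bytes, weight_bytes)
    return mezo / bp if bp > 0 else float("inf")
```

With these assumed defaults and no activation memory, the ratio reduces to `(weight_bytes + state_bytes) / weight_bytes`. Any stored activations shrink the BP side further, which is the qualitative effect the rationale relies on.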
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: zeroth-order gradient estimates via forward passes are sufficiently accurate for fine-tuning under the tested conditions.
Reference graph
Works this paper leans on
-
[1]
On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
INTRODUCTION Context and Motivation: A key avenue for advancing artificial intelligence (AI) performance is tailoring models to specific tasks and user needs via fine-tuning [1], particularly for agentic systems [2]. User demands for modern AI models are not static, making fine-tuning a recurring necessity rather than a one-time cost. On-device fine-tu...
-
[2]
MEMORY REQUIREMENTS OF BP AND MEZO In this section, we analyze the memory requirements of BP- and ZO-based fine-tuning for Transformer models with the aim of providing theoretical limits on the model size that can be accommodated under a fixed on-chip memory budget. Throughout this section, as illustrated in Fig. 2, we focus on a basic decoder-only Transf...
-
[3]
EXPERIMENTS Fig. 4. On-device fine-tuning for the BoolQ dataset [22] as a function of fine-tuning wall-clock time [s]. We consider MeZO with Llama2-7B and Llama2-13B, while BP adopts the GPT2-medium model. The batch size is B = 8. All models require a similar memory consumption of around 17 GB according to the analysis in Sec. 2 when setting L′/L = 0.15 for Lla...
-
[4]
CONCLUSIONS MeZO optimization eliminates the need to store activations and optimizer states, reducing training memory demands to the level of inference memory. This enables on-device learning with larger models and longer contexts. We have analyzed the relative memory requirements of MeZO compared to conventional BP-based fine-tuning and presented exper...
-
[5]
V. Lialin, V. Deshpande, and A. Rumshisky, “Scaling down to scale up: A guide to parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.15647, 2023
-
[6]
Fireact: Toward language agent fine-tuning
B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao, “Fireact: Toward language agent fine-tuning,” arXiv preprint arXiv:2310.05915, 2023
-
[7]
From llms to edge: Parameter-efficient fine-tuning on edge devices,
G. Slamanig, F. Corti, and O. Saukh, “From llms to edge: Parameter-efficient fine-tuning on edge devices,” arXiv preprint arXiv:2507.23536, 2025
-
[8]
Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective,
R. Aralimatti, S. A. G. Shakhadri, K. R. Kruthika, and K. B. Angadi, “Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective,” in Intelligent Systems and Applications, K. Arai, Ed. Springer Nature Switzerland, 2025, pp. 503–520
-
[9]
Small Language Models are the Future of Agentic AI
P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, and P. Molchanov, “Small language models are the future of agentic ai,” arXiv preprint arXiv:2506.02153, 2025
-
[10]
Fine-tuning language models with just forward passes,
S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” Advances in Neural Information Processing Systems, vol. 36, pp. 53038–53075, 2023
-
[11]
PocketLLM: Enabling on-device fine-tuning for personalized LLMs,
D. Peng, Z. Fu, and J. Wang, “PocketLLM: Enabling on-device fine-tuning for personalized LLMs,” in Proceedings of the Fifth Workshop on Privacy in Natural Language Processing. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 91–96. [Online]. Available: https://aclanthology.org/2024.privatenlp-1.10/
-
[12]
Reducing activation recomputation in large transformer models,
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” Proceedings of Machine Learning and Systems, vol. 5, pp. 341–353, 2023
-
[13]
The cost of avoiding backpropagation,
K. Panchal et al., “The cost of avoiding backpropagation,” 2025, arXiv preprint. [Online]. Available: https://arxiv.org/abs/2506.21833
-
[14]
An overview of the simultaneous perturbation method for efficient optimization,
J. C. Spall, “An overview of the simultaneous perturbation method for efficient optimization,” Johns Hopkins APL Technical Digest, vol. 19, no. 4, pp. 482–492, 1998
-
[15]
Optimal rates for zero-order convex optimization: The power of two function evaluations,
J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015
-
[16]
Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning,
Y. Liu, Z. Zhu, C. Gong, M. Cheng, C.-J. Hsieh, and Y. You, “Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning,” in International Conference on Machine Learning (ICML). [Online]. Available: https://arxiv.org/abs/2402.11888
-
[18]
Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang, “AdaZeta: Adaptive zeroth-order tensor-train adaptation for memory-efficient large language model fine-tuning,” in Empirical Methods in Natural Language Processing (EMNLP), 2024. [Online]. Available: https://arxiv.org/abs/2406.08301
-
[19]
Towards efficient low-order hybrid optimizer for language model fine-tuning,
M. Chen, Y.-L. Huang, and Z. Wen, “Towards efficient low-order hybrid optimizer for language model fine-tuning,” in AAAI Conference on Artificial Intelligence, 2025, preprint. [Online]. Available: https://arxiv.org/abs/2409.18075
-
[20]
Roformer: Enhanced transformer with rotary position embedding,
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomput., vol. 568, no. C, Feb. 2024. [Online]. Available: https://doi.org/10.1016/j.neucom.2023.127063
-
[21]
Longformer: The Long-Document Transformer
I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020
-
[22]
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=B1ckMDqlg
-
[23]
Adaptive mixtures of local experts,
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991
-
[24]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
-
[25]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023
-
[26]
Z. Huang, J. Hu, H. Lin, C. Zhu, Y. Tang, Q. Zhang, Z. Guo, Z. Li, S. Yan, Z. Zhu et al., “Reducing gpu memory fragmentation via spatio-temporal planning for efficient large-scale model training,” arXiv preprint arXiv:2507.16274, 2025
-
[27]
Boolq: Exploring the surprising difficulty of natural yes/no questions,
C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in NAACL, 2019
-
[28]
Making pre-trained language models better few-shot learners
T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” arXiv preprint arXiv:2012.15723, 2020