On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

Bipin Rajendran; Houssem Sifaou; Osvaldo Simeone; Prabodh Katti; Sangwoo Park

arxiv: 2511.11362 · v2 · submitted 2025-11-14 · 💻 cs.LG · cs.CL

Prabodh Katti, Houssem Sifaou, Sangwoo Park, Bipin Rajendran, Osvaldo Simeone

Pith reviewed 2026-05-17 22:01 UTC · model grok-4.3

keywords on-device fine-tuning · zeroth-order optimization · memory-efficient training · edge AI · backpropagation-free · model adaptation · gradient estimation

The pith

MeZO lets larger models fine-tune on edge devices by estimating gradients from forward passes alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conventional backpropagation for on-device fine-tuning requires storing layer activations and optimizer states, which restricts the largest model that fits in device memory. Memory-efficient zeroth-order optimization, called MeZO, replaces this with gradient estimates from forward evaluations only, removing those storage needs. The authors derive a theoretical bound on the relative model sizes possible under each method and run experiments to check performance. If the approach works as described, edge systems could adapt larger models to new tasks without extra memory hardware, though fine-tuning would take more time. This matters for practical deployment of adaptive AI on constrained devices like phones or sensors.

Core claim

Memory-efficient zeroth-order optimization (MeZO) alleviates the memory bottleneck in on-device fine-tuning by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. The paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training, then numerically validates that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.

What carries the argument

Memory-efficient zeroth-order optimization (MeZO), a gradient estimation technique that relies solely on forward model evaluations to avoid storing activations and optimizer states during fine-tuning.
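
To make the forward-only estimate concrete, here is a minimal NumPy sketch of a two-point, SPSA-style zeroth-order step of the kind MeZO builds on. It is an illustration under stated assumptions, not the authors' implementation: the names `mezo_step` and `_noise`, the hyperparameters `lr` and `eps`, and the toy quadratic loss are all hypothetical. The perturbation is regenerated from a seed each time it is needed instead of being kept alive, which mirrors the memory-saving idea, although this NumPy version still materializes a temporary full-size noise array.

```python
import numpy as np


def _noise(seed, shape):
    # Regenerate the same Gaussian direction z from the seed on demand.
    return np.random.default_rng(seed).standard_normal(shape)


def mezo_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One zeroth-order update from two forward evaluations (hypothetical helper)."""
    shape = params.shape
    params += eps * _noise(seed, shape)        # move to theta + eps * z
    loss_plus = loss_fn(params)
    params -= 2.0 * eps * _noise(seed, shape)  # move to theta - eps * z
    loss_minus = loss_fn(params)
    params += eps * _noise(seed, shape)        # restore theta
    # Scalar finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    # SGD-style update along z; no activations or optimizer state are stored.
    params -= lr * projected_grad * _noise(seed, shape)
    return params, projected_grad


def quadratic_loss(p):
    # Toy objective used only for illustration.
    return 0.5 * float(p @ p)


theta = np.ones(10)
initial_loss = quadratic_loss(theta)
for step in range(500):
    theta, _ = mezo_step(theta, quadratic_loss, seed=step)
final_loss = quadratic_loss(theta)
```

Each step costs two loss evaluations and updates the parameters in place, so the only persistent state is the parameter vector and an integer seed. The price is a noisy search direction, which is where the longer fine-tuning time comes from.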

If this is right

Larger models become feasible for on-device adaptation because intermediate storage is no longer required.
Under tight memory limits, MeZO applied to a larger model can reach higher final accuracy than backpropagation, which must fall back to a smaller model or aggressive checkpointing to fit the same budget.
Fine-tuning requires more wall-clock time due to the higher number of forward passes needed for gradient estimates.
A closed-form estimate exists for the factor by which model size can increase when switching from backpropagation to MeZO.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Edge deployments could prioritize MeZO when model size is the binding constraint and latency requirements allow slower adaptation.
Hardware accelerators optimized for forward passes might reduce the time penalty of MeZO enough to make it competitive in more settings.
The same forward-only gradient approach could extend to continual learning scenarios where memory must be shared across multiple tasks.

Load-bearing premise

Sufficient wall-clock time is available for the longer fine-tuning process required by MeZO.

What would settle it

An experiment where, under the same strict on-device memory budget that prevents standard backpropagation from fitting the model at all, MeZO produces lower task accuracy than a checkpointed backpropagation baseline even after extended training time.

Original abstract

On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript applies memory-efficient zeroth-order optimization (MeZO; Malladi et al., 2023) to on-device fine-tuning. It derives theoretical estimates showing that MeZO permits substantially larger models to fit in on-device memory than backpropagation-based training because it avoids storing activations and optimizer states. Numerical experiments are presented to validate that MeZO achieves accuracy advantages over memory-constrained BP baselines when sufficient wall-clock time is available for the longer fine-tuning process.

Significance. If the central claims hold, the work would provide a practical route to fine-tuning larger models directly on edge hardware under strict memory limits, with direct relevance to privacy-sensitive and latency-critical applications. The theoretical model-size comparison supplies a concrete, reusable framework for memory budgeting, and the numerical validation offers initial evidence of accuracy gains under the stated time proviso.

major comments (1)

[Abstract and §4 (Numerical Validation)] The central accuracy-advantage claim is explicitly conditioned on 'sufficient wall-clock time' being available, yet the manuscript provides neither quantitative bounds on the iteration overhead of MeZO relative to BP nor time-versus-accuracy curves or total wall-clock measurements. This assumption is load-bearing because zeroth-order methods incur higher per-step cost and slower convergence, and without such data the practical scope of the claimed advantage cannot be assessed.

minor comments (2)

[§3 (Theoretical Analysis)] The derivation of the relative model-size ratio would benefit from an explicit statement of the memory model assumptions (e.g., whether activation checkpointing is included for the BP baseline).
[Figures and §4] Error bars or results over multiple random seeds should be reported in the figure captions and experimental details, so that readers can judge the statistical reliability of the accuracy comparisons.

Simulated Authors' Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript and for the positive assessment of its significance. We address the major comment below and have made revisions to strengthen the presentation of the time-accuracy tradeoff.

Referee: [Abstract and §4 (Numerical Validation)] The central accuracy-advantage claim is explicitly conditioned on 'sufficient wall-clock time' being available, yet the manuscript provides neither quantitative bounds on the iteration overhead of MeZO relative to BP nor time-versus-accuracy curves or total wall-clock measurements. This assumption is load-bearing because zeroth-order methods incur higher per-step cost and slower convergence, and without such data the practical scope of the claimed advantage cannot be assessed.

Authors: We agree that explicit quantification of the wall-clock overhead is necessary to delineate the practical conditions under which MeZO's accuracy advantage holds. The original experiments in §4 compare final accuracies under identical iteration budgets while respecting memory limits, with the time proviso stated in the abstract and introduction. To address this directly, the revised manuscript adds a new subsection in §4 that (i) derives a theoretical bound on per-update cost (MeZO requires two forward passes per gradient estimate versus one forward-backward pair for BP, yielding an approximate 2–4× increase in FLOPs per step depending on batch size and perturbation strategy) and (ii) includes time-versus-accuracy curves obtained by re-plotting the existing experimental logs after scaling iteration counts by the estimated per-step cost ratio. These additions make the scope of the claimed advantage concrete without requiring new hardware runs.

Revision: yes
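
The re-plotting step described in the rebuttal can be sketched as follows. This is a hypothetical illustration, not the authors' script: the function names `to_wallclock` and `time_to_reach`, the per-step costs, and every number in the logs are placeholders chosen only to show the mechanics of converting iteration-indexed accuracy logs into time-indexed curves.

```python
def to_wallclock(iterations, accuracies, seconds_per_step):
    """Map (iteration, accuracy) logs to (seconds, accuracy) points."""
    return [(it * seconds_per_step, acc) for it, acc in zip(iterations, accuracies)]


def time_to_reach(curve, target_accuracy):
    """Return the first time at which the curve meets the target, else None."""
    for seconds, acc in curve:
        if acc >= target_accuracy:
            return seconds
    return None


# Placeholder logs and per-step costs, not measurements from the paper.
iterations = [0, 100, 200, 400]
bp_curve = to_wallclock(iterations, [0.50, 0.62, 0.66, 0.68], seconds_per_step=0.5)
zo_curve = to_wallclock(iterations, [0.50, 0.55, 0.61, 0.70], seconds_per_step=1.5)
```

Comparing `time_to_reach(bp_curve, a)` with `time_to_reach(zo_curve, a)` for a target accuracy `a` shows how much wall-clock time counts as "sufficient" under the assumed per-step costs. A rescaling of this kind is only as reliable as the cost estimate that feeds it.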

Circularity Check

0 steps flagged

No significant circularity; model-size estimates follow from standard memory accounting

The paper's theoretical estimate of relative model sizes under BP versus MeZO follows directly from conventional memory accounting (weights plus activations and optimizer states for BP; weights only for forward-only MeZO). This accounting is independent of the accuracy results and does not reduce to a fitted parameter or self-referential definition. Numerical validation is presented as a demonstration under the explicit caveat of sufficient wall-clock time, with no evidence of self-citation load-bearing steps, ansatz smuggling, or renaming of known results. The derivation chain remains self-contained against external benchmarks of memory-constrained optimization.
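
The memory accounting described above can be sketched in a few lines. This is not the paper's closed-form expression, which is not reproduced on this page. It is a simplified illustration under loudly stated assumptions: the activation footprint is passed in as a fixed number of bytes (in practice it grows with model size, batch size, and context length), `extra_copies_per_param=3` stands for a gradient plus two Adam-style moments, and all function names and the 8 GiB budget are hypothetical.

```python
GIB = 2**30


def max_params_mezo(budget_bytes, bytes_per_param=2):
    """Forward-only training: the budget is spent on the weights alone."""
    return budget_bytes // bytes_per_param


def max_params_bp(budget_bytes, activation_bytes, bytes_per_param=2,
                  extra_copies_per_param=3):
    """BP: weights plus per-parameter training state, after reserving activations.

    extra_copies_per_param is an assumption (gradient + two optimizer moments).
    """
    remaining = budget_bytes - activation_bytes
    if remaining <= 0:
        return 0  # activations alone exceed the budget
    return remaining // (bytes_per_param * (1 + extra_copies_per_param))


# Hypothetical budget and activation footprint, for illustration only.
budget = 8 * GIB
mezo_params = max_params_mezo(budget)
bp_params = max_params_bp(budget, activation_bytes=2 * GIB)
size_ratio = mezo_params / bp_params
```

Under these placeholder numbers the forward-only budget accommodates roughly five times as many parameters. The actual factor depends on the precision, optimizer, checkpointing level, and activation model that the paper specifies.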

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions about memory scaling in neural network training and the validity of zeroth-order gradient estimates; no new entities or fitted parameters are introduced in the abstract.

axioms (1)

Domain assumption: Zeroth-order gradient estimates via forward passes are sufficiently accurate for fine-tuning under the tested conditions. Invoked implicitly in the numerical validation claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

[1]

On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

INTRODUCTION Context and Motivation: A key avenue for advancing artificial intelligence (AI) performance is tailoring models to specific tasks and user needs via fine-tuning [1], particularly for agentic systems [2]. User demands for modern AI models are not static, making fine-tuning a recurring necessity rather than a one-time cost. On-device fine-tu...

[2]

MEMORY REQUIREMENTS OF BP AND MEZO In this section, we analyze the memory requirements of BP- and ZO-based fine-tuning for Transformer models with the aim of providing theoretical limits on the model size that can be accommodated under a fixed on-chip memory budget. Throughout this section, as illustrated in Fig. 2, we focus on a basic decoder-only Transf...

[3]

EXPERIMENTS Fig. 4. On-device fine-tuning for the BoolQ dataset [22] as a function of fine-tuning wall-clock time [s]. We consider MeZO with Llama2-7B and Llama2-13B, while BP adopts the GPT2-medium model. The batch size is B = 8. All models require a similar memory consumption of around 17 GB according to the analysis in Sec. 2 when setting L′/L = 0.15 for Lla...

[4]

CONCLUSIONS MeZO optimization eliminates the need to store activations and optimizer states, reducing training memory demands to the level of inference memory. This enables on-device learning with larger models and longer contexts. We have analyzed the relative memory requirements of MeZO compared to conventional BP-based fine-tuning and presented exper...

[5]

Scaling down to scale up: A guide to parameter-efficient fine-tuning

V. Lialin, V. Deshpande, and A. Rumshisky, “Scaling down to scale up: A guide to parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.15647, 2023

[6]

FireAct: Toward language agent fine-tuning

B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao, “FireAct: Toward language agent fine-tuning,” arXiv preprint arXiv:2310.05915, 2023

[7]

From LLMs to edge: Parameter-efficient fine-tuning on edge devices

G. Slamanig, F. Corti, and O. Saukh, “From LLMs to edge: Parameter-efficient fine-tuning on edge devices,” arXiv preprint arXiv:2507.23536, 2025

[8]

Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective

R. Aralimatti, S. A. G. Shakhadri, K. R. Kruthika, and K. B. Angadi, “Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective,” in Intelligent Systems and Applications, K. Arai, Ed. Springer Nature Switzerland, 2025, pp. 503–520

[9]

Small Language Models are the Future of Agentic AI

P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, and P. Molchanov, “Small language models are the future of agentic AI,” arXiv preprint arXiv:2506.02153, 2025

[10]

Fine-tuning language models with just forward passes

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” Advances in Neural Information Processing Systems, vol. 36, pp. 53038–53075, 2023

[11]

PocketLLM: Enabling on-device fine-tuning for personalized LLMs

D. Peng, Z. Fu, and J. Wang, “PocketLLM: Enabling on-device fine-tuning for personalized LLMs,” in Proceedings of the Fifth Workshop on Privacy in Natural Language Processing. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 91–96. [Online]. Available: https://aclanthology.org/2024.privatenlp-1.10/

[12]

Reducing activation recomputation in large transformer models

V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” Proceedings of Machine Learning and Systems, vol. 5, pp. 341–353, 2023

[13]

The cost of avoiding backpropagation

K. Panchal et al., “The cost of avoiding backpropagation,” 2025, arXiv preprint. [Online]. Available: https://arxiv.org/abs/2506.21833

[14]

An overview of the simultaneous perturbation method for efficient optimization

J. C. Spall, “An overview of the simultaneous perturbation method for efficient optimization,” Johns Hopkins APL Technical Digest, vol. 19, no. 4, pp. 482–492, 1998

[15]

Optimal rates for zero-order convex optimization: The power of two function evaluations

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015

[16]

Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning

Y. Liu, Z. Zhu, C. Gong, M. Cheng, C.-J. Hsieh, and Y. You, “Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning,” in International Conference on Machine Learning (ICML),

[17]

[Online]. Available: https://arxiv.org/abs/2402.11888

[18]

AdaZeta: Adaptive zeroth-order tensor-train adaptation for memory-efficient large language model fine-tuning

Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang, “AdaZeta: Adaptive zeroth-order tensor-train adaptation for memory-efficient large language model fine-tuning,” in Empirical Methods in Natural Language Processing (EMNLP), 2024. [Online]. Available: https://arxiv.org/abs/2406.08301

[19]

Towards efficient low-order hybrid optimizer for language model fine-tuning

M. Chen, Y.-L. Huang, and Z. Wen, “Towards efficient low-order hybrid optimizer for language model fine-tuning,” in AAAI Conference on Artificial Intelligence, 2025, preprint. [Online]. Available: https://arxiv.org/abs/2409.18075

[20]

RoFormer: Enhanced transformer with rotary position embedding

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomput., vol. 568, no. C, Feb. 2024. [Online]. Available: https://doi.org/10.1016/j.neucom.2023.127063

[21]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020

[22]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=B1ckMDqlg

[23]

Adaptive mixtures of local experts

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991

[24]

Language models are unsupervised multitask learners

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

[25]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

[26]

Reducing GPU memory fragmentation via spatio-temporal planning for efficient large-scale model training

Z. Huang, J. Hu, H. Lin, C. Zhu, Y. Tang, Q. Zhang, Z. Guo, Z. Li, S. Yan, Z. Zhu et al., “Reducing GPU memory fragmentation via spatio-temporal planning for efficient large-scale model training,” arXiv preprint arXiv:2507.16274, 2025

[27]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “BoolQ: Exploring the surprising difficulty of natural yes/no questions,” in NAACL, 2019

[28]

Making pre-trained language models better few-shot learners

T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” arXiv preprint arXiv:2012.15723, 2020
