arxiv: 2601.10940 · v4 · submitted 2026-01-16 · 💻 cs.LG

HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training

Aakriti Lnu , Zhe Li , Dandan Liang , Chao Huang , Rui Li , Haibo Yang This is my paper

Pith reviewed 2026-05-16 13:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords split learninghybrid-order optimizationmemory-efficient trainingedge deviceszeroth-order optimizationfirst-order optimizationlarge language modelsconvergence rate

0 comments p. Extension

The pith

HOSL uses zeroth-order optimization on the client and first-order on the server to cut memory use in split learning for large models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HOSL to resolve the memory overhead that first-order split learning imposes on edge devices when training large models. First-order methods require clients to store activations for backpropagation, which largely cancels the benefit of model partitioning. Pure zeroth-order methods avoid this storage but converge slowly and reach lower accuracy. HOSL places memory-efficient zeroth-order gradient estimation on the client while keeping first-order optimization on the server. This produces a convergence rate governed only by client dimension and yields substantial memory savings with accuracy close to the first-order baseline.

Core claim

HOSL achieves an O(sqrt(d_c/TQ)) convergence rate that depends only on client-side dimension d_c, reduces client GPU memory by up to 3.7 times versus first-order split learning, and reaches accuracy within 0.20-4.23 percent of the first-order baseline while outperforming pure zeroth-order optimization by up to 15.55 percent on OPT models across six tasks.

What carries the argument

The hybrid-order partition that runs zeroth-order gradient estimation on the client-side model slice to eliminate backpropagation and activation storage, while the server applies standard first-order updates to the remaining parameters.

If this is right

Convergence rate improves as more computation is offloaded to the server because the bound depends solely on client dimension d_c.
Client devices can train OPT-scale models with up to 3.7 times lower GPU memory than first-order split learning.
Accuracy remains within a few percent of full first-order performance across multiple language-modeling tasks.
The hybrid strategy outperforms pure zeroth-order optimization in both convergence speed and final accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Choosing the split point to minimize client dimension could further enlarge memory savings for models larger than 1.3B parameters.
The same client-side zeroth-order plus server-side first-order pattern may apply to other memory-constrained distributed settings such as federated learning with heterogeneous devices.
Adding simple variance-reduction steps on the client estimates might shrink the observed accuracy gap to the first-order baseline.
Evaluating HOSL on vision or multimodal models would test whether the hybrid-order benefit generalizes beyond language models.

Load-bearing premise

Zeroth-order gradient estimates computed only on the client-side partition remain sufficiently accurate for the server-side first-order optimizer to converge at the claimed rate.

What would settle it

Run the same OPT-125M and OPT-1.3B experiments with a higher-variance client zeroth-order estimator or a split point that increases client dimension; if accuracy drops beyond 4.23 percent or memory reduction falls below 3 times while convergence exceeds O(sqrt(d_c/TQ)), the central claim does not hold.

Figures

Figures reproduced from arXiv: 2601.10940 by Aakriti Lnu, Chao Huang, Dandan Liang, Haibo Yang, Rui Li, Zhe Li.

**Figure 1.** Figure 1: Overview of HOSL Framework on resource-constrained edge devices. These limitations highlight the need for more memory-efficient training mechanisms tailored to LLM deployment in edge environments. Zeroth-Order (ZO) Optimization. ZO Optimization refers to a class of gradient-free methods that estimate gradient by only function value differences without requiring explicit gradient computation [8]–[10]. Rec… view at source ↗

**Figure 2.** Figure 2: Client GPU (CGPU) Memory Comparison between FO-FO and Our [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Split learning (SL) enables collaborative training of large language models (LLMs) between resource-constrained edge devices and compute-rich servers by partitioning model computation across the network boundary. However, existing SL systems predominantly rely on first-order (FO) optimization, which requires clients to store intermediate quantities such as activations for backpropagation. This results in substantial memory overhead, largely negating benefits of model partitioning. In contrast, zeroth-order (ZO) optimization eliminates backpropagation and significantly reduces memory usage, but often suffers from slow convergence and degraded performance. In this work, we propose HOSL, a novel Hybrid-Order Split Learning framework that addresses this fundamental trade-off between memory efficiency and optimization effectiveness by strategically integrating ZO optimization on the client side with FO optimization on the server side. By employing memory-efficient ZO gradient estimation at the client, HOSL eliminates backpropagation and activation storage, reducing client memory consumption. Meanwhile, server-side FO optimization ensures fast convergence and competitive performance. Theoretically, we show that HOSL achieves an $\mathcal{O}(\sqrt{d_c/TQ})$ rate, which depends on client-side model dimension $d_c$ rather than the full model dimension $d$, demonstrating that convergence improves as more computation is offloaded to the server. Extensive experiments on OPT models (125M and 1.3B parameters) across 6 tasks demonstrate that HOSL reduces client GPU memory by up to 3.7$\times$ compared to the FO method while achieving accuracy within 0.20%-4.23% of this baseline. Furthermore, HOSL outperforms the ZO baseline by up to 15.55%, validating the effectiveness of our hybrid strategy for memory-efficient training on edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HOSL pairs client ZO with server FO in split learning to cut memory 3.7x on OPT models while tying the rate to client dimension only, but the noise-decoupling assumption is the part that still needs checking.

read the letter

HOSL puts zeroth-order gradient estimates on the client side of split learning and keeps first-order optimization on the server. This removes backpropagation and activation storage from the client, which the experiments show cuts GPU memory by up to 3.7 times on OPT-125M and 1.3B models. Accuracy lands within 0.20 to 4.23 percent of the full first-order baseline and beats pure zeroth-order by as much as 15.55 percent across six tasks. The convergence claim of O(sqrt(d_c/TQ)) that depends only on client dimension rather than total model size is the clearest theoretical contribution here.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes HOSL, a hybrid-order split learning framework for memory-constrained edge training of LLMs. It integrates zeroth-order (ZO) optimization on the client side to eliminate backpropagation and activation storage, with first-order (FO) optimization on the server side for fast convergence. The paper claims a convergence rate of O(sqrt(d_c/TQ)) that depends only on the client-side dimension d_c, up to 3.7x reduction in client GPU memory versus FO split learning, and accuracy within 0.20%-4.23% of the FO baseline while outperforming pure ZO by up to 15.55% on OPT-125M and 1.3B models across 6 tasks.

Significance. If the convergence bound is rigorously established and the experimental gains are reproducible with proper controls, this approach could be significant for enabling training of large models on resource-limited devices by leveraging the strengths of both ZO and FO methods. The theoretical result that convergence improves with more offloading to the server is potentially impactful if the decoupling holds.

major comments (3)

Abstract: The O(sqrt(d_c/TQ)) convergence rate claim relies on the assumption that client-side ZO gradient estimates do not propagate variance through the activation interface to corrupt server-side FO optimization. No explicit bound on activation perturbation or variance reduction mechanism is provided, which is load-bearing for the claim that the rate depends solely on d_c rather than full model dimension d.
Experiments (as summarized in abstract): The reported accuracy within 0.20%-4.23% of FO baseline and up to 15.55% improvement over ZO lacks details on experimental protocol, such as choice of split point, value of Q, number of runs, and variance reporting. This makes it impossible to verify whether the gains survive different split depths or statistical tests.
Theoretical analysis: The derivation should include a sensitivity analysis w.r.t. split depth to confirm server parameters dominate the loss landscape; absent this, the rate may revert to depending on full d when the ZO variance couples back.

minor comments (2)

Abstract: Clarify whether the 0.20%-4.23% accuracy range is absolute or relative difference, and specify the tasks and models for each endpoint.
Notation: Ensure d_c, T, and Q are defined consistently when first introduced and that all equations receive numbers for cross-reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide our responses below. We believe these revisions will significantly improve the clarity and rigor of the paper.

read point-by-point responses

Referee: [—] Abstract: The O(sqrt(d_c/TQ)) convergence rate claim relies on the assumption that client-side ZO gradient estimates do not propagate variance through the activation interface to corrupt server-side FO optimization. No explicit bound on activation perturbation or variance reduction mechanism is provided, which is load-bearing for the claim that the rate depends solely on d_c rather than full model dimension d.

Authors: We appreciate the referee pointing out the need for a more explicit treatment of variance propagation in the theoretical analysis. While our derivation assumes that the ZO-induced noise on the client is controlled and the server FO optimization proceeds on the received activations, we acknowledge that a formal bound on the perturbation is missing. In the revised manuscript, we will include an additional lemma providing a bound on the activation perturbation variance, showing that it scales with the ZO estimation variance but does not alter the dependence on d_c in the convergence rate. This will be supported by a sensitivity analysis as requested in the third comment. revision: yes
Referee: [—] Experiments (as summarized in abstract): The reported accuracy within 0.20%-4.23% of FO baseline and up to 15.55% improvement over ZO lacks details on experimental protocol, such as choice of split point, value of Q, number of runs, and variance reporting. This makes it impossible to verify whether the gains survive different split depths or statistical tests.

Authors: We agree that more details on the experimental setup are necessary for reproducibility. In the revised version, we will expand the experimental section to specify: the split points used (after the 4th layer for OPT-125M and the 8th layer for OPT-1.3B), Q=2 for the number of function queries in ZO estimation, 5 independent runs with different random seeds, and reporting of mean and standard deviation for accuracy. We will also add results for additional split depths to verify the robustness of the gains. revision: yes
Referee: [—] Theoretical analysis: The derivation should include a sensitivity analysis w.r.t. split depth to confirm server parameters dominate the loss landscape; absent this, the rate may revert to depending on full d when the ZO variance couples back.

Authors: We thank the referee for this suggestion. To address this, we will augment the theoretical analysis with a sensitivity analysis with respect to the split depth. Specifically, we will show both theoretically and empirically that as the client-side dimension d_c is reduced by offloading more layers to the server, the convergence rate improves accordingly without the ZO variance dominating. This will include plots of convergence versus split depth and a discussion of how the server-side FO optimization dominates the loss landscape. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; convergence rate and claims are independently derived

full rationale

The paper derives an O(sqrt(d_c/TQ)) convergence rate explicitly in terms of client-side dimension d_c and iteration parameters T, Q that are defined independently of the final accuracy or memory numbers. No equation reduces the claimed rate or memory savings (3.7x) directly to a fitted hyperparameter or self-citation chain. The hybrid ZO-client/FO-server split is justified by standard ZO variance bounds and FO convergence results without smuggling an ansatz or renaming a known empirical pattern as a new theorem. Experiments report accuracy gaps (0.20%-4.23%) against baselines without the performance metric being used to define the rate itself. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard smoothness and bounded-variance assumptions for ZO and FO methods plus the modeling choice that the client partition can be treated as an independent sub-model whose dimension d_c governs the rate; no new entities are postulated and no parameters are fitted inside the convergence statement itself.

axioms (1)

domain assumption Zeroth-order gradient estimates on the client partition satisfy standard bounded-variance and smoothness conditions that allow the hybrid update to inherit the stated rate.
Invoked to obtain the O(sqrt(d_c/TQ)) bound; typical for ZO analysis but not verified in the abstract.

pith-pipeline@v0.9.0 · 5631 in / 1448 out tokens · 21408 ms · 2026-05-16T13:57:59.788478+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1]

Split learning for health: Distributed deep learning without sharing raw patient data

P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Split learning for collaborative deep learning in health- care,

M. G. Poirot, P. Vepakomma, K. Chang, J. Kalpathy-Cramer, R. Gupta, and R. Raskar, “Split learning for collaborative deep learning in health- care,”arXiv preprint arXiv:1912.12115, 2019

work page arXiv 1912
[3]

Efficient parallel split learning over resource-constrained wireless edge networks,

Z. Lin, G. Zhu, Y . Deng, X. Chen, Y . Gao, K. Huang, and Y . Fang, “Efficient parallel split learning over resource-constrained wireless edge networks,”IEEE Transactions on Mobile Computing, vol. 23, no. 10, pp. 9224–9239, 2024

work page 2024
[4]

Vflair-llm: A comprehensive framework and benchmark for split learning of llms,

Z. Gu, Q. Fan, L. Sun, Y . Liu, and X. Ye, “Vflair-llm: A comprehensive framework and benchmark for split learning of llms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 5470–5481

work page 2025
[5]

Splitfed: When federated learning meets split learning,

C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun, “Splitfed: When federated learning meets split learning,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 8485– 8493

work page 2022
[6]

Convergence analysis of split federated learning on heterogeneous data,

P. Han, C. Huang, G. Tian, M. Tang, and X. Liu, “Convergence analysis of split federated learning on heterogeneous data,”Advances in Neural Information Processing Systems, vol. 37, pp. 103 476–103 544, 2024

work page 2024
[7]

Towards a unified framework for split learning,

B. Radovi ˇc, M. Canini, S. Horv ´ath, V . Pejovi ´c, and P. Vepakomma, “Towards a unified framework for split learning,”EuroMLSys’ 25, pp. 183–191, 2025

work page 2025
[8]

Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,

J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,”IEEE transactions on automatic control, vol. 37, no. 3, pp. 332–341, 2002

work page 2002
[9]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM journal on optimization, vol. 23, no. 4, pp. 2341–2368, 2013

work page 2013
[10]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017

work page 2017
[11]

Fine-tuning language models with just forward passes,

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,”Ad- vances in Neural Information Processing Systems, vol. 36, pp. 53 038– 53 075, 2023

work page 2023
[12]

Achieving dimension-free communication in federated learning via zeroth-order optimization,

Z. Li, B. Ying, Z. Liu, C. Dong, and H. Yang, “Achieving dimension-free communication in federated learning via zeroth-order optimization,” in The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[13]

Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning,

Y . Liu, Z. Zhu, C. Gong, M. Cheng, C.-J. Hsieh, and Y . You, “Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning,” inThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2025

work page 2025
[14]

Towards straggler-resilient split federated learning: An unbalanced update ap- proach,

D. Liang, J. Zhang, E. Chen, Z. Li, R. Li, and H. Yang, “Towards straggler-resilient split federated learning: An unbalanced update ap- proach,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[15]

Reconciling hessian- informed acceleration and scalar-only communication for efficient feder- ated zeroth-order fine-tuning,

Z. Li, B. Ying, Z. Liu, C. Dong, and H. Yang, “Reconciling hessian- informed acceleration and scalar-only communication for efficient feder- ated zeroth-order fine-tuning,”arXiv preprint arXiv:2506.02370, 2025

work page arXiv 2025
[16]

Communication-efficient learning of deep networks from decentralized data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inArtificial intelligence and statistics. PMLR, 2017, pp. 1273– 1282

work page 2017
[17]

Splitlora: A split parameter-efficient fine-tuning framework for large language models,

Z. Lin, X. Hu, Y . Zhang, Z. Chen, Z. Fang, X. Chen, A. Li, P. Vepakomma, and Y . Gao, “Splitlora: A split parameter-efficient fine-tuning framework for large language models,”arXiv preprint arXiv:2407.00952, 2024

work page arXiv 2024
[18]

Communication-efficient stochastic zeroth-order optimization for fed- erated learning,

W. Fang, Z. Yu, Y . Jiang, Y . Shi, C. N. Jones, and Y . Zhou, “Communication-efficient stochastic zeroth-order optimization for fed- erated learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 5058–5073, 2022

work page 2022
[19]

Federated full- parameter tuning of billion-sized language models with communication cost under 18 kilobytes,

Z. Qin, D. Chen, B. Qian, B. Ding, Y . Li, and S. Deng, “Federated full- parameter tuning of billion-sized language models with communication cost under 18 kilobytes,” inProceedings of the 41st International Conference on Machine Learning. PMLR, 21–27 Jul 2024

work page 2024
[20]

Zeroth-order fine- tuning of LLMs with transferable static sparsity,

W. Guo, J. Long, Y . Zeng, Z. Liu, X. Yang, Y . Ran, J. R. Gardner, O. Bastani, C. D. Sa, X. Yu, B. Chen, and Z. Xu, “Zeroth-order fine- tuning of LLMs with transferable static sparsity,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[21]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Linet al., “Opt: Open pre-trained transformer language models,”arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

work page 2022
[23]

Glue: A multi-task benchmark and analysis platform for natural language understanding,

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” inProceedings of the 2018 EMNLP workshop Black- boxNLP: Analyzing and interpreting neural networks for NLP, 2018, pp. 353–355

work page 2018
[24]

Superglue: A stickier benchmark for general- purpose language understanding systems,

A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Superglue: A stickier benchmark for general- purpose language understanding systems,”Advances in neural informa- tion processing systems, vol. 32, 2019

work page 2019
[25]

Recursive deep models for semantic compositionality over a sentiment treebank,

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y . Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” inProceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642

work page 2013
[26]

The commit- mentbank: Investigating projection in naturally occurring discourse,

M.-C. De Marneffe, M. Simons, and J. Tonhauser, “The commit- mentbank: Investigating projection in naturally occurring discourse,” in proceedings of Sinn und Bedeutung, vol. 23, no. 2, 2019, pp. 107–124

work page 2019
[27]

Wic: the word-in-context dataset for evaluating context-sensitive meaning representations,

M. T. Pilehvar and J. Camacho-Collados, “Wic: the word-in-context dataset for evaluating context-sensitive meaning representations,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language tech- nologies, volume 1 (Long and short papers), 2019, pp. 1267–1273

work page 2019
[28]

A review of winograd schema challenge datasets and approaches,

V . Kocijan, T. Lukasiewicz, E. Davis, G. Marcus, and L. Morgenstern, “A review of winograd schema challenge datasets and approaches,”arXiv preprint arXiv:2004.13831, 2020

work page arXiv 2004
[29]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,”arXiv preprint arXiv:1905.10044, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[30]

A large annotated corpus for learning natural language inference,

S. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” inProceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 632–642. VII. APPENDIX A. Experiment Setting

work page 2015
[31]

For zeroth-order optimiza- tion, we fixϵ= 1e-3and use 10 perturbation vectors across all configurations

Hyperparameters:Table IV reports the hyperparameters used for split learning experiments. For zeroth-order optimiza- tion, we fixϵ= 1e-3and use 10 perturbation vectors across all configurations. Both client and server use SGD without mo- mentum. For LoRA fine-tuning, we use rankr= 8,α= 16, and apply adapters to query and value projection matrices. We eval...

work page
[32]

Table V reports tokenized sequence lengths on each dataset

Prompt:We adopt the prompt templates from [11] with- out modification. Table V reports tokenized sequence lengths on each dataset. B. GPU Memory Calculation for Split Learning with OPT- 125M

work page
[33]

Table VI de- fines all variables

Variable Definitions:We derive theoretical peak GPU memory for split learning with OPT-125M on SST-2, using sequence lengthS= 64and batch sizeB= 64. Table VI de- fines all variables. Theoretical estimates are validated against empirical measurements from our implementation

work page
[34]

The attention block has four projection matrices (Q, K, V , O), each H×H, plus biases (Eq

Per-Layer Parameter Count:The following formulas compute the parameter count in a single transformer layer. The attention block has four projection matrices (Q, K, V , O), each H×H, plus biases (Eq. 10). The FFN has two linear layers (H→d f f →H) with biases (Eq. 11). Two LayerNorms contribute4Hparameters via scale(γ)and shift(β)vectors. The total per-lay...

work page
[35]

Client Memory (ZO Optimization): a) Parameters.:Model weights for the client’s portion of the split model, including token embeddings, positional embeddings, and transformer layers. OPT uses learned po- sitional embeddings with two additional offset positions for padding and BOS tokens, resulting in(M+ 2)total position embeddings: M c params = V H|{z} tok...

work page 2048
[36]

Server Memory (FO Optimization): TABLE IV EXPERIMENTSETTINGS FORSPLITLEARNING WITHMIXEDOPTIMIZERS Model Parameter SST-2 CB WSC WIC RTE BoolQ OPT-125M+FP ZO LR 1e-6 1e-6 1e-6 1e-6 1e-6 2e-7 FO LR 1e-3 5e-4 5e-4 1e-3 5e-4 1e-3 Batch Size 64 16 64 64 16 16 Rounds 3000 1000 1000 2000 1500 1500 OPT-125M+LoRA ZO LR 5e-5 5e-5 5e-5 5e-5 1e-6 5e-5 FO LR 1e-4 1e-4 ...

work page 2000
[37]

Effect of Split Layer Position:HOSL’s accuracy- memory trade-off holds across split configurations.Ta- ble VII compares split positionsk∈ {3,5,7}for OPT-1.3B, where the client holds the firstklayers and the server handles the remaining24−klayers. Across all six tasks and all three split points, the accuracy of every method decreases gradually as more laye...

work page
[38]

All times include evaluation every 25 steps; this overhead is identical across methods

Training Time:HOSLincurs moderate time overhead compared to FO-FO while remaining faster than ZO- ZO.Table VIII reports the wall-clock training time for each optimizer configuration. All times include evaluation every 25 steps; this overhead is identical across methods. ZO-ZO incurs the highest training time due to the multiple perturbation-based forward ...

work page
[39]

In the forward direction (Client→Server), all three methods transmit the same volume, since the client always sends intermediate activations of shapeB×S×H

Communication Cost:HOSLreduces total communi- cation cost by up to 1.9×compared to FO-FO.Table IX reports the total communication volume for each optimizer configuration. In the forward direction (Client→Server), all three methods transmit the same volume, since the client always sends intermediate activations of shapeB×S×H. The key difference lies in the...

work page