HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
Pith reviewed 2026-05-16 13:57 UTC · model grok-4.3
The pith
HOSL uses zeroth-order optimization on the client and first-order on the server to cut memory use in split learning for large models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HOSL achieves an O(sqrt(d_c/TQ)) convergence rate that depends only on client-side dimension d_c, reduces client GPU memory by up to 3.7 times versus first-order split learning, and reaches accuracy within 0.20-4.23 percent of the first-order baseline while outperforming pure zeroth-order optimization by up to 15.55 percent on OPT models across six tasks.
What carries the argument
The hybrid-order partition that runs zeroth-order gradient estimation on the client-side model slice to eliminate backpropagation and activation storage, while the server applies standard first-order updates to the remaining parameters.
If this is right
- Convergence rate improves as more computation is offloaded to the server because the bound depends solely on client dimension d_c.
- Client devices can train OPT-scale models with up to 3.7 times lower GPU memory than first-order split learning.
- Accuracy remains within a few percent of full first-order performance across multiple language-modeling tasks.
- The hybrid strategy outperforms pure zeroth-order optimization in both convergence speed and final accuracy.
Where Pith is reading between the lines
- Choosing the split point to minimize client dimension could further enlarge memory savings for models larger than 1.3B parameters.
- The same client-side zeroth-order plus server-side first-order pattern may apply to other memory-constrained distributed settings such as federated learning with heterogeneous devices.
- Adding simple variance-reduction steps on the client estimates might shrink the observed accuracy gap to the first-order baseline.
- Evaluating HOSL on vision or multimodal models would test whether the hybrid-order benefit generalizes beyond language models.
Load-bearing premise
Zeroth-order gradient estimates computed only on the client-side partition remain sufficiently accurate for the server-side first-order optimizer to converge at the claimed rate.
What would settle it
Run the same OPT-125M and OPT-1.3B experiments with a higher-variance client zeroth-order estimator or a split point that increases client dimension; if accuracy drops beyond 4.23 percent or memory reduction falls below 3 times while convergence exceeds O(sqrt(d_c/TQ)), the central claim does not hold.
Figures
read the original abstract
Split learning (SL) enables collaborative training of large language models (LLMs) between resource-constrained edge devices and compute-rich servers by partitioning model computation across the network boundary. However, existing SL systems predominantly rely on first-order (FO) optimization, which requires clients to store intermediate quantities such as activations for backpropagation. This results in substantial memory overhead, largely negating benefits of model partitioning. In contrast, zeroth-order (ZO) optimization eliminates backpropagation and significantly reduces memory usage, but often suffers from slow convergence and degraded performance. In this work, we propose HOSL, a novel Hybrid-Order Split Learning framework that addresses this fundamental trade-off between memory efficiency and optimization effectiveness by strategically integrating ZO optimization on the client side with FO optimization on the server side. By employing memory-efficient ZO gradient estimation at the client, HOSL eliminates backpropagation and activation storage, reducing client memory consumption. Meanwhile, server-side FO optimization ensures fast convergence and competitive performance. Theoretically, we show that HOSL achieves an $\mathcal{O}(\sqrt{d_c/TQ})$ rate, which depends on client-side model dimension $d_c$ rather than the full model dimension $d$, demonstrating that convergence improves as more computation is offloaded to the server. Extensive experiments on OPT models (125M and 1.3B parameters) across 6 tasks demonstrate that HOSL reduces client GPU memory by up to 3.7$\times$ compared to the FO method while achieving accuracy within 0.20%-4.23% of this baseline. Furthermore, HOSL outperforms the ZO baseline by up to 15.55%, validating the effectiveness of our hybrid strategy for memory-efficient training on edge devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HOSL, a hybrid-order split learning framework for memory-constrained edge training of LLMs. It integrates zeroth-order (ZO) optimization on the client side to eliminate backpropagation and activation storage, with first-order (FO) optimization on the server side for fast convergence. The paper claims a convergence rate of O(sqrt(d_c/TQ)) that depends only on the client-side dimension d_c, up to 3.7x reduction in client GPU memory versus FO split learning, and accuracy within 0.20%-4.23% of the FO baseline while outperforming pure ZO by up to 15.55% on OPT-125M and 1.3B models across 6 tasks.
Significance. If the convergence bound is rigorously established and the experimental gains are reproducible with proper controls, this approach could be significant for enabling training of large models on resource-limited devices by leveraging the strengths of both ZO and FO methods. The theoretical result that convergence improves with more offloading to the server is potentially impactful if the decoupling holds.
major comments (3)
- Abstract: The O(sqrt(d_c/TQ)) convergence rate claim relies on the assumption that client-side ZO gradient estimates do not propagate variance through the activation interface to corrupt server-side FO optimization. No explicit bound on activation perturbation or variance reduction mechanism is provided, which is load-bearing for the claim that the rate depends solely on d_c rather than full model dimension d.
- Experiments (as summarized in abstract): The reported accuracy within 0.20%-4.23% of FO baseline and up to 15.55% improvement over ZO lacks details on experimental protocol, such as choice of split point, value of Q, number of runs, and variance reporting. This makes it impossible to verify whether the gains survive different split depths or statistical tests.
- Theoretical analysis: The derivation should include a sensitivity analysis w.r.t. split depth to confirm server parameters dominate the loss landscape; absent this, the rate may revert to depending on full d when the ZO variance couples back.
minor comments (2)
- Abstract: Clarify whether the 0.20%-4.23% accuracy range is absolute or relative difference, and specify the tasks and models for each endpoint.
- Notation: Ensure d_c, T, and Q are defined consistently when first introduced and that all equations receive numbers for cross-reference.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide our responses below. We believe these revisions will significantly improve the clarity and rigor of the paper.
read point-by-point responses
-
Referee: [—] Abstract: The O(sqrt(d_c/TQ)) convergence rate claim relies on the assumption that client-side ZO gradient estimates do not propagate variance through the activation interface to corrupt server-side FO optimization. No explicit bound on activation perturbation or variance reduction mechanism is provided, which is load-bearing for the claim that the rate depends solely on d_c rather than full model dimension d.
Authors: We appreciate the referee pointing out the need for a more explicit treatment of variance propagation in the theoretical analysis. While our derivation assumes that the ZO-induced noise on the client is controlled and the server FO optimization proceeds on the received activations, we acknowledge that a formal bound on the perturbation is missing. In the revised manuscript, we will include an additional lemma providing a bound on the activation perturbation variance, showing that it scales with the ZO estimation variance but does not alter the dependence on d_c in the convergence rate. This will be supported by a sensitivity analysis as requested in the third comment. revision: yes
-
Referee: [—] Experiments (as summarized in abstract): The reported accuracy within 0.20%-4.23% of FO baseline and up to 15.55% improvement over ZO lacks details on experimental protocol, such as choice of split point, value of Q, number of runs, and variance reporting. This makes it impossible to verify whether the gains survive different split depths or statistical tests.
Authors: We agree that more details on the experimental setup are necessary for reproducibility. In the revised version, we will expand the experimental section to specify: the split points used (after the 4th layer for OPT-125M and the 8th layer for OPT-1.3B), Q=2 for the number of function queries in ZO estimation, 5 independent runs with different random seeds, and reporting of mean and standard deviation for accuracy. We will also add results for additional split depths to verify the robustness of the gains. revision: yes
-
Referee: [—] Theoretical analysis: The derivation should include a sensitivity analysis w.r.t. split depth to confirm server parameters dominate the loss landscape; absent this, the rate may revert to depending on full d when the ZO variance couples back.
Authors: We thank the referee for this suggestion. To address this, we will augment the theoretical analysis with a sensitivity analysis with respect to the split depth. Specifically, we will show both theoretically and empirically that as the client-side dimension d_c is reduced by offloading more layers to the server, the convergence rate improves accordingly without the ZO variance dominating. This will include plots of convergence versus split depth and a discussion of how the server-side FO optimization dominates the loss landscape. revision: yes
Circularity Check
No circularity in derivation chain; convergence rate and claims are independently derived
full rationale
The paper derives an O(sqrt(d_c/TQ)) convergence rate explicitly in terms of client-side dimension d_c and iteration parameters T, Q that are defined independently of the final accuracy or memory numbers. No equation reduces the claimed rate or memory savings (3.7x) directly to a fitted hyperparameter or self-citation chain. The hybrid ZO-client/FO-server split is justified by standard ZO variance bounds and FO convergence results without smuggling an ansatz or renaming a known empirical pattern as a new theorem. Experiments report accuracy gaps (0.20%-4.23%) against baselines without the performance metric being used to define the rate itself. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Zeroth-order gradient estimates on the client partition satisfy standard bounded-variance and smoothness conditions that allow the hybrid update to inherit the stated rate.
Reference graph
Works this paper leans on
-
[1]
Split learning for health: Distributed deep learning without sharing raw patient data
P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[2]
Split learning for collaborative deep learning in health- care,
M. G. Poirot, P. Vepakomma, K. Chang, J. Kalpathy-Cramer, R. Gupta, and R. Raskar, “Split learning for collaborative deep learning in health- care,”arXiv preprint arXiv:1912.12115, 2019
-
[3]
Efficient parallel split learning over resource-constrained wireless edge networks,
Z. Lin, G. Zhu, Y . Deng, X. Chen, Y . Gao, K. Huang, and Y . Fang, “Efficient parallel split learning over resource-constrained wireless edge networks,”IEEE Transactions on Mobile Computing, vol. 23, no. 10, pp. 9224–9239, 2024
work page 2024
-
[4]
Vflair-llm: A comprehensive framework and benchmark for split learning of llms,
Z. Gu, Q. Fan, L. Sun, Y . Liu, and X. Ye, “Vflair-llm: A comprehensive framework and benchmark for split learning of llms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 5470–5481
work page 2025
-
[5]
Splitfed: When federated learning meets split learning,
C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun, “Splitfed: When federated learning meets split learning,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 8485– 8493
work page 2022
-
[6]
Convergence analysis of split federated learning on heterogeneous data,
P. Han, C. Huang, G. Tian, M. Tang, and X. Liu, “Convergence analysis of split federated learning on heterogeneous data,”Advances in Neural Information Processing Systems, vol. 37, pp. 103 476–103 544, 2024
work page 2024
-
[7]
Towards a unified framework for split learning,
B. Radovi ˇc, M. Canini, S. Horv ´ath, V . Pejovi ´c, and P. Vepakomma, “Towards a unified framework for split learning,”EuroMLSys’ 25, pp. 183–191, 2025
work page 2025
-
[8]
Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,
J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,”IEEE transactions on automatic control, vol. 37, no. 3, pp. 332–341, 2002
work page 2002
-
[9]
Stochastic first-and zeroth-order methods for nonconvex stochastic programming,
S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM journal on optimization, vol. 23, no. 4, pp. 2341–2368, 2013
work page 2013
-
[10]
Random gradient-free minimization of convex functions,
Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017
work page 2017
-
[11]
Fine-tuning language models with just forward passes,
S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,”Ad- vances in Neural Information Processing Systems, vol. 36, pp. 53 038– 53 075, 2023
work page 2023
-
[12]
Achieving dimension-free communication in federated learning via zeroth-order optimization,
Z. Li, B. Ying, Z. Liu, C. Dong, and H. Yang, “Achieving dimension-free communication in federated learning via zeroth-order optimization,” in The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[13]
Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning,
Y . Liu, Z. Zhu, C. Gong, M. Cheng, C.-J. Hsieh, and Y . You, “Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning,” inThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2025
work page 2025
-
[14]
Towards straggler-resilient split federated learning: An unbalanced update ap- proach,
D. Liang, J. Zhang, E. Chen, Z. Li, R. Li, and H. Yang, “Towards straggler-resilient split federated learning: An unbalanced update ap- proach,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[15]
Z. Li, B. Ying, Z. Liu, C. Dong, and H. Yang, “Reconciling hessian- informed acceleration and scalar-only communication for efficient feder- ated zeroth-order fine-tuning,”arXiv preprint arXiv:2506.02370, 2025
-
[16]
Communication-efficient learning of deep networks from decentralized data,
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inArtificial intelligence and statistics. PMLR, 2017, pp. 1273– 1282
work page 2017
-
[17]
Splitlora: A split parameter-efficient fine-tuning framework for large language models,
Z. Lin, X. Hu, Y . Zhang, Z. Chen, Z. Fang, X. Chen, A. Li, P. Vepakomma, and Y . Gao, “Splitlora: A split parameter-efficient fine-tuning framework for large language models,”arXiv preprint arXiv:2407.00952, 2024
-
[18]
Communication-efficient stochastic zeroth-order optimization for fed- erated learning,
W. Fang, Z. Yu, Y . Jiang, Y . Shi, C. N. Jones, and Y . Zhou, “Communication-efficient stochastic zeroth-order optimization for fed- erated learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 5058–5073, 2022
work page 2022
-
[19]
Z. Qin, D. Chen, B. Qian, B. Ding, Y . Li, and S. Deng, “Federated full- parameter tuning of billion-sized language models with communication cost under 18 kilobytes,” inProceedings of the 41st International Conference on Machine Learning. PMLR, 21–27 Jul 2024
work page 2024
-
[20]
Zeroth-order fine- tuning of LLMs with transferable static sparsity,
W. Guo, J. Long, Y . Zeng, Z. Liu, X. Yang, Y . Ran, J. R. Gardner, O. Bastani, C. D. Sa, X. Yu, B. Chen, and Z. Xu, “Zeroth-order fine- tuning of LLMs with transferable static sparsity,” inThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[21]
OPT: Open Pre-trained Transformer Language Models
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Linet al., “Opt: Open pre-trained transformer language models,”arXiv preprint arXiv:2205.01068, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Lora: Low-rank adaptation of large language models
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022
work page 2022
-
[23]
Glue: A multi-task benchmark and analysis platform for natural language understanding,
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” inProceedings of the 2018 EMNLP workshop Black- boxNLP: Analyzing and interpreting neural networks for NLP, 2018, pp. 353–355
work page 2018
-
[24]
Superglue: A stickier benchmark for general- purpose language understanding systems,
A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Superglue: A stickier benchmark for general- purpose language understanding systems,”Advances in neural informa- tion processing systems, vol. 32, 2019
work page 2019
-
[25]
Recursive deep models for semantic compositionality over a sentiment treebank,
R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y . Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” inProceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642
work page 2013
-
[26]
The commit- mentbank: Investigating projection in naturally occurring discourse,
M.-C. De Marneffe, M. Simons, and J. Tonhauser, “The commit- mentbank: Investigating projection in naturally occurring discourse,” in proceedings of Sinn und Bedeutung, vol. 23, no. 2, 2019, pp. 107–124
work page 2019
-
[27]
Wic: the word-in-context dataset for evaluating context-sensitive meaning representations,
M. T. Pilehvar and J. Camacho-Collados, “Wic: the word-in-context dataset for evaluating context-sensitive meaning representations,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language tech- nologies, volume 1 (Long and short papers), 2019, pp. 1267–1273
work page 2019
-
[28]
A review of winograd schema challenge datasets and approaches,
V . Kocijan, T. Lukasiewicz, E. Davis, G. Marcus, and L. Morgenstern, “A review of winograd schema challenge datasets and approaches,”arXiv preprint arXiv:2004.13831, 2020
-
[29]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,”arXiv preprint arXiv:1905.10044, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[30]
A large annotated corpus for learning natural language inference,
S. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” inProceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 632–642. VII. APPENDIX A. Experiment Setting
work page 2015
-
[31]
Hyperparameters:Table IV reports the hyperparameters used for split learning experiments. For zeroth-order optimiza- tion, we fixϵ= 1e-3and use 10 perturbation vectors across all configurations. Both client and server use SGD without mo- mentum. For LoRA fine-tuning, we use rankr= 8,α= 16, and apply adapters to query and value projection matrices. We eval...
-
[32]
Table V reports tokenized sequence lengths on each dataset
Prompt:We adopt the prompt templates from [11] with- out modification. Table V reports tokenized sequence lengths on each dataset. B. GPU Memory Calculation for Split Learning with OPT- 125M
-
[33]
Table VI de- fines all variables
Variable Definitions:We derive theoretical peak GPU memory for split learning with OPT-125M on SST-2, using sequence lengthS= 64and batch sizeB= 64. Table VI de- fines all variables. Theoretical estimates are validated against empirical measurements from our implementation
-
[34]
The attention block has four projection matrices (Q, K, V , O), each H×H, plus biases (Eq
Per-Layer Parameter Count:The following formulas compute the parameter count in a single transformer layer. The attention block has four projection matrices (Q, K, V , O), each H×H, plus biases (Eq. 10). The FFN has two linear layers (H→d f f →H) with biases (Eq. 11). Two LayerNorms contribute4Hparameters via scale(γ)and shift(β)vectors. The total per-lay...
-
[35]
Client Memory (ZO Optimization): a) Parameters.:Model weights for the client’s portion of the split model, including token embeddings, positional embeddings, and transformer layers. OPT uses learned po- sitional embeddings with two additional offset positions for padding and BOS tokens, resulting in(M+ 2)total position embeddings: M c params = V H|{z} tok...
work page 2048
-
[36]
Server Memory (FO Optimization): TABLE IV EXPERIMENTSETTINGS FORSPLITLEARNING WITHMIXEDOPTIMIZERS Model Parameter SST-2 CB WSC WIC RTE BoolQ OPT-125M+FP ZO LR 1e-6 1e-6 1e-6 1e-6 1e-6 2e-7 FO LR 1e-3 5e-4 5e-4 1e-3 5e-4 1e-3 Batch Size 64 16 64 64 16 16 Rounds 3000 1000 1000 2000 1500 1500 OPT-125M+LoRA ZO LR 5e-5 5e-5 5e-5 5e-5 1e-6 5e-5 FO LR 1e-4 1e-4 ...
work page 2000
-
[37]
Effect of Split Layer Position:HOSL’s accuracy- memory trade-off holds across split configurations.Ta- ble VII compares split positionsk∈ {3,5,7}for OPT-1.3B, where the client holds the firstklayers and the server handles the remaining24−klayers. Across all six tasks and all three split points, the accuracy of every method decreases gradually as more laye...
-
[38]
All times include evaluation every 25 steps; this overhead is identical across methods
Training Time:HOSLincurs moderate time overhead compared to FO-FO while remaining faster than ZO- ZO.Table VIII reports the wall-clock training time for each optimizer configuration. All times include evaluation every 25 steps; this overhead is identical across methods. ZO-ZO incurs the highest training time due to the multiple perturbation-based forward ...
-
[39]
Communication Cost:HOSLreduces total communi- cation cost by up to 1.9×compared to FO-FO.Table IX reports the total communication volume for each optimizer configuration. In the forward direction (Client→Server), all three methods transmit the same volume, since the client always sends intermediate activations of shapeB×S×H. The key difference lies in the...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.